
Three Knowledge Bases

Focusing on information extraction from semi-structured data, we summarize the main components of the three categories of knowledge as follows:

General Knowledge (G)
- The usage of HTML tags, in particular the page structure level (word, line, paragraph, or page) that each tag is associated with.
- Methods for identifying basic data types such as tags, text, and characters.
- Common-sense knowledge, for example that related data are often presented together.
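
As a rough illustration of the first item, general knowledge about tags could be held as a lookup table from tag name to page structure level. The assignments below are assumptions made for illustration (the paper does not list them), and `TAG_LEVEL` and `structure_level` are hypothetical names.

```python
# Hypothetical mapping from HTML tag to the page structure level it affects.
# The assignments are illustrative guesses, not taken from the paper.
TAG_LEVEL = {
    "b": "word", "i": "word", "font": "word",
    "br": "line", "li": "line",
    "p": "paragraph", "tr": "paragraph", "hr": "paragraph",
    "html": "page", "body": "page",
}


def structure_level(tag: str) -> str:
    """Return the structure level for a tag, or "unknown" if it is not listed."""
    # Accept both bare names ("br") and raw tags ("<BR>", "</p>").
    name = tag.lower().strip("<>/ ")
    return TAG_LEVEL.get(name, "unknown")
```

A table like this lets an extraction engine decide, for instance, that a line break separates lines within a record while a paragraph tag may separate records.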
Domain Specific Knowledge (D)
- The concepts and the relationships between them. For example, in the real estate domain, the concepts include ``real estate ad'', ``property'', ``suburb'', ``price'', ``size'', ``type'', etc. The concepts can be arranged in a hierarchy: a ``real estate ad'' consists of a number of ``property'' entries, and each property consists of ``suburb'', ``price'', ``size'', and ``type''. The concepts at the lowest level (the atomic concepts) are called knowledge units (KU) in this paper.
- How to identify the knowledge units (atomic concepts) and how to extract their values. The major components are the domain specific terminology (for example, in the real estate domain, a suburb database can be used to identify the Suburb of each property in online advertisements) and the domain specific data formatting conventions (for example, the Price for renting a property is usually a ``$'' followed by a number and a unit such as ``per week'' or ``per month'').
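
The two items above can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the names `HIERARCHY`, `SUBURBS`, `PRICE_RE`, `knowledge_units`, and `extract` are hypothetical, the suburb list stands in for a real suburb database, and the regular expression is one possible encoding of the price convention, not the pattern used by the system.

```python
import re

# Concept hierarchy from the real estate example: an ad holds properties,
# and each property is made up of four knowledge units (atomic concepts).
HIERARCHY = {
    "real estate ad": ["property"],
    "property": ["suburb", "price", "size", "type"],
}

# Stand-in for a suburb database (illustrative entries only).
SUBURBS = ["Carlton", "Richmond", "Brunswick"]

# Price convention: "$", a number, then a unit such as "per week".
PRICE_RE = re.compile(r"\$\s*(\d[\d,]*)\s*(per week|per month)", re.IGNORECASE)


def knowledge_units(concept):
    """Return the atomic concepts (leaves of the hierarchy) below a concept."""
    children = HIERARCHY.get(concept)
    if not children:
        return [concept]
    return [ku for child in children for ku in knowledge_units(child)]


def extract(text):
    """Extract the suburb and price knowledge units from one ad, if present."""
    result = {}
    for suburb in SUBURBS:
        # Terminology lookup: match a known suburb name as a whole word.
        if re.search(r"\b" + re.escape(suburb) + r"\b", text, re.IGNORECASE):
            result["suburb"] = suburb
            break
    match = PRICE_RE.search(text)
    if match:
        amount = int(match.group(1).replace(",", ""))
        result["price"] = (amount, match.group(2).lower())
    return result
```

For example, `extract("CARLTON 2 br flat, $250 per week")` yields the suburb `"Carlton"` and the price `(250, "per week")`; size and type would need their own terminology or formatting rules.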
Site Specific Knowledge (S)
- Site specific knowledge of the interface of each Web site. Most semi-structured data is presented as the search results of a site's local search engine, so to interact with that search engine the system needs to know its interface. For example, if the interface is an HTML form, the system needs to know the access method (``Get'' or ``Post'') and how to generate query strings.
- The information extraction patterns for fields (groups of knowledge units).
- Site specific information extraction patterns for concepts.
- Site specific usage of HTML tags.
- Site specific concept hierarchy.
- Site specific information extraction patterns for individual knowledge units.
  The site specific knowledge base does not have to be complete: all items except the first are optional. Site specific knowledge is used to describe special sites that cannot be described by domain specific knowledge alone.
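
The first item, the only mandatory one, might be recorded and used as sketched below. The `SITE` dictionary, its URL, and its form field names are invented for illustration, and `build_request` is a hypothetical helper; only `urllib.parse.urlencode` is a real library call.

```python
from urllib.parse import urlencode

# Hypothetical interface description for one site; the URL and the
# form field names are made up for this example.
SITE = {
    "action": "http://example.com/search",
    "method": "GET",
    # Maps a domain concept to the name of the form field on this site.
    "fields": {"suburb": "sub", "max_price": "maxp"},
}


def build_request(site, query):
    """Return (method, url, body) for a query expressed in domain concepts."""
    # Translate concept names into this site's form field names.
    params = {site["fields"][k]: v for k, v in query.items() if k in site["fields"]}
    encoded = urlencode(params)
    if site["method"].upper() == "GET":
        # GET: the query string is appended to the form's action URL.
        return "GET", site["action"] + "?" + encoded, None
    # POST: the same encoded string is sent as the request body.
    return "POST", site["action"], encoded.encode("ascii")
```

Keeping the access method and field names as data means a new site can be supported by adding a description rather than new code.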

The three categories of knowledge have different priorities when they are used for information extraction. From highest to lowest, the priorities are:

Site specific knowledge (S)
Domain specific knowledge (D)
General knowledge (G)

During the information extraction process, site specific knowledge has the highest priority and general knowledge has the lowest. When items of knowledge conflict, the higher priority knowledge overrides the lower priority knowledge. When processing a particular site, we search for site specific knowledge first. If some site specific knowledge is found, it is used instead of the corresponding knowledge in the general or domain knowledge base. For example, if a site specific information extraction pattern is found for a particular knowledge unit, that pattern is used to extract the knowledge unit instead of the more general pattern in the domain knowledge base.
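
One simple way to realize this override rule is a layered lookup that consults S, then D, then G. The sketch below uses Python's `collections.ChainMap` for that purpose; the three dictionaries and their patterns are made-up examples, `pattern_for` is a hypothetical helper, and the paper does not say the system is implemented this way.

```python
from collections import ChainMap

# Illustrative pattern stores keyed by knowledge unit; contents are made up.
general = {"price": r"\d+"}
domain = {
    "price": r"\$\s*\d+\s*per (week|month)",
    "suburb": "<suburb database lookup>",
}
site = {"price": r"\$\d+\s*pw"}  # assume this site abbreviates "per week"


def pattern_for(unit, site_kb, domain_kb, general_kb):
    """Return the highest priority pattern available for a knowledge unit."""
    # ChainMap searches its maps left to right, so an entry in the site
    # knowledge base shadows the domain entry, which shadows the general one.
    return ChainMap(site_kb, domain_kb, general_kb)[unit]
```

With these stores, the price pattern comes from the site knowledge base, while the suburb pattern falls through to the domain knowledge base because the site defines none. Passing an empty site knowledge base reproduces the behaviour for a site that needs no special treatment.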


Xiaoying Gao
Tue Dec 11 16:30:56 NZDT 2001