Next: Information Extraction Engine
Up: Knowledge-based Information Agents
Previous: Agent Architecture
Focusing on information extraction from semi-structured data, we summarize
the main components of the three categories of knowledge as follows:
- General Knowledge (G)
  - The usage of HTML tags, in particular the page structure level
    (word, line, paragraph, or page) that each tag is linked to.
  - Methods for identifying basic data types such as tag, text, and
    character.
  - Common-sense knowledge, for example that related data are often
    presented together.
- Domain Specific Knowledge (D)
  - The concepts and the relationships between them. For example, in the
    real estate domain, the concepts include "real estate ad", "property",
    "suburb", "price", "size", "type", etc. The concepts can be arranged in
    a hierarchy: for example, a "real estate ad" consists of a number of
    "property" concepts, and each property consists of "suburb", "price",
    "size", and "type". The concepts at the lowest level (the atomic
    concepts) are called knowledge units (KU) in this paper.
  - How to identify the knowledge units (atomic concepts) and how to
    extract their values. The major components are domain specific
    terminology (for example, in the real estate domain, a suburb database
    can be used to identify the Suburb of each property from online
    advertisements) and domain specific data formatting conventions (for
    example, the Price for renting a property is usually a "$", followed by
    a number and a unit such as "per week" or "per month").
- Site Specific Knowledge (S)
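As an illustration of the domain specific knowledge described above, the following is a minimal Python sketch, not the implementation used in this paper. The names `CONCEPT_HIERARCHY`, `SUBURB_DB`, `PRICE_PATTERN`, `knowledge_units`, `extract_suburb`, and `extract_price` are hypothetical, the suburb names are placeholder data, and the regular expression is only one assumed encoding of the "$, number, unit" convention.

```python
import re

# Hypothetical concept hierarchy for the real estate domain. A concept that
# does not appear as a key is treated as atomic, i.e. a knowledge unit (KU).
CONCEPT_HIERARCHY = {
    "real estate ad": ["property"],
    "property": ["suburb", "price", "size", "type"],
}

# Placeholder stand-in for a suburb database (illustrative names only).
SUBURB_DB = {"Kelburn", "Karori", "Newtown"}

# Assumed price convention: "$", a number, then "per week" or "per month".
PRICE_PATTERN = re.compile(
    r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*(per week|per month)", re.IGNORECASE
)


def knowledge_units(concept, hierarchy=CONCEPT_HIERARCHY):
    """Return the atomic concepts (knowledge units) found below a concept."""
    parts = hierarchy.get(concept)
    if parts is None:
        return [concept]  # no sub-concepts, so the concept itself is a KU
    units = []
    for part in parts:
        units.extend(knowledge_units(part, hierarchy))
    return units


def extract_suburb(text, suburbs=SUBURB_DB):
    """Return the first known suburb (in alphabetical order) found in the text."""
    for suburb in sorted(suburbs):
        if re.search(r"\b" + re.escape(suburb) + r"\b", text, re.IGNORECASE):
            return suburb
    return None


def extract_price(text):
    """Return (amount, unit) if the text matches the price convention."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    amount = float(match.group(1).replace(",", ""))
    return amount, match.group(2).lower()


ad = "Sunny 2 bedroom flat in Kelburn, $250 per week"
print(knowledge_units("real estate ad"))
print(extract_suburb(ad), extract_price(ad))
```

Keeping the terminology (the suburb set) and the formatting convention (the price pattern) as separate data means either one can be replaced without touching the extraction functions.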
The three categories of knowledge have different priorities when they are
used for information extraction. From highest to lowest, the priorities are:
- Site specific knowledge (S)
- Domain specific knowledge (D)
- General knowledge (G)
During the information extraction process, site specific knowledge has the
highest priority and general knowledge has the lowest. When the knowledge
sources conflict, the higher priority knowledge overrides the lower priority
knowledge. When processing a particular site, we search for site specific
knowledge first. If site specific knowledge is found, it is used instead of
the corresponding knowledge in the general knowledge base or the domain
knowledge base. For example, if a site specific information extraction
pattern is found for a particular knowledge unit, then that pattern is used
to extract the knowledge unit instead of the more general pattern in the
domain knowledge base.
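
The override rule can be sketched as a lookup that consults the knowledge bases in priority order and returns the first match. This is only a sketch under assumptions: the dictionary layout, the site name `example-site`, the sample patterns, and the function `resolve_pattern` are all hypothetical and are not taken from the system described in this paper.

```python
# Knowledge sources in decreasing priority:
# site specific (S), domain specific (D), general (G).
PRIORITY = ("S", "D", "G")

# Hypothetical knowledge bases mapping a knowledge unit to an extraction
# pattern. Site specific knowledge is additionally keyed by site name.
KNOWLEDGE_BASES = {
    "G": {"price": r"\d+"},
    "D": {
        "price": r"\$\s*\d+\s*(?:per week|per month)",
        "size": r"\d+\s*bedrooms?",
    },
    "S": {"example-site": {"price": r"Rent:\s*\$\d+\s*pw"}},
}


def resolve_pattern(knowledge_unit, site, knowledge_bases=KNOWLEDGE_BASES):
    """Return (source, pattern) from the highest-priority base that has one."""
    candidates = {
        "S": knowledge_bases.get("S", {}).get(site, {}),
        "D": knowledge_bases.get("D", {}),
        "G": knowledge_bases.get("G", {}),
    }
    for source in PRIORITY:
        pattern = candidates[source].get(knowledge_unit)
        if pattern is not None:
            return source, pattern  # higher priority overrides lower
    return None  # no knowledge base covers this knowledge unit


# The site specific pattern wins where it exists ...
print(resolve_pattern("price", "example-site"))
# ... and the domain pattern is used for sites without their own knowledge.
print(resolve_pattern("price", "another-site"))
```

Because the lookup stops at the first source that has a pattern, adding knowledge for a new site never requires changing the domain or general knowledge bases.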
Xiaoying Gao
Tue Dec 11 16:30:56 NZDT 2001