
Information Extraction Engine

The information extraction engine uses the three categories of knowledge to extract information from Web pages and save it as structured data. The figure below shows how it works.

Figure: The Information Extraction Engine

The input to the system is the source file of a Web page together with its site name (the Web site from which the page was downloaded). The output is a number of concepts, each consisting of a number of knowledge units. The system first checks whether there are site specific patterns for that site.

1) If yes, the page is parsed by the pattern matching function, which produces two kinds of output:
   a) concepts or knowledge units, which are saved directly as structured data;
   b) fields (each field contains one or more knowledge units), which are passed to the knowledge unit extraction function and then to the knowledge unit grouping function, and the result is saved as structured data.
2) If no, the page is parsed by three functions in turn: page segmentation, knowledge unit extraction and knowledge unit grouping.
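The control flow described above can be sketched as follows. This is only an illustrative outline, not the engine's actual code: `run_engine`, `MatchResult`, and the four callables standing in for F1 to F4 are hypothetical names, and the representation of patterns, fields and concepts is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class MatchResult:
    """Hypothetical container for the output of pattern matching (F1)."""
    structured: list = field(default_factory=list)  # concepts / knowledge units, ready to save
    fields: list = field(default_factory=list)      # fields that still hold raw knowledge units


def run_engine(page, site, site_patterns, match, segment, extract_units, group_units):
    """Route one page through the engine.

    `site_patterns` maps a site name to its patterns (assumed layout);
    `match`, `segment`, `extract_units` and `group_units` are caller-supplied
    stand-ins for F1, F2, F3 and F4.
    """
    patterns = site_patterns.get(site)
    if patterns:
        # Case 1: site specific patterns exist, so use pattern matching.
        result = match(page, patterns)
        saved = list(result.structured)      # 1a: saved directly
        if result.fields:                    # 1b: fields go through F3, then F4
            units = [u for f in result.fields for u in extract_units(f)]
            saved.extend(group_units(units))
        return saved
    # Case 2: no site specific patterns, so use F2, then F3, then F4.
    lines = segment(page)
    units = [u for line in lines for u in extract_units(line)]
    return group_units(units)
```

Passing the four functions in as arguments keeps the routing logic separate from the knowledge each function relies on, which mirrors the separation between the engine and the knowledge bases.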

The four functions and the categories of knowledge they use are detailed as follows:

Pattern matching (F1): this function utilizes site specific knowledge, including site specific information extraction patterns for concepts, patterns for individual knowledge units and patterns for a group of knowledge units (fields), to parse a page into concepts, knowledge units and fields. The concepts and knowledge units are then saved as structured data and the fields are passed to the next function.
Page segmentation (F2): this function uses general knowledge of HTML tag usage and, if available, site specific knowledge of HTML usage to parse a page into ``lines'', each of which can be a line of text, a table row or a list item. It consists of four main steps: 1) representing a page as a character list, 2) rewriting the list using two kinds of token: tags and text, 3) classifying the tags into groups according to the page structure they represent, such as ``word'', ``line'' and ``paragraph'', 4) segmenting the page into ``lines''.
Knowledge unit extraction (F3): this function uses either domain knowledge of knowledge unit identification and extraction or site specific information extraction patterns for knowledge units, if available, to extract knowledge units from each ``line''.
Knowledge unit grouping (F4): this function groups knowledge units into concepts. It uses two kinds of knowledge:
- general knowledge of HTML usage or, if available, site specific knowledge of HTML usage, and
- domain knowledge of a concept hierarchy or, if available, site specific knowledge of a concept hierarchy.
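
As an illustration of the four steps of page segmentation (F2), the sketch below splits an HTML source string into ``lines''. It is a minimal sketch under stated assumptions: the membership of each tag group in `TAG_GROUPS`, the treatment of unknown tags as ``word'' tags, and the function names are all assumed here and are not taken from the system itself.

```python
import re

# Assumed tag groups; the text names the groups but not their members.
TAG_GROUPS = {
    "word": {"b", "i", "em", "strong", "a", "font", "span"},
    "line": {"br", "tr", "li", "dt", "dd"},
    "paragraph": {"p", "div", "table", "ul", "ol", "dl", "hr", "h1", "h2", "h3"},
}


def tokenize(chars):
    """Step 2: rewrite a character list as ("tag", name) and ("text", string) tokens."""
    tokens, buf, in_tag = [], [], False
    for ch in chars:
        if ch == "<" and not in_tag:
            if buf:
                tokens.append(("text", "".join(buf)))
            buf, in_tag = [], True
        elif ch == ">" and in_tag:
            # Keep only the tag name; closing tags map to the same name.
            m = re.match(r"/?\s*([A-Za-z][A-Za-z0-9]*)", "".join(buf))
            tokens.append(("tag", m.group(1).lower() if m else ""))
            buf, in_tag = [], False
        else:
            buf.append(ch)
    if buf and not in_tag:
        tokens.append(("text", "".join(buf)))
    return tokens


def classify(tag):
    """Step 3: map a tag name to its structural group (unknown tags count as "word")."""
    for group, names in TAG_GROUPS.items():
        if tag in names:
            return group
    return "word"


def segment(page):
    """Steps 1 to 4: split an HTML source string into "lines"."""
    chars = list(page)                        # step 1: character list
    lines, current = [], []
    for kind, value in tokenize(chars):       # step 2: tag and text tokens
        if kind == "text":
            if value.strip():
                current.append(" ".join(value.split()))
        elif classify(value) != "word":       # step 3: "line" or "paragraph" tag
            if current:                       # step 4: close the current "line"
                lines.append(" ".join(current))
                current = []
    if current:
        lines.append(" ".join(current))
    return lines
```

For example, `segment("<table><tr><td>A</td><td>B</td></tr><tr><td>C</td></tr></table>")` yields one ``line'' per table row, because `tr` is in the assumed ``line'' group while `td` falls through to ``word''.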


Xiaoying Gao
Tue Dec 11 16:30:56 NZDT 2001