CS-TR-03-2: Learning Knowledge Bases for Information Extraction from Multiple Text Based Web Sites

Authors: Xiaoying Gao, Mengjie Zhang
Source: GZipped PostScript (84kb); Adobe PDF (209kb)

This paper describes a learning/adaptive approach to automatically building a knowledge base for information extraction from text-based web pages. A frame-based representation is introduced to represent domain knowledge as knowledge unit frames, and a frame learning algorithm is developed to learn these frames automatically from training examples. Some training examples can be obtained by automatically parsing a number of tabular web pages in the same domain, which greatly reduces the time-consuming manual work. The approach was investigated on ten web sites of real estate advertisements and car advertisements; nearly all the information was successfully extracted, with very few false alarms. These results suggest that both the knowledge unit frame representation and the frame learning algorithm work well, that a domain-specific knowledge base can be learned from training examples, and that such a knowledge base can be used for information extraction from flexible, text-based, semi-structured web pages on multiple web sites. An investigation of the knowledge representation on five other domains suggests that the approach can be easily applied to other domains by simply changing the training examples.
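
The abstract does not give the structure of a knowledge unit frame or the steps of the learning algorithm, so the following is only a loose illustration of the general idea: learn per-attribute frames from labelled values taken from tabular pages, then reuse them on free text. The class `KnowledgeUnitFrame`, its fields, the functions `learn_frames` and `extract`, and the sample data are all hypothetical and are not taken from the report.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeUnitFrame:
    """Hypothetical frame: one domain attribute and the values seen for it."""
    name: str
    values: set = field(default_factory=set)


def learn_frames(rows):
    """Build one frame per attribute label from (label, value) training pairs.

    The pairs stand in for cells parsed from tabular pages (an assumption).
    """
    frames = {}
    for label, value in rows:
        frame = frames.setdefault(label, KnowledgeUnitFrame(name=label))
        frame.values.add(value.lower())
    return frames


def extract(text, frames):
    """Return {attribute: value} for learned values that occur in free text.

    Uses naive substring matching, which is far simpler than any real system.
    """
    lowered = text.lower()
    found = {}
    for frame in frames.values():
        for value in sorted(frame.values):
            if value in lowered:
                found[frame.name] = value
                break
    return found


# Made-up training pairs, as if read from a tabular listing page.
training = [
    ("Suburb", "Kelburn"),
    ("Suburb", "Karori"),
    ("Type", "apartment"),
    ("Type", "house"),
]
frames = learn_frames(training)
result = extract("Sunny two-bedroom apartment in Kelburn, close to campus.", frames)
print(result)
```

Because the sketch matches plain substrings, a learned value such as "house" would also match inside "townhouse"; a usable extractor would need tokenization and context cues, which this illustration deliberately leaves out.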
