Authors: Xiaoying Gao, Mengjie Zhang, Peter Andreae
Source: GZipped PostScript (103kb); Adobe PDF (283kb)
This paper describes a domain-independent approach for automatically constructing information extraction patterns for semi-structured web pages. Given a randomly chosen page from a web site of similarly structured pages, the system identifies a region of the page that has a regular "tabular" structure, and then infers an extraction pattern that will match the "rows" of the region and identify the data elements. The approach was tested on three corpora containing a series of tabular web sites from different domains and achieved a success rate of at least 80%. A significant strength of the system is that it can infer extraction patterns from a single training page and does not require any manual labeling of the training page.
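To make the idea concrete, the following is a minimal illustrative sketch, not the authors' algorithm: it flattens a single page into a sequence of tag and text tokens, looks for the longest run of back-to-back repeats of one token sequence (a stand-in for the "tabular" region), and treats the repeated unit as the extraction pattern for the "rows". The names `tokenize`, `find_repeated_region`, `extract_rows`, and the sample `PAGE` are all hypothetical, and the exact-repetition assumption is far stricter than what real pages would need.

```python
from html.parser import HTMLParser

# Hypothetical sample page: a heading, a three-row table, and a footer.
PAGE = """<html><body><h1>Books</h1><table>
<tr><td>Dune</td><td>$9</td></tr>
<tr><td>Emma</td><td>$7</td></tr>
<tr><td>Ulysses</td><td>$12</td></tr>
</table><p>Contact us</p></body></html>"""


class Tokenizer(HTMLParser):
    """Flatten a page into (kind, value) tokens: tags and non-blank text."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append((f"<{tag}>", None))

    def handle_endtag(self, tag):
        self.tokens.append((f"</{tag}>", None))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("TEXT", data.strip()))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def find_repeated_region(kinds, min_repeats=2):
    """Return (start, period, repeats) for the run of consecutive
    repeats that covers the most tokens; (0, 0, 0) if none is found."""
    best = (0, 0, 0)
    n = len(kinds)
    for start in range(n):
        for period in range(1, (n - start) // min_repeats + 1):
            unit = kinds[start:start + period]
            if "TEXT" not in unit:
                continue  # a row must contain at least one data element
            repeats = 1
            while kinds[start + repeats * period:
                        start + (repeats + 1) * period] == unit:
                repeats += 1
            if repeats >= min_repeats and repeats * period > best[1] * best[2]:
                best = (start, period, repeats)
    return best


def extract_rows(html):
    """Infer a row pattern from one unlabeled page and apply it to that page."""
    tokens = tokenize(html)
    kinds = [kind for kind, _ in tokens]
    start, period, repeats = find_repeated_region(kinds)
    pattern = kinds[start:start + period]
    rows = []
    for r in range(repeats):
        chunk = tokens[start + r * period:start + (r + 1) * period]
        rows.append([value for kind, value in chunk if kind == "TEXT"])
    return pattern, rows


if __name__ == "__main__":
    pattern, rows = extract_rows(PAGE)
    print(pattern)
    print(rows)
```

On the sample page, the sketch picks the three `<tr>` rows as the repeated region and returns the text of each cell as the data elements, without any labeling of the page. A practical system would also have to tolerate rows that differ slightly (optional cells, varying attributes), which this exact-match search does not attempt.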