CS-TR-03-3: Learning Information Extraction Patterns from Tabular Web Pages without Manual Labelling

Authors: Xiaoying Gao, Mengjie Zhang, Peter Andreae
Source: GZipped PostScript (103kb); Adobe PDF (283kb)

This paper describes a domain-independent approach for automatically constructing information extraction patterns for semi-structured web pages. Given a randomly chosen page from a web site of similarly structured pages, the system identifies a region of the page that has a regular "tabular" structure, and then infers an extraction pattern that will match the "rows" of the region and identify the data elements. The approach was tested on three corpora containing a series of tabular web sites from different domains and achieved a success rate of at least 80%. A significant strength of the system is that it can infer extraction patterns from a single training page and does not require any manual labeling of the training page.
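The two steps named in the abstract (locating a regular "tabular" region, then deriving a pattern that matches its "rows") can be illustrated with a toy sketch. This is not the algorithm from the report, whose details are not given here. It assumes, purely for illustration, that a row is an exactly repeated sequence of tag names and text slots, and it picks the repeated run that covers the most tokens. All function names (`tokenize`, `find_tabular_region`, `extract_rows`) and the sample page are hypothetical.

```python
import re

# Hypothetical tokenizer: splits a page into tags and text chunks.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Return (kind, value) tokens; kind is a bare tag like '<td>' or '#TEXT'."""
    tokens = []
    for raw in TOKEN_RE.findall(html):
        if raw.startswith("<"):
            # Keep only the tag name (with any leading slash); drop attributes.
            # Comments and doctype declarations do not match and are skipped.
            name = re.match(r"<\s*(/?\s*\w+)", raw)
            if name:
                kind = "<" + name.group(1).replace(" ", "").lower() + ">"
                tokens.append((kind, raw))
        elif raw.strip():
            tokens.append(("#TEXT", raw.strip()))
    return tokens


def find_tabular_region(kinds, min_repeats=2):
    """Return (start, period, repeats) for the repeated run covering most tokens.

    Assumption of this sketch: rows repeat exactly, token for token.
    Returns (0, 0, 0) when no repeated run containing text is found.
    """
    best = (0, 0, 0)
    n = len(kinds)
    for period in range(1, n // min_repeats + 1):
        run = 0  # consecutive positions i where kinds[i] == kinds[i + period]
        for i in range(n - period):
            run = run + 1 if kinds[i] == kinds[i + period] else 0
            repeats = run // period + 1  # whole rows in the periodic span
            if repeats >= min_repeats and repeats * period > best[1] * best[2]:
                start = i - run + 1
                # A row with no text slot carries no data elements; skip it.
                if "#TEXT" in kinds[start:start + period]:
                    best = (start, period, repeats)
    return best


def extract_rows(html):
    """Infer a row pattern from one unlabelled page and return (pattern, rows)."""
    tokens = tokenize(html)
    kinds = [kind for kind, _ in tokens]
    start, period, repeats = find_tabular_region(kinds)
    pattern = kinds[start:start + period]
    rows = []
    for r in range(repeats):
        row = tokens[start + r * period:start + (r + 1) * period]
        rows.append([value for kind, value in row if kind == "#TEXT"])
    return pattern, rows


# Made-up sample page, used only to exercise the sketch.
page = """<html><body><h1>Staff</h1>
<table>
<tr><td>Ada</td><td>Room 1</td></tr>
<tr><td>Grace</td><td>Room 2</td></tr>
<tr><td>Alan</td><td>Room 3</td></tr>
</table><p>Contact us</p></body></html>"""

pattern, rows = extract_rows(page)
print(pattern)
print(rows)
```

On this sample page the sketch selects the three table rows and returns two text fields per row, using no labels. Real pages have optional cells, nested markup, and varying row lengths, so exact token repetition would need to be relaxed to some form of approximate matching.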
