Next: Conclusion Up: Knowledge-based Information Agents Previous: Information Extraction Engine

Experimental Results

Our first information agent CASA (Classified Advertisement Search Agent) was built in 1997 [3] to search online real estate advertisements and help users to find rental property. It successfully searched for information automatically from multiple Web sites. It performed better than local search engines based on keyword matching.

An agent shell was developed by generalizing our first agent. The reusable agent framework, comprising the functions for Web access, information extraction and matching, forms the main part of the agent shell. The knowledge bases are completely separated from the framework, and only the general knowledge base is part of the agent shell. New information agents can be built by adding a new domain knowledge base and a site-specific knowledge base to the agent shell. The agent shell was successfully used to build a car classified advertisement search agent and a soccer score search agent [4].
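The separation between the shell and the knowledge bases can be pictured with a minimal sketch. All class and method names below (KnowledgeBase, Agent, AgentShell, build_agent) are hypothetical and chosen only for illustration; they are not taken from the actual system, and the knowledge bases are left empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeBase:
    """A named collection of rules; the contents are placeholders here."""
    name: str
    rules: tuple = ()


@dataclass(frozen=True)
class Agent:
    general: KnowledgeBase  # shipped with the shell
    domain: KnowledgeBase   # supplied per domain
    site: KnowledgeBase     # supplied per Web site

    def knowledge_bases(self):
        return [self.general.name, self.domain.name, self.site.name]


class AgentShell:
    """Reusable framework plus the general knowledge base (assumed structure)."""

    def __init__(self, general):
        self.general = general

    def build_agent(self, domain, site):
        # Building a new agent only adds knowledge bases; the shell is unchanged.
        return Agent(self.general, domain, site)


shell = AgentShell(KnowledgeBase("general"))
real_estate_agent = shell.build_agent(
    KnowledgeBase("real-estate"), KnowledgeBase("real-estate-sites"))
car_agent = shell.build_agent(
    KnowledgeBase("car"), KnowledgeBase("car-sites"))
```

In this sketch both agents share the same general knowledge base object, which mirrors the idea that only the domain and site-specific knowledge bases change between agents.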

Our experiments on building agents based on the framework show that:

Our agent shell can be used to build information agents for multiple domains and multiple sites. Our agent can be easily adapted or extended by modifying or extending its knowledge bases, while most current information agents are tailored to one specific domain and are difficult to scale up.
Agents built using the agent shell accept user queries written in restricted natural language, because the information extraction engine extracts specific requirements from the user query using the same method it uses to extract structured information from Web pages. The interface can therefore be as simple as that of a keyword search engine, with a single text input field. It is easy to generate, easy to use, and does not need to change for different domains. This differs from current local search engines, for which a different user interface needs to be designed for each domain.
The information agents generated using our agent shell perform better than local search engines based on keyword matching. The reason is that the information extraction engine transforms both the query and the Web pages into structured data represented as a set of knowledge units. Matching is carried out between knowledge units, which is more accurate than keyword matching.
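The difference between keyword matching and matching on knowledge units can be illustrated with a small sketch. It assumes that a knowledge unit is a name and value pair and that extraction has already been done; the function names, the unit names and the sample advertisement are all invented for illustration and do not come from the system itself.

```python
def keyword_match(query_text, ad_text):
    """Keyword matching: every query word must appear verbatim in the ad text."""
    ad_words = set(ad_text.lower().split())
    return all(word in ad_words for word in query_text.lower().split())


def unit_match(query_units, ad_units):
    """Knowledge-unit matching: every unit required by the query must be
    present in the ad with the same name and value."""
    return all(ad_units.get(name) == value
               for name, value in query_units.items())


# Raw text, as a keyword search engine would see it.
query_text = "3 bedroom house"
ad_text = "House with three bedrooms, close to shops"

# The same query and ad after (assumed) extraction into knowledge units.
query_units = {"bedrooms": 3, "type": "house"}
ad_units = {"bedrooms": 3, "type": "house", "location": "close to shops"}

keyword_result = keyword_match(query_text, ad_text)  # misses: "3" vs "three"
unit_result = unit_match(query_units, ad_units)      # matches on structure
```

Here the keyword matcher fails because the surface forms differ, while the structured comparison succeeds once both sides have been normalized into units.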

To evaluate the agent's performance on information extraction from Web pages, we tested our agent on Web pages downloaded from over 100 Web sites. This paper gives some results based on our basic corpus, which was built by downloading Web pages from 24 Web sites, 12 in the real estate advertisement domain and 12 in the car advertisement domain. Most of the Web sites were chosen from the top sites indexed by the search engine LookSmart at http://www.looksmart.com.

We evaluate our system using two parameters widely used in information extraction: precision and recall. Precision is the percentage of correct responses out of all responses. Recall is the percentage of correct responses out of the total number of correct answers. For each page, the information extraction answer keys are generated by manually correcting the output of our system. The performance of our system is evaluated by comparing the output with the answer keys.
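As a worked illustration of these two definitions, the following sketch computes precision and recall by comparing a set of system responses with an answer key. The function name and the sample units are made up for the example and are not results from the corpus.

```python
def precision_recall(responses, answer_key):
    """Return (precision, recall) as fractions between 0 and 1."""
    responses, answer_key = set(responses), set(answer_key)
    correct = responses & answer_key
    precision = len(correct) / len(responses) if responses else 0.0
    recall = len(correct) / len(answer_key) if answer_key else 0.0
    return precision, recall


# Illustrative data only: four responses, three of them correct,
# against an answer key containing five units.
system_output = {
    ("price", "250"), ("bedrooms", "2"), ("suburb", "Kelburn"), ("phone", "555"),
}
answer_key = {
    ("price", "250"), ("bedrooms", "2"), ("suburb", "Kelburn"),
    ("type", "flat"), ("furnished", "yes"),
}

precision, recall = precision_recall(system_output, answer_key)
```

With three correct responses out of four, precision is 75%; with three of the five answers found, recall is 60%.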

In order to evaluate the performance of different steps of information extraction, we calculate precision and recall for the extraction of knowledge units, knowledge unit groups, and concepts.

Knowledge unit. A knowledge unit is correct if its name and value are the same as those of the manually generated answer. The precision and recall of knowledge units indicate the ability to extract individual knowledge units from Web pages.
Knowledge unit group. We define a knowledge unit group as correct when the correct knowledge units have been put in the right concept (group), ignoring false positive and false negative knowledge units. The precision and recall of knowledge unit groups indicate the ability to group knowledge units into concepts.
Concept. A concept is considered correct when all of its knowledge units at the lower levels are correct, that is, the concept is perfect: all of its knowledge units are extracted and all extracted knowledge units are correct. The precision and recall of concepts indicate the ability to extract a ``perfect concept''.
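The difference between the group-level and concept-level criteria can be made concrete with a small sketch. It rests on several assumptions that are not stated in the text: knowledge units are name and value pairs, extracted concepts are already aligned with answer-key concepts by a shared identifier, and "ignoring false positives and false negatives" is read as checking only those extracted units that appear somewhere in the answer key. All names and data are illustrative.

```python
def group_correct(extracted, key, all_key_units):
    """Group-level check (one possible reading): among the extracted units
    that are correct somewhere in the answer key, all belong to this concept.
    Spurious and missing units are ignored."""
    correct_units = extracted & all_key_units
    return correct_units <= key


def concept_correct(extracted, key):
    """Concept-level check: the concept is 'perfect', with nothing missing
    and nothing spurious."""
    return extracted == key


answer_key = {
    "ad1": {("price", "250"), ("bedrooms", "2")},
    "ad2": {("price", "300"), ("bedrooms", "3")},
}
system_output = {
    "ad1": {("price", "250"), ("bedrooms", "2")},
    # ad2 misses the bedrooms unit and adds a spurious suburb unit.
    "ad2": {("price", "300"), ("suburb", "Karori")},
}

all_key_units = set().union(*answer_key.values())
groups_ok = {c: group_correct(system_output[c], answer_key[c], all_key_units)
             for c in answer_key}
concepts_ok = {c: concept_correct(system_output[c], answer_key[c])
               for c in answer_key}
```

In this example the second advertisement counts as a correct group, because the one correct unit it contains was placed in the right concept, but not as a correct concept, because a unit is missing and another is spurious.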

The results are given in the Information Extraction Results table. The results show that our agent performs well on multiple Web sites, including Web sites with flexible data formats such as data presented as free text in paragraphs.

Table: Information Extraction Results


Xiaoying Gao
Tue Dec 11 16:30:56 NZDT 2001