12.04.2011
K. Dießel

Paper "Classification Algorithms for Relation Prediction" presented at DaLi Workshop (ICDE 2011)

The following paper {3} was published at the DaLi Workshop of ICDE 2011 {1,2}:

Classification Algorithms for Relation Prediction

Knowledge discovery from the Web is a cyclic process. In this paper we focus on the important part of transforming unstructured information from Web pages into structured relations. Relation extraction systems capture information from natural language text on Web pages, called Web text. However, extraction is quite costly and time-consuming. Worse, many Web pages may not contain a textual representation of a relation that the extractor can capture. As a result, relation extractors process many irrelevant pages.

We propose a relation predictor to filter out irrelevant pages and substantially speed up the overall information extraction process. As a classifier, we trained a support vector machine (SVM). We evaluate pages on a sentence level, where each sentence is transformed into a token representation of shallow text features.
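
The idea of a sentence-level relation predictor can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the shallow-feature scheme (`NUM`, `CAP`, `OTHER`), the toy training sentences, the hyperparameters, and the rule that a page is forwarded as soon as one sentence scores positive. The linear SVM is trained here with a plain subgradient method on the hinge loss, which stands in for whatever SVM solver the authors used.

```python
import re

def shallow_tokens(sentence):
    """Map a sentence to coarse shallow-feature tokens (hypothetical scheme)."""
    tokens = []
    for word in re.findall(r"\w+|[^\w\s]", sentence):
        if word.isdigit():
            tokens.append("NUM")         # numbers collapse to one token
        elif word[0].isupper():
            tokens.append("CAP")         # capitalized words hint at entities
        elif word.isalpha():
            tokens.append(word.lower())  # ordinary words are kept, lowercased
        else:
            tokens.append("OTHER")       # punctuation and mixed tokens
    return tokens

def score(weights, tokens):
    """Linear decision value: sum of the weights of all token occurrences."""
    return sum(weights.get(tok, 0.0) for tok in tokens)

def train_linear_svm(examples, epochs=30, eta=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss.

    examples: list of (sentence, label) pairs with label +1 or -1.
    """
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            tokens = shallow_tokens(sentence)
            margin = label * score(weights, tokens)
            for tok in weights:          # regularization: shrink all weights
                weights[tok] *= 1.0 - eta * lam
            if margin < 1.0:             # hinge loss is active: move toward label
                for tok in tokens:
                    weights[tok] = weights.get(tok, 0.0) + eta * label
    return weights

def is_relevant_page(page_text, weights):
    """Forward a page if at least one sentence is predicted positive (assumed rule)."""
    sentences = re.split(r"(?<=[.!?])\s+", page_text.strip())
    return any(score(weights, shallow_tokens(s)) > 0.0 for s in sentences if s)

# Toy training data, invented for this sketch.
TRAIN = [
    ("Acme acquired Initech in 2011.", +1),
    ("Google acquired YouTube.", +1),
    ("Contact us for more information.", -1),
    ("All rights reserved 2011.", -1),
]
weights = train_linear_svm(TRAIN)

print(is_relevant_page("Contact us for more details. Oracle acquired Sun in 2010.", weights))
print(is_relevant_page("Contact us for more details. All rights reserved 2011.", weights))
```

Only pages for which `is_relevant_page` returns true would be handed to the costly extractor. Scoring a sentence is a single pass over its tokens, which is why such a filter can be much cheaper than extraction itself.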

We evaluate our relation predictor on 18 different relation extractors. Extractors vary in their number of attributes and their extraction domain. Our evaluation corpus contains more than six million sentences from several hundred thousand pages. We report a prediction time of tens of milliseconds per page and observe high recall across domains.

Our experimental study shows that the relation predictor effectively forwards only relevant pages to the relation extractor. We report a speedup of at least a factor of two while discarding only a minimal number of relations. If only a fixed fraction of the corpus, e.g. 10% of the pages, is processed, the predictor drastically increases recall, by a factor of five on average.
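
The fixed-budget scenario can be made concrete with a small sketch. The corpus below is invented, and both the baseline (processing pages in their original order) and the strategy of processing the highest-scoring pages first are assumptions for illustration. The numbers do not reproduce the paper's results.

```python
def recall_at_budget(pages, budget_fraction, use_predictor):
    """Fraction of all relations found when only part of the corpus is processed.

    pages: list of (predictor_score, relations_on_page) in corpus order.
    """
    total = sum(n for _, n in pages)
    k = max(1, round(len(pages) * budget_fraction))  # number of pages we can afford
    if use_predictor:
        # Process the pages the predictor considers most promising first.
        order = sorted(pages, key=lambda p: p[0], reverse=True)
    else:
        # Baseline: process pages in the order they appear in the corpus.
        order = pages
    found = sum(n for _, n in order[:k])
    return found / total if total else 0.0

# Hypothetical corpus of 20 pages; only four pages contain relations.
CORPUS = [
    (0.9, 1), (-0.5, 0), (-0.2, 0), (1.4, 2), (-1.0, 0),
    (0.1, 0), (0.7, 1), (-0.3, 0), (-0.8, 0), (1.2, 1),
    (-0.6, 0), (-0.1, 0), (-0.9, 0), (0.3, 0), (-0.4, 0),
    (-0.7, 0), (0.2, 0), (-1.1, 0), (-0.05, 0), (-0.35, 0),
]

baseline = recall_at_budget(CORPUS, 0.10, use_predictor=False)
filtered = recall_at_budget(CORPUS, 0.10, use_predictor=True)
print(f"recall without predictor: {baseline:.2f}")
print(f"recall with predictor:    {filtered:.2f}")
```

With a 10% budget only two of the twenty pages are processed, so recall depends entirely on which two are chosen. Ranking by predictor score concentrates the budget on pages that are likely to contain a relation.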

{1} http://www.icde2011.org/
{2} http://dali2011.dia.uniroma3.it/
{3} https://infra.dima.tu-berlin.de/documente/2011_04_Classification_Algorithms_for_Relation_Prediction.pdf