C. Boden

Paper "FactCrawl: A Fact Retrieval Framework for Full-Text Indices" accepted at the WebDB Workshop @ SIGMOD 2011

The paper "FactCrawl: A Fact Retrieval Framework for Full-Text Indices" by Christoph Boden, Alexander Löser, Christoph Nagel and Stephan Pieper has been accepted at the 14th International Workshop on the Web and Databases (WebDB 2011), co-located with the SIGMOD 2011 conference. (PDF version)

We present FactCrawl, a framework for retrieving structured, factual information by leveraging the full-text index of a search engine. The framework applies an approximation algorithm to solve the problem of retrieving all facts in a document collection using a minimal set of keywords while minimizing cost. The search engine is queried with automatically generated keywords, the results are re-ranked according to our fact score, and the top-ranked documents are forwarded to a fact extractor. Keywords are determined using structural, syntactic, lexical and semantic information from sample documents. We estimate the fact score of a document by combining the observations of keywords in the document. We report results of an experimental evaluation over 20 fact extractors on a Reuters NIST corpus with 731,752 pages. Our experiments demonstrate that FactCrawl more than doubles recall in an online query scenario and nearly halves processing costs in an archive scenario, compared to existing approaches.
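
To make the re-ranking step more concrete, here is a minimal sketch of scoring documents by the keywords observed in them and forwarding only the best candidates to an extractor. It is an illustration under stated assumptions, not the method from the paper: the abstract does not say how keyword observations are combined, so the additive rule below, the names `fact_score` and `rerank`, the keyword weights, and the sample documents are all hypothetical.

```python
from typing import Dict, List, Tuple


def fact_score(text: str, keyword_weights: Dict[str, float]) -> float:
    """Combine keyword observations into one score.

    ASSUMPTION: the score is the sum of the weights of all keywords that
    occur in the document. The paper may combine observations differently.
    """
    tokens = set(text.lower().split())
    return sum(w for kw, w in keyword_weights.items() if kw in tokens)


def rerank(
    docs: Dict[str, str],
    keyword_weights: Dict[str, float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, highest score first.

    Ties are broken by document id so the ordering is deterministic.
    """
    scored = [(doc_id, fact_score(text, keyword_weights))
              for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical keyword weights, e.g. learned from sample documents.
keyword_weights = {"acquired": 0.5, "merger": 0.25, "ceo": 0.125}

# Hypothetical search results for a generated keyword query.
docs = {
    "d1": "Acme acquired Widget Corp in a merger deal",
    "d2": "Weather remains mild this week",
    "d3": "The CEO announced quarterly results",
}

# Only the top-ranked documents would be forwarded to the fact extractor.
ranked = rerank(docs, keyword_weights, top_k=2)
print(ranked)
```

In this sketch the document without any keyword hit scores zero and is dropped before extraction, which is the intuition behind saving processing cost: the extractor only runs on documents that are likely to contain a fact.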