A. Borusan

26.09.2012, 11:00 a.m. sharp, TU Berlin, EN building, seminar room EN 719 (7th floor), Einsteinufer 17, 10587 Berlin: "Two-Phase Entity Resolution" (Prof. Jeffrey Naughton, University of Wisconsin-Madison)

Entity resolution is the process of deciding which pairs of records in a database refer to the same real-world entity. This problem has received a great deal of attention, and a number of powerful techniques have been proposed. In our work we consider a simple but commonly used approach: developers are presented with pairs of records and are asked to provide rules that determine whether the records refer to the same or different entities. Our contribution is that we consider a very rigid and apparently inflexible way of defining and applying the rules: in phase one, we only consider rules that indicate when records refer to different entities; in phase two, we only consider rules that indicate when records refer to the same entity. We show that this approach has a number of advantages, specifically that the rule applications within a phase are associative and commutative, and that powerful automatic techniques are available to suggest pairs of records for developers to inspect. Perhaps surprisingly, despite the inflexible process for rule definition and application, the results on some benchmark datasets are encouraging. We will conclude with a discussion of future challenges and what might need to be done to evaluate how well this approach works in practice. This is joint work with Sun Chong and AnHai Doan.
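
For readers unfamiliar with the setup, the two-phase idea can be illustrated with a minimal Python sketch. It is not the speakers' implementation: the toy records, the rule functions (`different_birth_year`, `same_email`, `same_name`), the `resolve` helper, and the assumption that a pair marked "different" in phase one is excluded from phase two are all hypothetical choices made for illustration.

```python
from itertools import combinations

# Hypothetical toy records; the fields are illustrative only.
records = [
    {"name": "J. Smith", "email": "js@example.com", "year": 1970},
    {"name": "John Smith", "email": "js@example.com", "year": 1970},
    {"name": "John Smith", "email": "john@example.org", "year": 1985},
]

# Phase-one rule: returns True when two records must be DIFFERENT entities.
def different_birth_year(a, b):
    return a["year"] != b["year"]

# Phase-two rules: return True when two records are the SAME entity.
def same_email(a, b):
    return a["email"] == b["email"]

def same_name(a, b):
    return a["name"] == b["name"]

def resolve(records, different_rules, same_rules):
    """Apply all 'different' rules first, then all 'same' rules.

    Assumption: a pair marked different in phase one is never
    reconsidered in phase two.
    """
    pairs = list(combinations(range(len(records)), 2))
    # Phase one: a pair is marked different if ANY 'different' rule fires.
    different = {
        (i, j) for i, j in pairs
        if any(rule(records[i], records[j]) for rule in different_rules)
    }
    # Phase two: a remaining pair is marked same if ANY 'same' rule fires.
    same = {
        (i, j) for i, j in pairs
        if (i, j) not in different
        and any(rule(records[i], records[j]) for rule in same_rules)
    }
    return different, same

different, same = resolve(records, [different_birth_year], [same_email, same_name])
```

Because each phase in this sketch only takes the union of the pairs its rules select, the order in which rules are listed within a phase does not change the result, which mirrors the order-independence property mentioned in the abstract. Note that records 1 and 2 share a name, yet the phase-one rule keeps them apart.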