INDREX: In-database relation extraction
Citation key DBLP:journals/is/KiliasLA15
Author Torsten Kilias and Alexander Löser and Periklis Andritsos
Pages 124–144
Year 2015
DOI 10.1016/j.is.2014.11.006
Journal Inf. Syst.
Volume 53
Abstract Relation extraction transforms the textual representation of a relationship into the relational model of a data warehouse. Early systems, such as SystemT by IBM or the open source system GATE, solve this task with handcrafted rule sets that the system executes document by document. The user must therefore carry out a highly interactive and iterative process of reading a document, expressing rules, testing these rules on the next document, and refining rules. Until now, these systems have leveraged neither the full potential of built-in declarative query languages nor the indexing and query optimization techniques of a modern RDBMS, which would enable interactive rule refinement across documents and on the entire corpus. We propose the INDREX system, which for the first time enables a user to describe corpus-wide extraction tasks in a declarative language and permits the user to run interactive rule refinement queries. To enable this functionality, we extend a standard PostgreSQL with a set of white-box user-defined functions that enable corpus-wide transformations from sentences into relations. We store the text corpus and rules in the same RDBMS that already holds domain-specific structured data. As a result, (1) the user can leverage this data to further adapt rules to the target domain, (2) the user does not need an additional system for rule extraction, and (3) the INDREX system can leverage the full power of the built-in indexing and query optimization techniques of the underlying RDBMS. In a preliminary study, we report on the feasibility of this disruptive approach and show multiple queries in INDREX on the REUTERS-News’97 corpora.
