Interactive Information Retrieval in Semi-structured Data Sets

Traditionally, database research has focused on answering precise queries over homogeneously structured data. In contrast, information retrieval research has focused on answering imprecise queries over unstructured data. Because of the current ubiquity of semi-structured information, combinations of the two approaches are becoming increasingly relevant.

For example, given the query "In which European countries can I pay with Euros?", search engines on the Web traditionally aim to return a ranked list of pointers to the web pages with the highest "relevance", that is, loosely speaking, the pages most likely to satisfy the user's information need. More recently, systems have started to exploit the structure present in the data and the query to provide a more direct answer, e.g. a list of country names with pointers to the Wikipedia pages describing those countries.

One approach to improving the performance of such systems is to develop better retrieval algorithms, so that the system directly provides better answers. High accuracy would, however, require the system to build up some "understanding" of the data, which is an extremely difficult problem, especially in an unconstrained domain such as the Web.

A potentially more viable approach is to answer the query interactively, so that the algorithms can use feedback from the user on intermediate results to find the desired answer in multiple steps. The key difference is that the user corrects the system's "guessed" interpretation of the data while working with it. The original problem of understanding the information need is thus turned into a process in which the system negotiates the solution strategy with the user, relying on the user's ability to understand intermediate results, which is potentially much better than the system's own.
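The interactive loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not part of the project: the toy entity descriptions, the term-weight scoring, and the helper names `score`, `rank` and `apply_feedback` are hypothetical, and the additive weight update is a simple stand-in for whatever feedback mechanism an actual system would use.

```python
# Toy data: each entity is described by a set of terms (hypothetical).
ENTITIES = {
    "Germany": {"country", "europe", "euro"},
    "France": {"country", "europe", "euro"},
    "Switzerland": {"country", "europe", "franc"},
    "Euro (currency)": {"currency", "europe", "euro"},
}


def score(terms, weights):
    """Sum the weights of the terms describing an entity."""
    return sum(weights.get(term, 0.0) for term in terms)


def rank(entities, weights):
    """Return entity names by descending score, ties broken by name."""
    return sorted(entities, key=lambda name: (-score(entities[name], weights), name))


def apply_feedback(weights, entities, accepted, rejected, step=1.0):
    """Raise the weights of terms on accepted entities, lower those on rejected ones."""
    updated = dict(weights)
    for name in accepted:
        for term in entities[name]:
            updated[term] = updated.get(term, 0.0) + step
    for name in rejected:
        for term in entities[name]:
            updated[term] = updated.get(term, 0.0) - step
    return updated


# Assumed initial "guess": the system picked up "europe" and "euro"
# but missed that the user asked for countries.
initial_weights = {"europe": 1.0, "euro": 1.0}
before = rank(ENTITIES, initial_weights)

# The user inspects the intermediate result, accepts one country
# and rejects the currency page.
refined_weights = apply_feedback(
    initial_weights, ENTITIES, accepted=["France"], rejected=["Euro (currency)"]
)
after = rank(ENTITIES, refined_weights)

print(before)
print(after)
```

In this toy run a single round of feedback is enough to move the countries ahead of the currency page, because the user's judgement supplies the type constraint that the initial guess missed.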

This leads to the following research questions:

  1. What characteristics of the query and the underlying data set determine which of the two approaches will lead the user to the desired result most efficiently?
  2. How can we design a system that combines both approaches and unifies state-of-the-art retrieval algorithms with effective interactive search interfaces?
  3. How can we evaluate the performance of such a system on representative query sets and realistically large data sets?

The project is carried out in the Information Systems (INS) cluster and brings together the research of Arjen P. de Vries (INS1) on entity ranking and the research of Jacco van Ossenbruggen (INS2) on user interface design for interacting with large linked data sets.