SearchIE: Activities and Results - 2019

Interactive Construction of Complex Query Models - 2019 Annual Report

Activities:
We have continued to explore automatic and interactive processes for retrieving sets of related names from text. This is an example of a task where machine learning might normally be applied, but in a context where it does not make sense to label vast amounts of data and where a search-based approach is therefore more likely to be appropriate. To allow better comparison with techniques that leverage (vastly) more training data, we have focused on situations where the extraction target is some form of named entity, such as a person or location, rather than a generic entity. We have used lists extracted from Wikipedia with some success, but have focused more on questions from the TREC “list QA” task that request a small subset of those broad named entities and so are unlikely to have much training data available (e.g., “boxers who have beaten Foreman”).
As previously reported, we have explored scenarios in which a (simulated) user provides minimal feature information, and shown that a user should be able to reach reasonable accuracy quickly in this way. We extended that work to incorporate not just sample entities but also a sentence providing context, and showed the importance of including that context (Sarwar et al., SIGIR LND4IR 2018). We have explored the impact of query expansion approaches on the task of searching for named entities, within the broader context of hierarchical and complex queries (Dalton et al., 2018). We have touched on neural network approaches to ranking problems to better understand efficiency issues as we consider applying them to our search-based “extraction” tasks; this work developed an efficient neural approximation that is at least as effective as a well-known forest-based learning-to-rank function (Cohen et al., SIGIR 2018). Finally, we extended our ideas to a less traditional task: identifying poetry in a collection of scanned books (Foley, thesis 2019).
Poetry is challenging because many assumptions of information extraction break down in that domain: capitalization, sentence structure, spelling, and even line breaks are often used for effect in ways that typical training data does not anticipate. In our work, we identify a set of new features needed for poetry classification and retrieval, observe the importance of entities for search and classification (and the difficulty of finding entities in this unusual setting), and show the effectiveness of retrieval approaches in this context.

Significant Results:
Using simulated user interaction, we showed that a context sentence including a flagged target entity provides a rich set of features for effective search-based extraction (Sarwar et al., SIGIR LND4IR 2018). We extended our work to finding and retrieving poetry and entities within poetry, which required techniques that leverage new features and apply commonly used features in new ways (Foley, thesis 2019).