Connecting the Ephemeral and Archival Information Networks - Outcomes

Key Outcomes

This project investigated the hypothesis that the connections between ephemeral and archival information can be used to improve the analysis and search of either type. To test this hypothesis, the project developed more effective retrieval models that use both types of information for a diverse set of search and text analysis tasks. The project also developed new methods of creating links between ephemeral and archival resources, new methods of acquiring ephemeral information efficiently from social media websites, and new reusable datasets for several problems. These resources enable other researchers to reproduce our research results and conduct their own research on these problems more easily. Parts of this research were done in collaboration with the Royal Melbourne Institute of Technology (RMIT) in Australia and the Universidade Nova de Lisboa in Portugal.

Much of the research focused on improving the accuracy of search engines for several tasks. When social media discussion links to a web page, the discussion provides concise and accurate descriptions of the web page; these descriptions can be used to improve the accuracy of a search engine. Tuning text summarization algorithms to mimic discussion in social media sites improves the quality of summaries produced for answer passage retrieval. These improvements are especially important for finding answers to questions that have complex answers instead of simple ‘factoid’ answers. Improved learning-to-rank algorithms and data fusion algorithms further improve retrieval accuracy. We have also studied the impact of combining social and archival data in the environment of conversational search, which has become increasingly important during the project.
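As an illustration of what data fusion means in this setting, the sketch below merges two ranked result lists with reciprocal rank fusion, a widely used fusion technique. It is not necessarily the method used in this project: the function name, the document identifiers, the two example runs, and the constant k=60 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    k=60 is a conventional default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical runs: one from an archival (web) index, one from social media evidence.
archival_run = ["d1", "d2", "d3"]
social_run = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([archival_run, social_run])
print(fused)
```

Documents ranked highly by both sources rise to the top of the fused list, which is the intuition behind combining ephemeral and archival evidence.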

Microblog sites such as Twitter are difficult to search effectively, because messages are brief and spelling varies. This research showed that archival sources such as news websites can be used to identify the temporal scope of a query and to provide greater context for the query, which improves search accuracy and efficiency. Often microblog posts can be organized into threads or conversations – sequences of microblog posts – which are a more effective unit of information for retrieval, topic classification, and sentiment analysis tasks.

Social media such as microblogs (e.g., Twitter) and community question answering sites (e.g., Yahoo! Answers) are important sources of ephemeral information. The project developed new methods of crawling social media websites more efficiently, and showed that a 1% sample of Twitter supports the same conclusions as more complete samples for some tasks. The text collections developed by the National Institute of Standards and Technology’s annual TREC evaluations are the most widely studied archival datasets. The project enriched existing TREC datasets by adding relevance annotations that support study of the connections between answers in social media and answer passages in web documents, and by adding links from each mention of an entity in document text to its entry in the Freebase knowledge base. It also developed new methods of automatically creating datasets for research on question answering.

Our work in all of these areas has been ground-breaking and has had significant impact on both academia and industry. We have shown the value of integrating ephemeral and archival information to improve several types of search and text analysis. We have produced more than 30 conference publications and had leading roles in organizing workshops on important aspects of the research. We have also produced new datasets and social media crawling techniques that make it easier for other researchers to study these problems. Finally, we have provided advanced research training to several MS and PhD students who have gone on to jobs in the US high-tech industry.

This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF IIS-1160894 and NSF IIS-1160862).
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.