Connecting the Ephemeral and Archival Information Networks

A Collaborative Project with University of Massachusetts Amherst, Carnegie Mellon University, and RMIT University

University of Massachusetts Amherst:
W. Bruce Croft

Carnegie Mellon University:
Jamie Callan

RMIT University:
Mark Sanderson

Project Award Information

National Science Foundation Award Numbers: IIS-1160894 and IIS-1160862
Duration: 08/01/12 - 07/31/16

Project Summary

The increasing use of social applications has created a huge reservoir of information that is regarded as “ephemeral”. That is, much of the information is generated spontaneously by users as part of conversations with friends and acquaintances, with little thought about style, format, or the lasting value of the message. This can be compared to “archival” sources such as Wikipedia or news sites that contain information that is designed to be read over long periods by large audiences, and is generated by professional writers working within editorial guidelines. Most information that people generate and use, however, lies between these two extremes characterized by the time scale of the information, the quantity of the information, and the quality of information. Ephemeral and archival information are complementary, in that ephemeral information can provide different perspectives, additional detail, opinion, and real-time reactions about archival information, and archival information provides background and authoritative sources for conversations in social applications. The hypothesis that we propose to address is that, because of the context provided, searching either ephemeral or archival information will be enhanced using the connections between them. In other words, we will use the explicit and implicit links between the ephemeral and archival networks to improve the effectiveness of search that is targeted at social data, web data, or both. We will demonstrate the validity of our hypothesis using a range of existing TREC tasks focused on either social media search or web search. In addition, we will explore two new tasks, conversation search and aggregated social search, which can exploit the integrated network of ephemeral and archival information.

To pursue this research, we will build a testbed consisting of ephemeral and archival information from similar time periods. Our team has had considerable experience building collections and distributing open source software as part of the Lemur Project. The new testbed, ClueWeb12++, will add ephemeral data to the Lemur web collection ClueWeb12 currently being created. The ephemeral data will be a combination of blog, forum, and microblog data. We will also need to do additional data annotations to adapt the TREC tasks of interest (blog, microblog, and web) to the new testbed. We will develop new retrieval models for the TREC tasks based on important “contexts” in the ephemeral/archival network. For each of these contexts, we can derive features based on text content, structure, and metadata such as time, opinion, and persons. We will study both language model and linear feature-based approaches to developing effective ranking functions. We will also develop and evaluate models and representations that support more effective retrieval based on information “chunks” that are appropriate for social media. In particular, we will target an intermediate level of information unit in the form of “conversations” or “events”. Finally, we plan to study new approaches for generating “aggregated” rankings of many types of ephemeral and archival information, based on our previous work with aggregated search in the web context.

This research involves the definition and evaluation of new representations and models for searching both social and archival information. Although we and others have done previous research related to some aspects of this proposal, this is the first work to address the full possibilities of search that exploits all the connections and contexts created by bringing together the two “worlds” of information.

Broader impact: Research in this area will have a direct impact on the tools for searching social media and web data. Given the ever-increasing usage of these tools and the relatively poor current state of the art in social search, this could have a very broad impact, both in the home and the office. Our groups have demonstrated that research in this area attracts undergraduates and women. We also have a long record of successful distribution of open-source software and research results to academia and industry.


Publications

Lee, C. and Croft, W. B., "Incorporating Social Anchors for Ad Hoc Retrieval," in the 10th International Conference in the RIAO series (OAIR 2013), Lisbon, Portugal, May 22-24, 2013.

Lee, C. and Croft, W. B., "Building a Web Test Collection using Social Media," in the 36th Annual ACM SIGIR Conference (SIGIR 2013), Dublin, Ireland, July 28-Aug 1, 2013, pp. 757-760.

Kim, Y., Yeniterzi, R., and Callan, J., "Overcoming Vocabulary Limitations in Twitter Microblogs," Proceedings of the Twenty-First Text REtrieval Conference (TREC 2012), National Institute of Standards and Technology, special publication 500-298, 2013.

Keikha, M., Park, J. and Croft, W. B., "Evaluating Answer Passages using Summarization Measures," Proceedings of the 37th Annual ACM SIGIR Conference, Gold Coast, Australia, 6-11 July 2014, pp. 963-966.

Lee, C., Ai, Q., Croft, W. B. and Sheldon, D., "An Optimization Framework for Merging Multiple Result Lists," in Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), Melbourne, Australia, pp. 303-312.

Yang, L., Ai, Q., Spina, D., Chen, R., Pang, L., Croft, W. B., Guo, J. and Scholer, F., "Beyond Factoid QA: Effective Methods for Non-factoid Answer Sentence Retrieval," in Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy, March 20-23, 2016, pp. 115-128.

Ai, Q., Yang, L., Guo, J. and Croft, W. B., "Improving Language Estimation with the Paragraph Vector Model for Ad-hoc Retrieval," to appear in the Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), Pisa, Italy.

This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF IIS-1160894 and NSF IIS-1160862).
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.