A Collaborative Project with University of Massachusetts Amherst, Carnegie Mellon University, and RMIT University
University of Massachusetts Amherst:
W. Bruce Croft
Carnegie Mellon University:
Project Award Information
The increasing use of social applications has created a huge reservoir of information that is regarded as “ephemeral”. That is, much of the information is generated spontaneously by users as part of conversations with friends and acquaintances, with little thought about style, format, or the lasting value of the message. This can be compared to “archival” sources such as Wikipedia or news sites that contain information that is designed to be read over long periods by large audiences, and is generated by professional writers working within editorial guidelines. Most information that people generate and use, however, lies between these two extremes characterized by the time scale of the information, the quantity of the information, and the quality of information. Ephemeral and archival information are complementary, in that ephemeral information can provide different perspectives, additional detail, opinion, and real-time reactions about archival information, and archival information provides background and authoritative sources for conversations in social applications. The hypothesis that we propose to address is that, because of the context provided, searching either ephemeral or archival information will be enhanced using the connections between them. In other words, we will use the explicit and implicit links between the ephemeral and archival networks to improve the effectiveness of search that is targeted at social data, web data, or both. We will demonstrate the validity of our hypothesis using a range of existing TREC tasks focused on either social media search or web search. In addition, we will explore two new tasks, conversation search and aggregated social search, which can exploit the integrated network of ephemeral and archival information.
To pursue this research, we will build a testbed consisting of ephemeral and archival information from similar time periods. Our team has had considerable experience building collections and distributing open source software as part of the Lemur Project. The new testbed, ClueWeb12++, will add ephemeral data to the Lemur web collection ClueWeb12 currently being created. The ephemeral data will be a combination of blog, forum, and microblog data. We will also need to do additional data annotations to adapt the TREC tasks of interest (blog, microblog, and web) to the new testbed. We will develop new retrieval models for the TREC tasks based on important “contexts” in the ephemeral/archival network. For each of these contexts, we can derive features based on text content, structure, and metadata such as time, opinion, and persons. We will study both language model and linear feature-based approaches to developing effective ranking functions. We will also develop and evaluate models and representations that support more effective retrieval based on information “chunks” that are appropriate for social media. In particular, we will target an intermediate level of information unit in the form of “conversations” or “events”. Finally, we plan to study new approaches for generating “aggregated” rankings of many types of ephemeral and archival information, based on our previous work with aggregated search in the web context.
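As a rough illustration of the linear feature-based approach mentioned above, the sketch below scores each archival document as a weighted sum of features and sorts by that score. The feature names (`text_match`, `social_anchor_match`, `recency`), the weights, and the document identifiers are all hypothetical and chosen only for illustration; they are not the project's actual features or learned parameters.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    doc_id: str
    features: Dict[str, float]  # hypothetical feature name -> feature value


def linear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values; features with no weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rank(candidates: List[Candidate], weights: Dict[str, float]) -> List[Candidate]:
    """Order candidates from highest to lowest linear score."""
    return sorted(candidates, key=lambda c: linear_score(c.features, weights), reverse=True)


# Hypothetical weights; in practice these would be learned from relevance judgments.
WEIGHTS = {"text_match": 1.0, "social_anchor_match": 0.5, "recency": 0.25}

# Hypothetical candidates: one page with no social context, one with some.
CANDIDATES = [
    Candidate("web-page-a", {"text_match": 0.6, "social_anchor_match": 0.0, "recency": 0.2}),
    Candidate("web-page-b", {"text_match": 0.5, "social_anchor_match": 0.8, "recency": 0.4}),
]

ranked_ids = [c.doc_id for c in rank(CANDIDATES, WEIGHTS)]
print(ranked_ids)
```

In this toy setup the page with slightly weaker text match but stronger social context ranks first, which is the kind of effect the ephemeral/archival connections are hypothesized to provide.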
This research involves the definition and evaluation of new representations and models for searching both social and archival information. Although we and others have done previous research related to some aspects of this proposal, this is the first work to address the full possibilities of search that exploits all the connections and contexts created by bringing together the two “worlds” of information.
Broader impact: Research in this area will have a direct impact on the tools for searching social media and web data. Given the ever-increasing usage of these tools and the relatively poor current state-of-the-art for social search, this could have a very broad impact, both in the home and the office. Our groups have demonstrated that research in this area attracts undergraduates and women. We also have a long record of successful distribution of open-source software and research results to academia and industry.
Goals of This Project:
The hypothesis that we are addressing is that searching either ephemeral or archival information will be enhanced using the connections between them. In other words, we are studying the explicit and implicit links between the ephemeral and archival networks to improve the effectiveness of search that is targeted at social data, web data, or both. We are evaluating this research on a range of search tasks from TREC and other sources.
To pursue these goals, we have been building testbeds that consist of ephemeral and archival information from similar time periods. Our team has had considerable experience building collections and distributing open source software as part of the Lemur Project. The new testbeds add ephemeral data to the Lemur web collection ClueWeb12. The ephemeral data includes blog, forum, and microblog data. We have also started to do additional data annotations to adapt the tasks of interest (blog, microblog, and web) to the new testbeds. We are developing new retrieval models for archival text based on important “contexts” in the ephemeral/archival network. We are also developing and evaluating models and representations that support more effective retrieval for social media, including aggregated search combining social and archival media.
The initial period of this project focused on developing the testbeds and evaluating the impact of the availability of the social data on some TREC tasks. In the past year, we have focused on finalizing our initial experiments on combining multiple sources of social and archival data, developing a new search task (answer passage retrieval) that can exploit social media, and starting research on a new approach to aggregated search involving social media.
Graduate Students Involved in This Project:
UMass Amherst CIIR:
Chia-Jung Lee (Ph.D. 2015)
UMass Amherst CIIR:
Lee, C. and Croft, W. B., "Incorporating Social Anchors for Ad Hoc Retrieval," in the 10th International Conference in the RIAO series (OAIR 2013), Lisbon, Portugal, May 22-24, 2013.
Lee, C. and Croft, W. B., "Building a Web Test Collection using Social Media," in the 36th Annual ACM SIGIR Conference (SIGIR 2013), Dublin, Ireland, July 28-Aug 1, 2013, pp. 757-760.
Keikha, M., Park, J. and Croft, W. B., "Evaluating Answer Passages using Summarization Measures," Proceedings of the 37th Annual ACM SIGIR Conference, Gold Coast, Australia, 6-11 July 2014, pp. 963-966.
Lee, C. and Croft, W. B., "Aggregating Heterogeneous Information Sources to Improve Search," UMass Amherst CIIR Technical Report, 2014.
Lee, C., Ai, Q., Croft, W. B. and Sheldon, D., "An Optimization Framework for Merging Multiple Result Lists," in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, (CIKM 2015), pp. 303-312, 2015.
Chen, R., Lee, C. and Croft, W. B., "On Divergence Measures and Static Index Pruning," Proceedings of the International Conference on Theoretical Information Retrieval (ICTIR 2015), Northampton, Massachusetts, September 27 - October 1, 2015, pp. 151-160.
Lee, C.J., "Exploiting Social Media Sources for Search, Fusion and Evaluation," UMass Amherst Thesis, 2015.
Yang, L., Ai, Q., Spina, D., Chen, R., Pang, L., Croft, W. B., Guo, J. and Scholer, F., "Beyond Factoid QA: Effective Methods for Non-factoid Answer Sentence Retrieval," in The Proceedings of 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy, March 20-23, 2016, pp. 115-128.
Ai, Q., Yang, L., Guo, J. and Croft, W. B., "Improving Language Estimation with the Paragraph Vector Model for Ad-hoc Retrieval," to appear in the Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 16), Pisa, Italy.
Kim, Y., Yeniterzi, R., and Callan, J., “Overcoming Vocabulary Limitations in Twitter Microblogs”, Proceedings of the Twenty-First Text REtrieval Conference (TREC 2012), National Institute of Standards and Technology, special publication 500-298, 2013.
S. Palakodety and J. Callan, “Query transformations for result merging”, Proceedings of the Twenty-Third Text REtrieval Conference (TREC 2014), National Institute of Standards and Technology, Gaithersburg, MD special publication 500-308, 2014.
Y. Wang, J. Callan, and B. Zheng, “Should we use the sample? Analyzing datasets sampled from Twitter's stream API”, to appear in ACM Transactions on the Web, 2016.
F. Martins, J. Magalhaes, and J. Callan, “Barbara made the news: Mining the behavior of crowds for time-aware learning to rank”, Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 667-676, 2016.
This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF IIS-1160894 and NSF IIS-1160862).
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.