Text Reuse and Information Flow

Principal Investigator:
W. Bruce Croft, PI
croft@cs.umass.edu

Center for Intelligent Information Retrieval (CIIR)
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264

Project Summary

A text collection such as a newswire archive or Web crawl typically contains a great deal of repeated information. Different authors may each present versions of a story or event, the same event may get presented in different ways for different audiences, and the facts of an event may get recapitulated each time it is presented. Sometimes such presentations have little in common with each other; at other times one may be a copy of the other with minor edits. Given a topic of interest, then, a sufficiently extensive archive may contain much of the history of the topic. In particular, it can plausibly be used to identify when particular ideas or statements originated. It can also be used to check facts or the sources of statements made about the topic. Our interest is in exploring whether we can identify alternative versions of the same information in order to reconstruct the information flow.

The extent to which passages of text are considered similar to each other can be thought of as falling somewhere on a similarity spectrum. At one end of this spectrum is identity; two documents that are the same as each other in every way clearly have the highest level of similarity possible. The other end of the spectrum is the standard task of information retrieval: two documents are a match if they concern the same information need. In this project, we are exploring the similarity spectrum in the context of analyzing information flow or reuse. The objective of this project is to develop methods for tracking and analysis of facts and concepts through a text corpus. In order to create such methods, we need a similarity measure that can reliably identify passages or sentences that share concepts and facts, that is, where information has been reused. This level of semantic resemblance is significantly stronger than simple topical similarity, but does not impose the syntactic similarity constraints typical of copy detection systems.

We are studying a range of approaches to detecting reuse at the sentence level, and a range of approaches for combining sentence-level evidence into document-level evidence. We are also developing algorithms for inferring information flow from timelines, sources, and reuse measures. Given the importance of the Web as a source for detecting reuse, we also focus on techniques that can make efficient use of this huge but unwieldy resource. The research is being evaluated using a range of corpora, such as news, Web crawls, and blogs, in order to explore the dimensions of reuse and information flow in different situations.
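As a rough illustration of what sentence-level reuse detection and its combination into document-level evidence might look like, the sketch below scores sentence pairs by word n-gram containment and then reports the share of a candidate document's sentences that appear to be reused from a source. This is not the project's method: the containment measure, the naive sentence splitter, the threshold of 0.5, and all function names are assumptions chosen for brevity. Note that a purely lexical measure like this sits near the copy-detection end of the spectrum and would miss reuse that has been heavily paraphrased.

```python
import re


def shingles(sentence, n=3):
    """Return the set of word n-grams (shingles) in a sentence."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    if len(words) < n:
        # Short sentences fall back to a single shingle holding all words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a, b, n=3):
    """Fraction of sentence a's shingles that also occur in sentence b."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa) if sa else 0.0


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not a real tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def document_reuse(source, candidate, threshold=0.5, n=3):
    """Share of candidate sentences whose best match in source meets the threshold."""
    src = split_sentences(source)
    cand = split_sentences(candidate)
    if not src or not cand:
        return 0.0
    reused = sum(
        1 for c in cand
        if max(containment(c, s, n) for s in src) >= threshold
    )
    return reused / len(cand)
```

Calling `document_reuse(source_text, candidate_text)` returns a value between 0 and 1. Containment is used instead of a symmetric measure so that a short sentence quoted inside a longer one still scores highly. Other ways of aggregating, such as averaging the best-match scores, would be equally reasonable starting points.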

This is one of the first projects to examine the “middle” of the similarity spectrum, and the results and experimental methodology should enable other researchers to study this issue. Detecting information flow is also a new problem that will require novel solutions. Using the Web as a source will require algorithms for automatic query generation and result analysis that will substantially advance previous research.

The research and its outcomes will have a significant impact on the design of tools for information analysts, both in business and government. These tools will become an important part of the methods used to validate and assess information that comes from a variety of sources of differing reliability. Scientists could use the same tools to look for reuse in scientific literature. The same need is starting to arise in situations where ordinary people interact with the Web through search engines and blogs. By providing a tool that would enable someone to rapidly check statements and their sources, we would be placing more power in the hands of people to do their own assessment of information quality. The results of the research will be published in papers, and we will also distribute code and demonstrations through the popular Lemur/Indri toolkit that is developed jointly at the University of Massachusetts and CMU.


Publications:

IR-570: (2007) Seo, J. and Croft, W. B., "Homepage Search in Blog Collections," CIIR Technical Report.

IR-595: (2007) Balasubramanian, N., Allan, J. and Croft, W. B., "A Comparison of Sentence Retrieval Techniques," in the Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR-07), pp. 513-514.

IR-626: (2008) Seo, J. and Croft, W. B., "Blog Site Search Using Resource Selection," Proceedings of the ACM 17th Conference on Information and Knowledge Management (CIKM), pp. 1053-1062.

IR-644: (2008) Bendersky, M. and Kurland, O., "Utilizing Passage-Based Language Models for Document Retrieval," Proceedings of ECIR 08, pp. 162-174.

IR-651: (2008) Bendersky, M. and Croft, W. B., "Discovering Key Concepts in Verbose Queries," Proceedings of the 31st Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), pp. 491-498.

IR-653: (2008) Seo, J. and Croft, W. B., "Local Text Reuse Detection," Proceedings of the 31st Annual International ACM SIGIR Conference (SIGIR 2008), pp. 571-578.

IR-669: (2008) Cartright, M. and Bendersky, M., "Towards Scalable Data-Driven Authorship Attribution," CIIR Technical Report.

IR-695: (2008) Bendersky, M. and Croft, W. B., "Finding Text Reuse on the Web," to appear in the Proceedings of the International Conference on Web Search and Data Mining (WSDM 2009), Barcelona, Spain, February 9-12, 2009.

IR-714: (2009) Xue, X., Dang, V. and Croft, W. B., "Query Substitution based on N-gram Analysis," CIIR Technical Report.

IR-717: (2009) Bendersky, M., Smith, D. and Croft, W. B., "Two-Stage Query Segmentation for Information Retrieval," Proceedings of the 32nd International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR '09), pp. 810-811.

IR-720: (2009) Dang, V., Xue, X. and Croft, W. B., "Context-based Quasi-Synonym Extraction," CIIR Technical Report.

IR-737: (2010) Huston, S. and Croft, W. B., "Evaluating Verbose Query Processing Techniques," to appear in the Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, Geneva, Switzerland, July 19-23, 2010.

IR-738: (2009) Dang, V. and Croft, W. B., "Query Reformulation Using Anchor Text," Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM) 2010, pp. 41-50.

IR-747: (2009) Cartright, M., Seo, J. and Lease, M., "UMass Amherst and UT Austin @ The TREC 2009 Relevance Feedback Track," Notebook Proceedings of the Text Retrieval Conference (TREC 2009), Gaithersburg, Maryland, USA, Nov 17-20, 2009.

IR-752: (2010) Seo, J. and Croft, W. B., "Geometric Representations for Multiple Documents," to appear in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), Geneva, Switzerland, July 19-23, 2010.

IR-758: (2010) Xue, X., Huston, S. and Croft, W. B., "Selecting Subsets of Verbose Queries using Conditional Random Fields," CIIR Technical Report.

IR-759: (2010) Seo, J. and Croft, W. B., "Unsupervised Estimation of Dirichlet Smoothing Parameters," to appear in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), Geneva, Switzerland, July 19-23, 2010.

IR-761: (2010) Chiu, S., Uysal, I. and Croft, W. B., "Evaluating a Text Reuse Architecture for the Web," CIIR Technical Report.

This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF IIS-0534383).