Information Retrieval Laboratory
Recent Papers
IR-866: (2011) Balasubramanian, N., "Query-Dependent Selection of Retrieval Alternatives," Ph.D. Dissertation, University of Massachusetts Amherst.
This thesis investigates query-dependent selection of retrieval alternatives for information retrieval systems. Information retrieval research focuses on developing query representations and retrieval models for ranking documents. However, no single query representation or retrieval model performs the best for all queries. Selecting the best representation or retrieval model for each query can yield improved performance. This thesis addresses an important question in dynamically combining retrieval alternatives: How to estimate the relative performance of the different alternatives for each query? We treat query-dependent selection as a general problem of selecting between the result sets of different alternatives. We develop a relative effectiveness estimation technique that uses retrieval-based features in a learning formulation to directly predict differences between the results sets. We apply this general technique to select between alternative reduced versions for long queries and to combine multiple ranking algorithms. We extend the selection problem to include efficiency constraints and develop easy-to-compute features for efficient combination of retrieval models. Finally, we provide an extensive analysis of selection performance to better understand when query-dependent selection can be useful.
IR-842: (2011) Seo, J., "Search Using Social Media Structures," Ph.D. Dissertation, University of Massachusetts Amherst.
Social applications on the Web have appeared as communication spaces for sharing knowledge and information. In particular, social applications can be considered valuable information sources because information in the applications is not only easily accessible but also revealing in that the information accrues via interactions between people. In this work, we address methods for finding relevant information in social media applications that use unique properties of these applications. In particular, we focus on three unique structures in social media: hierarchical structure, conversational structure, and social structure.
IR-834: (2011) Yi, X. "Discovering and Using Implicit Data for Information Retrieval," Ph.D. Dissertation, University of Massachusetts Amherst.
We present a language-modeling based general perspective for discovering implicit information and demonstrate how to use the discovered data for search. We investigate four specific IR challenges: (1) finding relevant records in semi-structured databases where many records contain incomplete or empty fields; (2) searching web pages that have little or no associated anchor text; (3) using click-through records in web query logs to help search pages that have no or very few clicks; and (4) discovering plausible geographic locations for web queries that contain no explicit geographic information.
IR-673: (2008) Allan, J., Carterette, B., Aslam, J., Pavlu, V., Dachev, B. and Kanoulas, E., "Million Query Track 2007 Overview," Proceedings of TREC 2007, November 6-9, 2007, National Institute of Standards and Technology, Gaithersburg, Maryland.
The Million Query (1MQ) track ran for the first time in TREC 2007. It was designed to serve two purposes. First, it was an exploration of ad-hoc retrieval on a large collection of documents. Second, it investigated questions of system evaluation, particularly whether it is better to evaluate using many shallow judgments or fewer thorough judgments.
IR-671:(2008) Kumaran, G., "Interactive Reformulation of Long Queries," Ph.D. dissertation, University of Massachusetts Amherst, May 2008.
We present new ways of interacting with a user based on query analysis and reformulation. Our goal is to not only improve retrieval performance but also help the user understand the retrieval process and collection she is searching. We do this by providing users information reflecting the potential impact their decisions will have on the retrieval process. This way, users can make more informed choices from the options presented to them by the retrieval system. Unlike most previous work in user interaction where a one-procedure-fits-all strategy was pursued, user interaction must be invoked only when there is potential for improvement. This is important as tedious user interaction can have an unfavorable impact on user experience. We present techniques for selective user interaction and show their utility in the context of two interaction techniques we have developed. Our results show that user interaction can be avoided in a vast number of cases without much deterioration in performance. User interaction can be made more productive by providing users with an optimally-sized set of high quality options. We present efficient techniques to determine such a set. When faced with a decision to interact with a user given a particular query, it is beneficial is determine the best interaction technique suited for that query. We solve this problem by obtaining implicit feedback from the user. By utilizing all the interaction-related techniques described in this thesis, we show through simulations and user studies that users can obtain better performance with less effort.
IR-669: (2008) Cartright, M. and Bendersky, M., "Towards Scalable Data-Driven Authorship Attribution," CIIR Technical Report, 2008.
Traditional authorship attribution approaches have made attempts at capturing features that were designed heuristically -- researchers guessed at which aspects of language would best separate one author from another and then performed experiments to see how valid their assumptions were. While this approach has met some success, it also proves to be unscalable -- most test collections to date have been on the size of 10 or less authors, which in the age of internet-style publication is an unrealistically low quantity. We believe that this approach to feature selection for authorship attribution adds unnecessary complexity to what the task really seems to be: a multi-class classification problem, and one where the most useful features can be easily discovered using a standard dimensionality reduction technique. We demonstrate the use of such a technique to dramatically reduce the number of used features for authorship attribution using an implementation of Support Vector Machines.
IR-665: (2008) Carterette, B., "Evaluation Measures for Preference Judgments," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
There has been recent interest in collecting user or assessor preferences, rather than absolute judgments of relevance, for the evaluation or learning of ranking algorithms. Since measures like precision, recall, and DCG are defined over absolute judgments, evaluation over preferences will require new evaluation measures that explicitly model them. We describe a class of such measures and compare absolute and preference measures over a large TREC collection.
IR-663: (2008) Bendersky, M. and Kurland, O., "Re-ranking search results using document-passage graphs," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
We present a novel passage-based approach to re-ranking documents in an initially retrieved list so as to improve precision at top ranks. While most previous work on passage-based document retrieval ranks a document based on the query-similarity of its constituent passages, our approach leverages information about the centrality of the document passages with respect to the initial document list. Passage centrality is induced over a bipartite document-passage graph, wherein edge weights represent document-passage similarities. Empirical evaluation shows that our approach yields effective re-ranking performance; furthermore, the performance is superior to that of previously proposed passage-based document ranking methods.
IR-657: (2008) Lin, J. and Smucker, M., "How Do Users Find Things with PubMed? Towards Automatic Utility Evaluation with User Simulations," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
In the context of document retrieval in the biomedical domain, this paper explores the complex relationship between the quality of initial query results and the overall utility of an interactive system. We demonstrate that a content-similarity browsing tool can compensate for poor retrieval results, and that the relationship between retrieval performance and overall utility is non-linear. Arguments are advanced with user simulations, which characterize the relevance of documents that a user might encounter with different browsing strategies. With broader implications to IR, this work provides a case study of how user simulations can be exploited as a formative tool for automatic utility evaluation. Simulation-based studies provide researchers with an additional evaluation tool to complement interactive and Cranfield-style experiments.
IR-655: (2008) Smucker, M., Allan, J. and Dachev, B., "Human Question Answering Performance using an Interactive Information Retrieval System," CIIR Technical Report, 2008.
Every day, people widely use information retrieval (IR) systems to find documents that answer their questions. Compared to these IR systems, question answering (QA) systems aim to speed the rate at which users find answers by retrieving answers rather than documents. To better understand how IR systems compare to QA systems, we measured the performance of humans using an interactive IR system to answer questions. We conducted our experiments within the framework of the TREC 2007 complex, interactive question answering (ciQA) track. We found that the average QA system was comparable to humans using an IR system. Our results also show that for some users IR systems can be powerful question answering systems. After only 5 minutes of usage per question, one user of the IR system obtained an average F (beta = 3) score of 0.800, which outperformed the best QA system by 27% and the average QA system by 40%. After 10 minutes of usage, 5 of 8 users of the IR system obtained a higher performance than the average QA system. To achieve superior performance, future QA systems should combine the flexibility and precision of IR systems with the ease-of-use and recall advantages of QA systems.
IR-654: (2008) Lee, K., Croft, W. B. and Allan, J., "A Cluster-Based Resampling Method for Pseudo- Relevance Feedback," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
Typical pseudo-relevance feedback methods assume the topretrieved documents are relevant and use these pseudo-relevant documents to expand terms. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a clusterbased resampling method to select better pseudo-relevant documents based on the relevance model. The main idea is to use document clusters to find dominant documents for the initial retrieval set, and to repeatedly feed the documents to emphasize the core topics of a query. Experimental results on large-scale web TREC collections show significant improvements over the relevance model. For justification of the resampling approach, we examine relevance density of feedback documents. A higher relevance density will result in greater retrieval accuracy, ultimately approaching true relevance feedback. The resampling approach shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy in pseudo-relevance feedback. This result indicates that the proposed method is effective for pseudo-relevance feedback.
IR-653: (2008) Seo, J. and Croft, W. B. , "Local Text Reuse Detection," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers because of its importance in Web search. Local text reuse occurs when sentences, facts or passages, rather than whole documents, are reused and modified. Detecting this type of reuse can be the basis of new tools for text analysis. In this paper, we introduce a new approach to detecting local text reuse and compare it to other approaches. This comparison involves a study of the amount and type of reuse that occurs in real documents, including TREC newswire and blog collections.
IR-652: (2008) Xue, X., Croft, W. B. and Jeon, J., "Retrieval Models for Question and Answer Archives," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
Retrieval in a question and answer archive involves finding good answers for a user's question. In contrast to typical document retrieval, a retrieval model for this task can exploit question similarity as well as ranking the associated answers. In this paper, we propose a retrieval model that combines a translation-based language model for the question part with a query likelihood approach for the answer part. The proposed model incorporates word-to-word translation probabilities learned through exploiting different sources of information. Experiments show that the proposed translation based language model for the question part outperforms baseline methods significantly. By combining with the query likelihood language model for the answer part, substantial additional effectiveness improvements are obtained.
IR-651: (2008) Bendersky, M. and Croft, W. B. , "Discovering Key Concepts in Verbose Queries," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
Current search engines do not, in general, perform well with longer, more verbose queries. One of the main issues in processing these queries is identifying the key concepts that will have the most impact on effectiveness. In this paper, we develop and evaluate a technique that uses query-dependent, corpus-dependent, and corpus-independent features for automatic extraction of key concepts from verbose queries. We show that our method achieves higher accuracy in the identification of key concepts than standard weighting methods such as inverse document frequency. Finally, we propose a probabilistic model for integrating the weighted key concepts identified by our method into a query, and demonstrate that this integration significantly improves retrieval effectiveness for a large set of natural language description queries derived from TREC topics on several newswire and web collections.
IR-650: (2008) Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. and Allan, J., "Evaluation Over Thousands of Queries," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
Information retrieval evaluation has typically been performed over several dozen queries, each judged to near-completeness. There has been a great deal of recent work on evaluation over much smaller judgment sets: how to select the best set of documents to judge and how to estimate evaluation measures when few judgments are available. In light of this, it should be possible to evaluate over many more queries without much more total judging effort. The Million Query Track at TREC 2007 used two document selection algorithms to acquire relevance judgments for more than 1,800 queries. We present results of the track, along with deeper analysis: investigating tradeoffs between number of queries and number of judgments shows that, up to a point, evaluation over more queries with fewer judgments is more cost-effective than fewer queries with more judgments; in fact, total assessor effort can be reduced by 95\%. A consequence of this is smaller test collections, and with good estimates of confidence in the evaluation, we can further identify the systems that are most in need of additional judgments.
IR-647: (2008) Kumaran, G. and Allan, J., "Effective and Efficient User Interaction for Long Queries," Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, July 20-24, 2008.
Handling long queries can involve either pruning the query to retain only the important terms (reduction), or expanding the query to incorporate related concepts (expansion). While automatic techniques to do so exist, roughly 25% performance improvements in terms of MAP have been realized in past work through interactive variants. We show that selectively reducing or expanding a query leads to an average improvement of 51% in MAP over the baseline for standard TREC test collections. We demonstrate how user interaction can be used to achieve this improvement. Most interaction techniques present users with a fixed number of options for all queries. We achieve improvements by interacting less with the user, i.e, we present techniques to identify the optimal number of options to present to a user, resulting in an interface with an average of 70% fewer options to consider. Previous algorithms supporting interactive reduction and expansion are exponential in nature. To extend their utility to operational environments, we present techniques to make the complexity of the algorithms polynomial. We finally present an analysis of long queries that continue to exhibit poor performance in spite of our new techniques.
IR-646: (2008) Bekkerman, R., "Combinatorial Markov Random Fields and their Applications to Information Organization," Ph.D. dissertation, University of Massachusetts Amherst, 2008.
We propose a new type of undirected graphical models called a Combinatorial Markov Random Field (Comraf) and discuss its advantages over existing graphical models. We develop an efficient inference methodology for Comrafs based on combinatorial optimization of information-theoretic objective functions; both global and local optimization schema are discussed. We apply Comrafs to multi-modal clustering tasks: standard (unsupervised) clustering, semi-supervised clustering, interactive clustering, and one-class clustering. For the one-class clustering task, we analytically show that the proposed optimization method is optimal under certain simplifying assumptions. We empirically demonstrate the power of Comraf models by comparing them to other state-of-the-art machine learning techniques, both in text clustering and image clustering domains. For unsupervised clustering, we show that Comrafs consistently and significantly outperform three previous state-of-the-art clustering techniques on six real-world textual datasets. For semi-supervised clustering, we show that the Comraf model is superior to a well-known constrained optimization method. For interactive clustering, Comraf obtains higher accuracy than a Support Vector Machine, trained on a large amount of labeled data. For one-class clustering, Comrafs demonstrate superior performance over two previously proposed methods. We summarize our thesis by giving a comprehensive recipe for machine learning modeling with Comrafs.
IR-645: (2008) Diaz, F., "Autocorrelation and Regularization of Query-based Information Retrieval Scores ," Ph.D. dissertation, University of Massachusetts Amherst, 2008.
Query-based information retrieval refers to the process of scoring documents given a short natural language query.! Query-based information retrieval systems have been developed to support searching diverse collections such as the world wide web, personal email archives, news corpora, and legal collections.!!This thesis is motivated by one of the tenets of information retrieval: the cluster hypothesis. We define a design principle based on the cluster hypothesis which states that retrieval scores should be locally consistent. We refer to this design principle as score autocorrelation. Our experiments show that the degree to which retrieval scores satisfy this design principle correlates positively with system performance. We use this result to define a general, black box method for improving the local consistency of a set of retrieval scores. We refer to this process as local score regularization. We demonstrate that regularization consistently and significantly improves retrieval performance for a wide variety of baseline algorithms. Regularization is closely related to classic techniques such as pseudo-relevance feedback and cluster-based retrieval. We demonstrate that the effectiveness of these techniques may be explained by their regularizing behavior. We argue that regularization should be adopted either as a generic post-processing step or as a fundamental design principle for retrieval models.
IR-644: (2008) Bendersky, M. and Kurland, O., "Utilizing Passage-Based Language Models for Document Retrieval," Proceedings of ECIR 08, pp.162 - 174.
We show that several previously proposed passage-based document ranking principles, along with some new ones, can be derived from the same probabilistic model. We use language models to instantiate specific algorithms, and propose a passage language model that integrates information from the ambient document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we propose yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; the latter outperform a document-based relevance model. We also show that the homogeneity measures are effective means for integrating document-query and passage-query similarity information for document retrieval.
IR-562: (2008) Metzler, D., Strohman, T. and Croft, W. B. , "A Statistical View of Binned Retrieval Models," Proceedings of ECIR 2008, pp. 175-186.
Parameterized retrieval models are the most commonly studied type of information retrieval model. These models assign term weights according to a parameterized function, composed of standard information retrieval features, such as term frequency, inverse document frequency, and document length. The recently proposed document-centric impact model has suggested that very simple term weighting functions can be as effective as the more complex weighting functions. In this work, we describe a probabilistic model that is inspired by the document-centric impact model. In addition, we propose novel techniques for binning terms and estimating probabilities discriminatively. We analyze various aspects of our model and show that it is competitive in terms of effectiveness and efficiency when evaluated against several TREC data sets.