Supporting Effective Access through User- and Topic-Based Language Models

W.B. Croft, J. Allan
University of Massachusetts, Amherst

N.J. Belkin
Rutgers University

Contact Information

W. Bruce Croft
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610

James Allan
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610

Nicholas J. Belkin
School of Communication, Information and Library Studies
Rutgers University
4 Huntington Street
New Brunswick, NJ 08901-1071

Project Award Information

National Science Foundation Award Number: IIS-9907018

Duration: 10/01/2000 -- 09/30/2004


Keywords

retrieval, language models, user models, topic models, query models, evaluation, interaction

Project Summary/Activities

In this project (called Mongrel), we were concerned with developing approaches to user and domain or topic modeling that would significantly improve the effectiveness of information retrieval. This approach is based on Ponte and Croft’s work on language models for information retrieval (Ponte and Croft, 1998). In this work, it is assumed that associated with every document or group of documents there are one or more probability distributions that model how the text in the document can be generated. This generative model is quite different from the standard probabilistic classification models described in Van Rijsbergen (1979) and has the following advantages:

a. With this model, we are much closer to finding a common basis for information retrieval and other language technologies, such as speech recognition and information extraction. Such a basis would allow us to explore more sophisticated integration of these technologies and to exploit powerful statistical techniques developed in any of these applications.

b. The model allows us to provide formal descriptions of a number of processes in information retrieval that were previously ad hoc. These include so-called tf.idf weights, clustering, phrase matching, query expansion, cross-lingual retrieval, and distributed retrieval.

c. Experiments with language models have produced very encouraging results, with quite simple models equaling or surpassing the performance of the best systems based on other approaches.

d. Language models can be viewed as models of the important topics that are covered by a document or group of documents. They also have the potential of representing the topics of interest to a user or group of users.
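The generative view above can be made concrete with a small sketch: score each document by the likelihood of the query under a unigram language model estimated from that document, smoothed with collection statistics, in the spirit of Ponte and Croft (1998). The toy corpus, function names, and the Jelinek-Mercer smoothing weight below are illustrative assumptions, not the exact model used in the project.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score log P(query | doc) with Jelinek-Mercer smoothing:
    P(w|d) = lam * tf(w,d)/|d| + (1 - lam) * cf(w)/|C|."""
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    doc_len, col_len = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_doc = doc_tf[w] / doc_len
        p_col = col_tf[w] / col_len
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score

# Toy example: two "documents" about different senses of "java".
d1 = "java programming language compiler code".split()
d2 = "java island vacation beach travel".split()
collection = d1 + d2
q = "java code".split()
ranked = sorted([("d1", d1), ("d2", d2)],
                key=lambda nd: query_likelihood(q, nd[1], collection),
                reverse=True)
print([name for name, _ in ranked])  # → ['d1', 'd2']
```

Because the document about the programming language generates the query "java code" with higher probability, it ranks first.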

It is this last feature of language models that we investigated in this project. We believe we have shown that language models provide more rigorous and effective approaches for capturing the ideas and concepts described in earlier work on user modeling, such as Belkin et al. (1982) and Daniels et al. (1985). This earlier work suggested the importance of word and phrase associations for capturing a user's context and representing domains, but the approaches used were ad hoc, the testbeds were very limited, and the general information infrastructure at the time was not conducive to user experiments. The combination of the language modeling approach, the large testbeds developed for TREC, and the new Internet infrastructure provided new opportunities to address the critical problem of user and domain modeling.

Substantial progress on all project goals was made in the last year of the project, resulting in a large number of publications. Of particular interest is the work related to topic models constructed using clustering, work on implicit sources of evidence as indicators of document preference, and the high-accuracy retrieval research that has been evaluated in the TREC HARD track and with other data.

The Mongrel project combined the expertise and experience of the University of Massachusetts group in the development and testing of information retrieval models and systems with that of the Rutgers University group in user modeling and user studies of interactive systems. A full description of the activities and their corresponding findings is presented below.


A topic-based language model identifies characteristics of the language used in a particular domain or topic, while a user-based language model identifies characteristics of the language that a user or group of users associates with a particular domain or topic.

As an example, consider the following information needs that indicate a user's interest in Java:

a. I want to find the least expensive way to vacation in Java.
b. I need to know where Java is.
c. I need to write a survey of the current political situation in Java.
d. I want to learn the Java programming language.
e. I want to buy some dark roast Java coffee.

Our premise was that documents that are relevant to each of these kinds of queries will have different language characteristics associated with them. One collaborative effort between the University of Massachusetts Amherst and Rutgers University focused on:

* identifying the language characteristics of these types of needs (user),
* identifying the language characteristics of the groups of documents that satisfy each of these types of needs (topic), and
* matching these needs and document clusters accordingly.

We evaluated the automatic creation of personal topic models using two language model-based clustering techniques. The results of these methods were compared with user-defined topic classes of web pages drawn from personal web browsing histories collected over a five-week period. The histories and topics were gathered during a naturalistic case study of the online information search and use behavior of two users. Users in this initial study, undertaken at Rutgers, were provided with laptop computers, and their activities were monitored with logging and evaluation software and online questionnaires. The users agreed to use the laptops for all their information searching and general computing activities during the study. In this study, we also investigated the effectiveness of using display time and retention behaviors as implicit evidence for weighting documents during topic model creation. Results, reported in Kelly et al. (2004), show that agglomerative techniques (specifically, average-link clustering) provide the most effective methodology for building topic models, even when topic evidence and implicit evidence are ignored. A second study was conducted with seven new users over a period of 3.5 months. The results of this study, reported in Kelly (2004), indicate that, for display time and retention behaviors to be useful sources of implicit evidence, they need to be interpreted with respect to individual differences and the nature of the tasks leading to information seeking.
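Average-link agglomerative clustering, the technique found most effective above, can be sketched as follows. The cosine term-vector representation, the toy pages, and all function names here are assumptions for illustration, not the project's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def average_link(items, n_clusters):
    """Agglomerative clustering: repeatedly merge the pair of clusters
    with the highest average pairwise similarity between their members."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > n_clusters:
        best, pair = -1.0, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(items[a], items[b])
                        for a in clusters[i] for b in clusters[j]]
                avg = sum(sims) / len(sims)
                if avg > best:
                    best, pair = avg, (i, j)
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy browsing history: two pages about Java-the-language, one about travel.
pages = [{"java": 2, "code": 1},
         {"java": 1, "compiler": 2},
         {"beach": 2, "travel": 1}]
```

Clustering these three pages into two topics groups the two programming pages together and leaves the travel page on its own.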

In other work at UMass Amherst, we studied cluster-based retrieval using topic language models. Cluster-based retrieval is based on the hypothesis that similar documents will match the same information needs (Van Rijsbergen, 1979). In our experiments, we showed, for the first time, that cluster-based retrieval can perform consistently across collections of realistic size, and that significant improvements over document-based retrieval can be obtained fully automatically, without relevance information provided by humans. We believe the most important reason for these results is that topic language models provide a principled way to exploit document-cluster relationships and factor them directly into the retrieval model. A second reason is that language models reserve free parameters for smoothing and allow the use of sophisticated smoothing techniques, which may capture the characteristics of clusters and documents better than previously used retrieval methods. This work was described in Liu and Croft (2004).
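The role of smoothing in cluster-based retrieval can be sketched as follows: a document's language model is interpolated with its cluster's model and the collection model, so cluster-typical words absent from the document still receive probability mass. The interpolation weights, toy data, and names below are illustrative assumptions, not the exact formulation of Liu and Croft (2004).

```python
from collections import Counter

def lm(tokens):
    """Maximum-likelihood unigram model from a token list."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in tf.items()}

def smoothed_prob(w, doc_lm, cluster_lm, coll_lm, a=0.6, b=0.3):
    """P(w|d) interpolating document, cluster, and collection models:
    a * P_ml(w|d) + b * P_ml(w|cluster) + (1-a-b) * P_ml(w|collection)."""
    return (a * doc_lm.get(w, 0.0)
            + b * cluster_lm.get(w, 0.0)
            + (1 - a - b) * coll_lm.get(w, 0.0))

# Toy example: "code" never occurs in the document, but is common in
# the document's topical cluster, so it still gets nonzero probability.
doc = "java compiler".split()
cluster = "java compiler code language bytecode".split()
collection = cluster + "java island beach travel".split()
p = smoothed_prob("code", lm(doc), lm(cluster), lm(collection))
```

Here `p` is positive even though the document itself never contains "code", which is exactly the document-cluster relationship the retrieval model exploits.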

In another part of the project, UMass explored using simple translation topic models for sentence retrieval. We demonstrated the framework using TREC 2003 data and showed that it performs better than retrieval based on query likelihood, and on par with other systems, as reported in Murdock and Croft (2004). In a project concerned with high-accuracy retrieval techniques (Shah and Croft, 2004), we presented work aimed at achieving high accuracy in ad hoc document retrieval by incorporating approaches from question answering. We focused on placing the first relevant result as high as possible in the ranked list, evaluating with the mean reciprocal rank (MRR) of the first relevant result rather than traditional precision and recall. Experiments on TREC 2003 data support the approach of using MRR and incorporating QA techniques to achieve high accuracy in the ad hoc retrieval task.
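The MRR measure used in this evaluation is simple to state as code; the function and variable names below are illustrative.

```python
def mean_reciprocal_rank(rankings):
    """MRR over a set of queries. Each element of `rankings` is a ranked
    result list of booleans (True = relevant). The reciprocal rank is
    1/position of the first relevant result, or 0 if none is retrieved."""
    total = 0.0
    for ranked in rankings:
        for pos, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / pos
                break
    return total / len(rankings)

# Two queries: first relevant hit at rank 1 and at rank 3.
runs = [[True, False], [False, False, True]]
print(mean_reciprocal_rank(runs))  # (1/1 + 1/3) / 2 ≈ 0.667
```

Unlike average precision, MRR rewards only the position of the first relevant document, which matches the high-accuracy goal described above.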

In the TREC 2003 and 2004 HARD tracks (Allan 2003), we explored the ability of a system to use information about a user to improve the effectiveness of document retrieval. Specifically, we used metadata about the user and the query to improve the language model for relevant documents and to focus on material that is useful for that user at that time. At Rutgers, work in the TREC HARD track in 2003 and 2004 concentrated entirely on improving retrieval effectiveness by using metadata about the user which could, in principle, have been determined by observing previous or current searching behaviors. Results in 2003 were mixed (Belkin et al. 2004), primarily because the training data were insufficient. However, using related texts for query expansion improved performance, as did one method of taking account of the desired genre of documents. In TREC 2004 (Belkin et al., in press), we found that taking account of the user's familiarity with the topic, by relating the level of familiarity to the readability of documents and the extent of use of concrete concepts in them, resulted in improved retrieval performance.

In the TREC novelty tracks (2002 and 2004) we explored a wide range of models for finding both relevant and, more interestingly, novel material within a small set of documents. We combined models of relevance with models of "already seen" to find material that was on topic but not redundant. In 2002 we were one of the top two performing systems on this task (Allan et al 2003); results are not yet in for 2004.

We have invested considerable effort in developing methods that automatically recognize whether a query is ambiguous or otherwise vague, and therefore in need of improvement, either by soliciting more information from the user or by automatic query enhancement techniques such as query expansion (Cronen-Townsend et al. 2004). In addition, we have investigated whether person-specific models can be extracted from text in order to better determine when that person is relevant to a query, or to otherwise interconnect people (Raghavan et al. 2003; Allan and Raghavan 2003).
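One way such vagueness can be quantified is with a clarity-style score: the KL divergence between a query language model and the collection language model, in the spirit of Cronen-Townsend and Croft (2002). The sketch below simplifies by estimating the query model as the MLE over the top-ranked documents; that simplification, the toy data, and all names are assumptions for illustration.

```python
import math
from collections import Counter

def clarity(top_docs, collection):
    """Simplified clarity score: KL divergence (in bits) between a query
    model (MLE over concatenated top-ranked documents) and the collection
    model. Vague queries retrieve collection-like text and score near
    zero; focused queries score higher. Assumes every word in the top
    documents also occurs in the collection."""
    q_tf = Counter(w for d in top_docs for w in d)
    q_len = sum(q_tf.values())
    c_tf = Counter(collection)
    c_len = len(collection)
    score = 0.0
    for w, c in q_tf.items():
        p_q = c / q_len
        p_c = c_tf[w] / c_len
        score += p_q * math.log2(p_q / p_c)
    return score

coll = ("java code compiler bytecode java island beach travel "
        "coffee roast java").split()
focused = ["java code compiler".split()]  # concentrated on one sense
diffuse = [coll]                          # mirrors the whole collection
```

A query whose top documents mirror the collection's word distribution scores zero, while one whose top documents concentrate on a single sense of "java" scores well above it.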

Substantial work was carried out at Rutgers on the use of searcher behaviors, prior to and during searching activities, as implicit sources of evidence for determining document usefulness. This work has resulted in one Ph.D. dissertation (Kelly, 2004), one conference paper (Kelly and Belkin, 2004) and a workshop presentation (Belkin, Muresan and Zhang, 2004) to date, with more publications under submission or in preparation. The data for this part of the project were collected over a three-month period, during which all computing and searching activities of seven subjects were logged. In addition, the subjects evaluated the usefulness of a significant sample of all of the pages retrieved during their searching activities, with respect to the task that led to the searching activity and the specific topic of the search. Results indicate that, although the display time and retention of a page may be related to its usefulness, this measure must be tailored to the specific user, and moderated according to the type of task in which the user is engaged, in order to be applied accurately.

Rutgers and UMass continued their collaborative research on identifying documents relevant to fact- and process-oriented questions. The data gathering and analysis portion of this research has been completed, and the results have been written up for submission as a journal article. The results of this line of research indicate that it is indeed possible to automatically distinguish between these two types of questions, and that it is possible, based on formal characteristics of documents, to distinguish between documents that are more likely to be relevant to each of the two types of questions.

Publications

AbdulJaleel, N., Corrada-Emmanuel, A., Li, Q., Liu, X., Wade, C. and Allan, J., "UMass at TREC 2003: HARD and QA (Notebook version)", Proceedings of TREC 2003, NIST Special Publication 500-255, (2003), p. 715. Published

Allan, J., Bolivar, A. and Wade, C., "Retrieval and Novelty Detection at the Sentence Level", Proceedings of the SIGIR '03 Conference, (2003), p. 314. Published

Allan, J. and Raghavan, H., "A Probabilistic Model of Named Entities: You Are What They Say You Are", CIIR Technical Report, (2003), p. 1. Published

Allan, J. and Raghavan, H., "Entity Models: Construction and Applications", CIIR Technical Report, (2003), p. 1. Published

Allan, J., "HARD Track Overview in TREC 2003: High Accuracy Retrieval from Documents", Proceedings of TREC 2003, NIST Special Publication 500-255, (2003), p. 24. Published

Allan, J., "HARD Track Overview in TREC 2004 (Notebook): High Accuracy Retrieval from Documents", TREC 2004, Gaithersburg, MD, November 2004, (2004).

Allan, J. and Raghavan, H., "Using Part-of-speech Patterns to Reduce Query Ambiguity", Proceedings of SIGIR 2002, (2002), p. 307. Published

Belkin, N.J., Chaleva, I., Cole, M., Li, Y.-L., Liu, L., Liu, Y.-H., Muresan, G., Smith, C.L., Sun, Y., Yuan, X.-J., Zhang, X.-M., "Rutgers' HARD Track Experiences at TREC 2004", In E.M. Voorhees & L.P. Buckland (Eds.), The Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, MD: NIST, (2004). Submitted

Belkin, N.J., Cool, C., Kelly, D., Kim, G., Kim, J.-Y., Lee, H.-J., Muresan, G., Tang, M.-C., & Yuan, X.-J., "Interaction and query length in interactive retrieval", In D. Harman & E. Voorhees (Eds.), The Eleventh Text REtrieval Conference (TREC 2002), Washington, D.C.: GPO, (2003), p. 539. Published

Belkin, N.J., Cool, C., Kelly, D., Lee, H.-J., Muresan, G., Tang, M.-C., & Yuan, X.-J., "Query length in interactive information retrieval", Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), (2003), p. 205. Published

Belkin, N.J., Kelly, D., Lee, H.-J., Li, Y.-L., Muresan, G., Tang, M.-C., Yuan, X.-J., Zhang, X.-M., "Rutgers' HARD and Web Interactive Track Experiences at TREC 2003", The Twelfth Text REtrieval Conference, TREC 2003, (2003), p. 532. Published

Belkin, N.J., Muresan, G. & Zhang, X.-M., "Measuring Web search effectiveness: Rutgers at interactive TREC", In C. Watters & A. Spink (Eds.), WF12 Measuring Web Effectiveness: The User Perspective, Workshop held at WWW 2004, (2004), p. 1. Published

Belkin, N.J., Muresan, G. and Zhang, X.-M., "Using user's context for IR personalization", Presented at the SIGIR 2004 Workshop on Information Retrieval in Context, Sheffield, England, 29 July 2004, (2004), p. 1. Published

Croft, W.B., Cronen-Townsend, S., and Lavrenko, V., "Relevance Feedback and Personalization: A Language Model Perspective", Proceedings of the Joint DELOS-NSF Workshop on Personalization and Recommender Systems in Digital Libraries, (2001), p. 49-54. Published

Cronen-Townsend, S., Zhou, Y., and Croft, W.B., "A Language Modeling Framework for Selective Query Expansion", CIIR Technical Report, (2004), p. 1. Published

Cronen-Townsend, S. and Croft, W.B., "Quantifying Query Ambiguity", Proceedings of HLT 2002, (2002), p. 94-98. Published

Cronen-Townsend, S., Zhou, Y., and Croft, W.B., "Predicting Query Performance", Proceedings of SIGIR 2002, (2002), p. 299-306. Published

Diaz, F. and Allan, J., "Browsing-based User Language Models for Information Retrieval", CIIR Technical Report, (2003), p. 1. Published

Diaz, F., "Using Wearable Computers to Construct Semantic Representations of Physical Spaces", Proceedings of the ISWC '02 Conference, (2002), p. 197. Published

Kelly, D. & Belkin, N.J., "A User Modeling System for Personalized Interaction and Tailored Retrieval in Interactive IR", Proceedings of the Annual Conference of the American Society for Information Science and Technology (ASIST 2002), (2002), p. 316. Published

Kelly, D. and Belkin, N.J., "Display time as implicit feedback: understanding task effects", Proceedings of the Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), (2004), p. 377. Published

Kelly, D., "Understanding implicit feedback and document preference: a naturalistic user study", Ph.D. Dissertation, School of Communication, Information and Library Studies, Rutgers University, New Brunswick, NJ, (2004), p. 1. Published

Kelly, D., Diaz, F., Belkin, N. and Allan, J., "A User-Centered Approach to Evaluating Topic Models", Proceedings of the 26th European Conference on Information Retrieval, (2004), p. 27. Published

Kelly, D., Murdock, V., Yuan, X., Croft, W.B., Belkin, N., "Features of Documents Relevant to Task- and Fact-Oriented Questions", Proceedings of the Conference on Information and Knowledge Management (CIKM '02), (2002), p. 645. Published

Liu, X. and Croft, W.B., "Cluster-Based Retrieval Using Language Models", Proceedings of SIGIR '04, (2004), p. 186. Published

Murdock, V. and Croft, W.B., "Simple Translation Models for Sentence Retrieval in Factoid Question Answering", Proceedings of the SIGIR 2004 Workshop on Information Retrieval for Question Answering, (2004), p. 31. Published

Murdock, V. and Croft, W.B., "Improving Retrieval for Task-Oriented Questions", Proceedings of SIGIR 2002, (2002), p. 355. Published

Raghavan, H., McCallum, A. and Allan, J., "Using Markov Random Fields for Relational Classification in Social Networks Mined from Text", CIIR Technical Report, (2003), p. 1.

Shah, C. and Croft, W.B., "Evaluating High Accuracy Retrieval Techniques", Proceedings of SIGIR '04, (2004), p. 2. Published

Wu, M., Muresan, G., McLean, A., Tang, M.-C., Wilkinson, R., Li, Y., Lee, H.-J. & Belkin, N.J., "Human versus machine in the topic distillation task", Proceedings of the Twenty-Seventh Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), (2004), p. 385. Published

Area References

N.J. Belkin, R.N. Oddy, H.M. Brooks, 1982. 'ASK for information retrieval: Part I: background and theory; Part II: Results of a design study.' Journal of Documentation, 38(2-3), p. 61-71; 145-164

W.B. Croft (ed.), Advances in Information Retrieval. Kluwer Academic, 2000.

P.J. Daniels, H.M. Brooks, N.J. Belkin, 1985. 'Using problem structures for driving human-computer dialogues'. Proceedings of RIAO '85, p. 131-149.

J. Ponte, W.B. Croft, 1998. 'A Language Modeling Approach to Information Retrieval.' Proceedings of the 21st International Conference on Research and Development in Information Retrieval, p. 275-281.

C.J. Van Rijsbergen, 1979. Information Retrieval, Second edition, Butterworths, London.

This work was supported by the National Science Foundation (Award Numbers IIS-9907018 and IIS-9911942).