Oren Kurland
Cornell University

Title: Corpus Structure, Language Models, and Ad Hoc Information Retrieval
Date: Mar 7th, 2005
Location: CMPSCI 151

Abstract
--------
The fundamental principle of the language-modeling approach to ad hoc information retrieval is that, given a query, a document is ranked according to the probability assigned to the query by a language model constructed from that document. Most previous work on this approach focuses on document-specific characteristics and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by incorporating information drawn from clusters of similar documents.

In this talk, we will first present the framework and describe a suite of new algorithms that are natural instantiations of it. Even the simplest of these typically outperforms the standard language-modeling approach. We will then discuss connections to other work, such as latent-variable models and cluster-based retrieval using language models (Liu & Croft, 2004), and present results demonstrating the effectiveness of our algorithms with respect to aspects such as information representation and estimation methods.
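
As a rough illustration of the query-likelihood principle described in the abstract, the Python sketch below scores each document by the log-probability its language model assigns to the query, with the document model mixed with a model of the cluster the document belongs to and a corpus-wide model. The function names, the toy documents, the hand-assigned clusters, and the weights `lam` and `beta` are all assumptions made for illustration. Linear interpolation is only one simple way to combine the models and is not claimed to be one of the algorithms presented in the talk.

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def query_log_likelihood(query, doc_lm, cluster_lm, corpus_lm, lam=0.6, beta=0.3):
    """Log-probability of the query under a mixture of three unigram models.

    lam and beta are illustrative weights (assumptions, not tuned values);
    the remaining mass 1 - lam - beta goes to the corpus model.
    """
    gamma = 1.0 - lam - beta
    score = 0.0
    for term in query:
        p = (lam * doc_lm.get(term, 0.0)
             + beta * cluster_lm.get(term, 0.0)
             + gamma * corpus_lm.get(term, 0.0))
        if p <= 0.0:
            return float("-inf")  # term unseen everywhere in the corpus
        score += math.log(p)
    return score


# Toy corpus; the cluster assignment is made by hand for this example.
docs = {
    "d1": "language models for retrieval".split(),
    "d2": "cluster based retrieval with language models".split(),
    "d3": "cooking pasta with tomato sauce".split(),
}
cluster_of = {"d1": "A", "d2": "A", "d3": "B"}

# Build one language model per document, per cluster, and for the corpus.
cluster_tokens = {}
for name, tokens in docs.items():
    cluster_tokens.setdefault(cluster_of[name], []).extend(tokens)

doc_lms = {name: mle(tokens) for name, tokens in docs.items()}
cluster_lms = {cid: mle(tokens) for cid, tokens in cluster_tokens.items()}
corpus_lm = mle([t for tokens in docs.values() for t in tokens])

query = "cluster retrieval".split()
scores = {
    name: query_log_likelihood(query, doc_lms[name],
                               cluster_lms[cluster_of[name]], corpus_lm)
    for name in docs
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setting, d1 never mentions "cluster", yet it still receives probability mass for that term from the cluster it shares with d2, which is the kind of corpus-structure information a purely document-based model would not use.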