Trevor Strohman

Ph.D. Computer Science
University of Massachusetts, Amherst

Now at Google.

Research Interests

I am interested in algorithms, technologies, and systems that help people understand arbitrarily structured datasets (particularly large ones).

Indri

I was the primary developer of Indri, a scalable language modeling search engine that supports structured queries. Indri is a part of the Lemur project, and includes code and substantial insight from Howard Turtle and Don Metzler.

Galago

Galago is a new search engine I have developed for my dissertation work. It includes a distributed computation framework called TupleFlow which is an extension of MapReduce. In addition, it can build three different kinds of indexes, two of which are used in my dissertation, and a third kind which supports a subset of the Indri query language.

Publications

Efficient Processing of Complex Features for Information Retrieval
Strohman, T. "Efficient Processing of Complex Features for Information Retrieval," Ph.D. Dissertation, University of Massachusetts Amherst, 2007. [info] [pdf] [slides]
Efficient Document Retrieval in Main Memory
Strohman, T., Croft, W. B., "Efficient Document Retrieval in Main Memory," SIGIR 2007, 2007. [info] [pdf] [slides] [abstract]

Disk access performance is a major bottleneck in traditional information retrieval systems. Compared to system memory, disk bandwidth is poor, and seek times are worse.

We circumvent this problem by considering query evaluation strategies in main memory. We show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness.

We evaluate our techniques using Galago, a new retrieval system designed for efficient query processing. Our system achieves a 69% improvement in query throughput over previous methods.

Recommending Citations for Academic Papers
Strohman, T., Croft, W. B., Jensen, D. "Recommending Citations for Academic Papers," SIGIR 2007, 2007. [info] [pdf] [poster] [abstract]

We approach the problem of academic literature search by considering an unpublished manuscript as a query to a search system. We use the text of previous literature as well as the citation graph that connects it to find relevant related material. We evaluate our technique with manual and automatic evaluation methods, and find an order of magnitude improvement in mean average precision as compared to a text similarity baseline.

Recommending Citations for Academic Papers
Strohman, T., Croft, W. B., Jensen, D. "Recommending Citations for Academic Papers," CIIR Technical Report IR-466, 2006. [info] [pdf] [abstract]

Substantial effort is wasted in scientific circles by researchers who rediscover ideas that have already been published in the literature. This problem has been alleviated somewhat by the availability of recent academic work online. However, the kinds of text search systems in popular use today are poor at handling vocabulary mismatch, so a researcher must know the words used in relevant documents in order to find them. This makes serendipitous results unlikely.

We approach the problem of literature search by considering an unpublished manuscript as a query to a search system. With this approach, the entire text content of the paper can be used in the search process. We use the text of previous literature as well as the citation graph that connects it to find relevant related material. We evaluate our technique with manual and automatic evaluation methods, and find an order of magnitude improvement in mean average precision as compared to a text similarity baseline.

Low Latency Index Maintenance in Indri
Strohman, T., Croft, W. B. "Low Latency Index Maintenance in Indri," Proceedings of the SIGIR Workshop on Open Source Information Retrieval (OSIR), pp. 7-11, 2006. [pdf] [abstract] [slides]

There has been a resurgence of interest in index maintenance (or incremental indexing) in the academic community in the last three years. Most of this work focuses on how to build indexes as quickly as possible, given the need to run queries during the build process.

This work is based on a different set of assumptions than previous work. First, we focus on latency instead of throughput. We focus on reducing index latency (the amount of time between when a new document is available to be indexed and when it is available to be queried) and query latency (the amount of time that an incoming query must wait because of index processing). Additionally, we assume that users are unwilling to tune parameters to make the system more efficient.

We show how this set of assumptions has driven the development of the Indri index maintenance strategy, and describe the details of our implementation.

Custom Object Layout for Garbage Collected Languages
Novark, G., Strohman, T., Berger, E., "Custom Object Layout for Garbage Collected Languages," UMass Technical Report TR-06-06, 2006. [pdf] [abstract]
Modern architectures require data locality to achieve performance. However, garbage-collected languages like Java limit the ability of programmers to influence object locality, and so impose a significant performance penalty. We present custom object layout, an approach that allows programmers to control object layout in garbage-collected languages. Custom object layout cooperates with copying garbage collection. At collection time, the garbage collector invokes programmer-supplied methods that direct object placement. Custom object layout is particularly effective at improving the locality of classes with well-known traversal patterns, such as dictionary data structures. We show that using custom object layout can reduce cache misses by 50%-77% and thus improves the query performance of dictionary data structures by 20%.
Indri at TREC 2005: Terabyte Track (Notebook Version)
Metzler, D., Strohman, T., Zhou, Y., and Croft, W.B., "Indri at TREC 2005: Terabyte Track (Notebook Version)," in the TREC 2005 Notebook, 175-180, 2005. [pdf] [abstract]
This work details the experiments carried out using the Indri search engine during the TREC 2005 Terabyte Track. Results are presented for each of the three tasks, including eficiency, ad hoc, and named page finding. Our educiency runs focused on query optimization techniques, our ad hoc runs look at the importance of term proximity and document quality, and our named-page finding runs investigate the use of document priors and document structure.
UMass Robust 2005 Notebook: Using Mixture of Relevance Models for Query Expansion
Metzler, D., Diaz, F., Strohman, T., and Croft, W.B., "UMass Robust 2005 Notebook: Using Mixture of Relevance Models for Query Expansion," in the TREC 2005 Notebook, 839-840, 2005. [pdf] [abstract]
This paper describes the UMass TREC 2005 Robust Track experiments. We focus on approaches that use term proximity and pseudo-relevance feedback using external collections. Our results indicate both approaches are highly effective.
Dynamic Collections in Indri
Strohman, T. Dynamic Collections in Indri. CIIR Technical Report IR-426, 2005. [pdf] [abstract]
Text search engines have historically been designed for unchanging collections of documents. While this is fine for many applications, a growing number of important applications in news, finance, law and desktop search require indexes that can be efficiently updated. Previous research into supporting dynamic collections revolves around incremental methods. Incremental systems are optimized for adding large batches of documents to an existing index. These systems do not generally allow for queries to run while an incremental update is taking place. This work presents recent changes to the Indri search engine to support dynamic collections. Unlike previous incremental systems, Indri does not require large batch sizes to achieve efficient indexing performance. Indri is also designed to be as concurrent as possible, allowing queries to run while documents are added to the system.
Optimization Strategies for Complex Queries
Strohman, T., Turtle, H., Croft, W. B. "Optimization Strategies for Complex Queries" Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 219-225, 2005. [pdf] [abstract]
Previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. However, for complex queries involving field information and proximity between terms, the document-at-a-time scoring method has distinct advantages. We present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. Like max_score, our technique, called term bounded max_score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. We simulated our technique to explore the problem space, then implemented it in Indri, our large scale language modeling search engine. Tests with the GOV2 corpus on title queries show our method to be 23% faster than max_score alone, and 61% faster than our document-at-a-time baseline. Our optimized query times are competitive with conventional term-at-a-time systems on this year's TREC Terabyte task.
Indri: A language model-based search engine for complex queries
Strohman, T., Metzler, D., Turtle, H., Croft, W. B. "Indri: A language model-based search engine for complex queries." Proceedings of the International Conference on Intelligence Analysis, May 2-6, 2005, McLean, VA. (poster) [pdf] [extended pdf] [abstract]
Search engines are a critical tool for intelligence analysis. A number of innovations for search have been introduced since research with an em-phasis on analyst needs began in the TIPSTER project. For example, the Inquery search engine introduced support for specification of complex queries in a probabilistic inference network framework. Recent research on language model-ing has led to the development of Indri, a search engine that combines the best features of inference nets and language modeling in an architecture designed for large-scale applications. In this paper, we describe the Indri system and show how the query language is designed to support modern language technologies. We also present results demonstrating that Indri is both effective and efficient.
Indri at TREC 2004: Terabyte Track
Metzler, D., Strohman T., Turtle H., and Croft, W.B., "Indri at TREC 2004: Terabyte Track," in the Online Proceedings of 2004 Text REtrieval Conference (TREC 2004). [pdf] [abstract]
This paper provides an overview of experiments carried out at the TREC 2004 Terabyte Track using the Indri search engine. Indri is an efficient, effective distributed search engine. Like INQUERY, it is based on the inference network framework and supports structured queries, but unlike INQUERY, it uses language modeling probabilities within the network which allows for added flexibility. We describe our approaches to the Terabyte Track, all of which involved automatically constructing structured queries from the title portions of the TREC topics. Our methods use term proximity information and HTML document structure. In addition, a number of optimization procedures for efficient query processing are explained.
UMass at TREC 2004: Notebook
Abdul-Jaleel, N., Allan, J., Croft, W.B., Diaz, F., Larkey, L., Li, X., Metzler, D., Strohman, T., Turtle, H., and Wade, C., "UMass at TREC 2004: Notebook," to appear in Text REtrieval Conference (TREC 2004), Gaithersburg, Maryland, November 16-19, 2004. [pdf]

Lectures

I recently gave two lectures on Information Retrieval in Gerome Miklau's undergraduate database course. The slides for the Ranking Documents and Building Systems lectures are available here.

Information Pages

Other Info

My wife and I run a small software company called Ampersandbox, which makes software for Mac OS X. Before Ampersandbox, I wrote software for Veritas and a couple of game companies that didn't make it.

I play guitar and mandolin. Here's a sample of me playing through Barrel of Fun, arranged by Russ Barenberg, plus a sample of something original.

Evan has arrived!