Research Page
Contents By Year
-
2011
- Book retrieval (Summer 2011)
- CrowdLogging: Distributed, private, and anonymous Search Logging (Spring 2010–Summer 2011)
- CrowdLogging Distributed, private, and anonymous Search Logging (Spring 2010–Summer 2011)
- Investigating Searcher Frustration (Spring 2009–Spring 2010)
- Investigating Searcher Frustration (Spring 2009–Spring 2010)
- Popularity Ranking using Query Logs (Spring–Summer)
- Detecting Task Boundaries in Domain-Specific Query Logs (Fall)
- Spell Correction using Domain-Specific Query Logs (Fall)
- Query Log Mining (Summer)
- Retrieving Document "Hot Spots" (Fall '07–Spring '08)
- Testing of Text Retrieval Test Collection Generation Algorithms (Summer)
- Qauality Assessment using Language Processing (QALP) (Fall '04–Spring '07)
- Slicing Research (Summer)
Book Retrieval
Summer 2011-
There's not too much to say about this just yet. I am working with James Allan and Marc Cartright on book retrieval methods to submit to the INEX book retrieval track. More on this later...
(Top of page)
CrowdLogging: Distributed, private, and anonymous Search Logging
Spring 2010–Spring 2011-
To help out with the study associated with this project, see this ad or visit the study homepage.
This joint work with James Allan and Joshua Glatt focuses on a system for collecting search logs in a distributed manner—storing the data on users' computers rather than in a central database. This gives each user full control over his or her search data. Mining jobs, or experiments as we call them, can be run on each user's computer over their data. Experiments produce search artifacts—pieces of information, which can be something like a query or a query pair. These artifacts are then encrypted and uploaded to a server via a bank of anonymizers. However, they encryption is special; it uses a secret sharing scheme, which only allows a particular artifact to be decrypted by the server if it has been uploaded by at least k different users. Using a secret sharing scheme is not new for search logging; Eytan Adar suggested using such a scheme in this paper. On the server side, we can aggregate the encrypted artifacts and decrypt the ones that have sufficient support (that is, the artifact has been uploaded by at least k distinct users) and then get counts for each unique artifact.
You can see a demo of a Javascript implementation of Shamir's Secret Sharing Scheme in action here. You can download it here.
Details about this system can be found in our SIGIR 2011 paper, CrowdLogging: Distributed, private, and anonymous search logging. We have implemented CrowdLogging into a system called CrowdLogger. You should take a look at the CrowdLogger website and consider participating in the study..
More on this work soon...
(Top of page)
Investigating Searcher Frustration
Spring 2009–Spring 2010-
One of the things I am interested in is understanding how information
retrieval frustration (e.g., frustration during a Web search) can be modeled and
how those models can be used to facilitate adaptive search systems that reduce
user frustration.
Many well-performing retrieval algorithms have developed over the
years (e.g., explicit relevance feedback, query expansion and reduction,
dependency modeling). However, many are not used because of the necessity of
user interaction and/or high cost, both of which are usually
unacceptable in a live search system. One hypothesis is that frustrated
users—i.e., searchers who are having difficulty finding the information they are seeking—are more apt to use
these alternative retrieval models. We then ask: when should what model be
presented to the user and how?
My masters thesis explored the feasibility of predicting frustration during a search session. To do this, we conducted two user studies in which participants were asked to search for predefined information needs, or tasks, and we used three physical sensors to track facial expressions, pressure exerted on the mouse, and sitting position. In addition, we constructed a Firefox toolbar to track participants' interactions with the browser, including mouse movements, document views, and feedback about pages visited and the queries submitted to four major search engines. We also describe this work in this SIGIR 2010 paper. The data for this work can be found here.
At some point, I want to put some more analysis of the 2009 study here, so stay tuned.
(Top of page)
Popularity Ranking using Query Logs
Spring–Summer 2009-
This is joint work with Bob Armstrong,
James Allan, and
Jeff Dalton. We are investigating
learning-to-rank techniques for use with the medical publication search company
that provides us with query logs.
The high-level idea is that we would like to incorporate a document's click-though rate
for a specific query into where that document is ranked in the results list. Ideally,
the most popular documents will be placed closer to the top of the list.
(Top of page)
Detecting Task Boundaries and Tasks in Query Logs
Fall 2008-
As part of an independent study, I implemented the task boundary detection work
described by (Jones and Klinkner, 2007) for use on the medical publication search data
used in my previous projects (see spelling correction
and query log mining). Jones and Klinkner's work
uses logistic regression to learn 1) when any two queries from the same search session are
part of the same task and 2) when there is a boundary between two consecutive queries.
Our study focused just on the first one.
In the context of this work, a task can be defined as a information goal or as an information mission. Goals contain one or more queries from the same search session, while missions contain one or more goals. However, a goal is not allowed to be in more than one mission.
In our study, we asked graduate students to label pairs of queries as belonging to the same goal and mission. We then used a maximum entropy classifier and ten-fold cross validation to learn the labels. After normalizing the features, we attained classification accuracies comparable to those used by Jones and Klinkner on Yahoo! data.
Going beyond previous work, we also used clustering techniques to create goal and mission groups using the pair-wise classifications from above. We then attempted to use these to detect when the user was frustrated (this makes the unfounded assumption that frustration is contained within a task).
This was interesting work, but a little ahead of its time with respect to the frustration detection. I plan to revisit this work in the future, once I've investigated more closely what searcher frustration is and how it can be automatically detected.
References
-
R. Jones and K. L. Klinkner. Beyond the
session timeout: automatic hierarchical segmentation
of search topics in query logs. Proceedings of CIKM
2008, 2008.
(Top of page)
Spell Correction using Medical Domain Query Logs
Fall 2008-
Building off of the exploration from the query log
mining work, Xing Yi and I
created a novel spelling correction algorithm that uses same-session reformulation
information—similar to (Jones et a., 2006)—as well as language models—similar
to (Cucerzan and Brill, 2004)—and trusted domain-specific lexicons.
References
-
S. Cucerzan and E. Brill. Spelling correction as an iterative
process that exploits the collective knowledge of web users.
In EMNLP, pages 293-300, 2004.
R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In WWW, pages 387-396, 2006.
(Top of page)
Query Log Mining
Summer 2008-
This line of work was more just implementation and data exploration. The implementation part
had to do with writing code to use with TupleFlow, a kind of MapReduce framework developed
by Trevor Strohman (now at Google).
Marc Cartright,
Elif Aktolga,
and I wrote a Java package that uses TupleFlow to mine query logs.
The query logs were from a medical document search company. We explored query reformulation,
click-through data, and query similarity.
(Top of page)
Retrieving Document "Hot Spots" (a.k.a. Passage Retrieval)
Fall 2007 - Spring 2008-
O. de Krester and A. Moffat. "Effective Presentation with a Locality-Based
Similarity Heuristic". 22nd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, San Francisco, August
1999, 113-120.
[web]
Between Fall 2007 and Spring 2008, I worked towards developing a search algorithm to find the `hot spots' within a document relative to a given query.
The goal of this research is to return to a user a ranked list of text passages extracted from documents in a corpus. The passages are ranked according to the retrieval system's belief that a passage is relevant to the user's information need.
To do this, I created an algorithm to calculate the Query-Likelihood score for each document and then distributed that score across the document wherever a query term occurred. As can be seen below, the Query-Likelihood language model calculates the probability of a query, Q (which is made of of one or more terms), given a document, D. P(Q|D) is the product of the probability of each term, q, in the query given the document. This is done using a maximum likelihood estimation (MLE) of term frequencies in the document smoothed with a background model of term frequencies in the collection.
Now, think of a document as a giant array where the indices refer to term positions within the document and the value is a score associated with that position. My method computes each P(q|D) and then assigns a fair portion of that probability to each position in the document array where q occurs. So, if P(q|D) = .6 and q occurs 2 times, then both indices in the array that refer to the positions where q occurs in the document get a score of .3.
Once these scores are distributed, they are smoothed by placing a Gaussian kernel on top of each position in the array that has a non-zero score. All interference between kernel smoothings is additive, so that if two query terms are near by, once they are smoothed out, there should be several indices between the two that have a higher score than if only one of the query terms was present.
After smoothing, each group of contiguous indices with a score above a threshold are considered to be a 'hot spot'. I experimented with various automatic thresholding methods, including local (the threshold used for each document is different) and global (the threshold is constant over all documents for a particular query) methods. These generally involved setting the threshold to some percentage of the max score in the document or for all documents.
Initial results using the Text REtrieval Conference (TREC) Genomics 2006 corpus were not very promising. Because of these poor results and some higher priority projects, my work with locating document hot spots has been shelved for the time being.
References
(Top of page)
Testing of Text Retrieval Test Collection Generation Algorithms
Summer 2007During the Summer of 2007, I was a part of the the Summer Undergraduate Research Fellowship (SURF) program at the National Institute of Standards and Technology (NIST). I spent my time working for the Text REtrieval Conference (TREC) group testing the effectiveness of the Minimal Test Collection algorithm developed at the University of Massachusetts Amherst by Carterette, Allan, & Sitaraman. This was one of the two algorithms used to evaluate retrieval systems in the TREC Million Query Track in 2007 and 2008.
(Top of page)
Quality Assessment using Language Processing (QALP)
2004 - 2007The QALP team consists of:
Dr. David Binkley
Dr. Dawn Lawrie
Henry Feild
Dr. Chris Morrell (Statistical Consulting)
QALP Scores I
Winter-Spring 2005
Running off the idea that good code is accompanied by good comments, the QALP group put together a program to find the cosine similarity between code within a function and its comments. This similarity, referred to as a QALP Score, needed to be correlated to a quality rating for each function. To find a quality rating, a sample of functions were assembled into a survey, which was then given to undergraduate students. The functions in the survey were stripped of all comments and the students were asked to asses the each function's quality using a variety of criteria.
Identifier Splitting
Summer 2005 - Summer 2006
While many identifiers make use of some sort of division marker (e.g. an underscore or camelCasing) to break up concepts within an identifier, many do not, or at least not completely. The QALP group's favorite example of this is thenewestone. This particular identifier should be the_newest_one. Imagine having to analyze source code where identifiers all looked like that and you would have to mentally 'split' them. Wouldn't it be easier if there was a program that would split them for you? We thought so, so we looked into two methods—a 'greedy' algorithm and a neural network. Turns out the greedy algorithm works better.
Sources
Identifier Makeup Study
The QALP group was curious to see what type of identifiers were mostly easily understood by programmers. Consider variations on a variable that references the index of an array; you might see i, idx, or index. A survey of 12 functions was assembled and put up on the net which asked those who took it to explain the purpose of each function. For each of the 12 functions, there were three versions—one with single-character abbreviated variables, one with multi-character abbreviated variables, and one with full, non-abbreviated variables. The user was randomly given one of the three versions for each function.
QALP Scores II
Summer 2006
In an attempt to refine the study conducted in the Winter and Spring of 2005, this version upgraded the QALP Scoring process, scoring source code on a per-class basis.
In addition, instead of using a subjective measure of quality with which to correlate the QALP Score, this time defect rates were used.
(Top of page)
Slicing Research
Summer 2004My first summer assisting with research consisted of me finding source code for use with Dr. Binkley's Amorphous Program Slicer. I spent my time familiarizing myself with Linux, running the program slicer on open source programs, and then graphing results.
(Top of page)