Understanding the user's intent or information need that underlies a query has long been recognized as a crucial part of effective information retrieval. Despite this, retrieval models, in general, have not focused on explicitly representing intent, and query processing has been limited to simple transformations such as stemming or spelling correction. With the recent availability of large amounts of data about user behavior and queries in web search logs, there has been an upsurge in interest in new approaches to query understanding and representing intent.
This workshop has the goal of bringing together the different strands of research on query understanding, increasing the dialogue between researchers working in this relatively new area, and developing some common themes and directions, including definitions of tasks and evaluation methodology. We hope the workshop will bring together researchers from IR, ML, NLP, and other areas of computer and information science who are working on or interested in this area, and provide a forum for them to identify the issues and the challenges, to share their latest research results, to express a diverse range of opinions about this topic, and to discuss future directions.
The workshop program includes three main sessions: invited talks, a poster session, and a panel discussion. Ten short invited talks by both academic and industrial researchers will give the participants a sense of the different aspects of query understanding and of the current state-of-the-art results in this research area. In the poster session, eight accepted papers will be presented in the form of an "elevator pitch" plus a printed poster. The panel discussion will focus on issues related to query representation and understanding research, including a rigorous definition of the task, modeling for the task, challenges and opportunities, implications for IR, and future research directions.
Abstract: A query is often treated as a bag of words by search engines, but when people formulate queries, they use "concepts" as building blocks. Can we automatically segment the query to recover the concepts? How can we use the identified concepts to improve search relevance? In this talk, I will present some techniques for query concept identification and show how we can use them to improve search relevance through query rewriting and machine-learned ranking.
Abstract: A single user query represents a user information need. A short sequence of queries may represent a reformulation sequence as the user attempts to match the expression of that information need to the available documents. Longer sequences may give us patterns of a user's interests, as well as clues about who the user is, and how he or she is feeling. In this talk I show examples of the evidence that a user's personal qualities and emotional state can emerge from longer and longer sequences of queries and click behavior. I will also give some open problems where greater understanding of user intent can provide opportunities in online computational advertising.
Abstract: A query is inherently associated with rich context information, including, e.g., the user who posed the current query, the other queries entered by this user in the current retrieval session, and any documents viewed or skipped by the user in both the current session and past sessions. All these context variables provide important clues about a user's intent behind a query and thus should be exploited in order to understand query intent as accurately as possible. In this talk, I will present a general Bayesian decision-theoretic framework for incorporating all kinds of context information to model a user's information need dynamically as the user interacts with a retrieval system and exploiting dynamic user modeling to optimize retrieval results in an interactive retrieval system. I will also discuss how the framework has been used in the UCAIR project to naturally support statistical language models for query representation and achieve personalized search without requiring any user effort.
Abstract: Terms in a query can be strongly dependent. A number of previous studies have shown the benefits of taking into account the dependencies between query terms. Typically, a dependency model is defined and interpolated with the traditional bag-of-words model. Such an approach can take advantage of both the more robust bag-of-words model and the more precise dependency model. However, we notice that most previous approaches assign a fixed weight to each of the models in the interpolation, regardless of the dependencies in the given query. In reality, the strength of a dependency and its utility for IR vary from one pair of words to another and from one query to another. A uniform interpolation cannot correctly account for the variable strength and utility of term dependencies in queries. In this talk, we will describe a new dependency model in which the dependency between a pair of words is taken into account in document retrieval according to its strength and utility. The more useful a dependency is, the higher the importance assigned to it. An SVM is used to learn the expected utility of a dependency based on a set of features. We tested this model on several collections from TREC and NTCIR, in both English and Chinese. Our results showed that the new model outperforms the existing ones on almost all the collections. This demonstrates the need to integrate term dependencies in a variable manner, according to their utility for IR.
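To make the contrast in this abstract concrete, here is a minimal Python sketch of fixed versus per-dependency interpolation. It is an illustration only and not the speakers' model: the function names, the additive scoring form, and the hand-set utility values are all assumptions, and the utilities simply stand in for what the abstract says is learned by an SVM from features.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # an ordered pair of query terms


def fixed_interpolation(bow_score: float,
                        dep_scores: Dict[Pair, float],
                        lam: float) -> float:
    """Combine scores using one fixed weight for every dependency."""
    return bow_score + lam * sum(dep_scores.values())


def variable_interpolation(bow_score: float,
                           dep_scores: Dict[Pair, float],
                           utility: Dict[Pair, float]) -> float:
    """Combine scores using a separate weight for each dependency.

    `utility` is a hypothetical lookup of per-pair weights; pairs
    missing from it contribute nothing.
    """
    return bow_score + sum(utility.get(pair, 0.0) * score
                           for pair, score in dep_scores.items())


# Toy numbers for one document and the query "new york hotels".
bow = 1.0
deps = {("new", "york"): 0.8, ("york", "hotels"): 0.4}

# Assumed utilities: "new york" is a strong, useful dependency,
# "york hotels" is a weak one.
util = {("new", "york"): 0.9, ("york", "hotels"): 0.1}

fixed = fixed_interpolation(bow, deps, lam=0.5)
variable = variable_interpolation(bow, deps, util)
print(f"fixed={fixed:.2f} variable={variable:.2f}")
```

With the fixed weight, both pairs influence the score equally per unit of dependency evidence; with per-pair weights, the evidence for "new york" dominates and the weak pair is nearly ignored.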
Abstract: Information retrieval systems, especially portal web search engines, use a variety of query analysis techniques to detect user intent. While new intent classes are introduced every year, there has been little work comparing the relative importance of performance on different intent classes. In this talk, I will discuss the relative importance of intent classes in order to motivate severity-based system design.
Abstract: Search engines' understanding of user queries has advanced greatly over the last decade. Yet expressing a searcher's information needs still relies primarily on guessing the "right" search keywords, often requiring multiple rounds of trial and error from the searcher. At the same time, searcher interaction data is becoming increasingly available, on both the server and the client side. Extracting meaningful signals from this data would enable a search engine to accurately infer user intent for tasks such as real-time result reranking, dynamic result presentation, and contextualized query suggestion. This talk gives an overview of our recent progress on modeling and exploiting client-side searcher interaction data for intent inference.
Abstract: Entity lists are vital for semantically analyzing web queries. In this talk, we propose a general information extraction framework, showing large gains in entity extraction by combining state-of-the-art distributional and pattern-based extractors with a large set of features from a 600 million document webcrawl, one year of query logs, and a snapshot of Wikipedia. We explore the hypothesis that although distributional and pattern-based algorithms are complementary, they do not exhaust the semantic space; other sources of evidence can be leveraged to better combine them. A detailed analysis of feature correlations and interactions shows that query log and webcrawl features yield the highest gains, but easily accessible Wikipedia features also improve over current state-of-the-art systems. We further study the impact of editor-chosen seeds on extraction performance. We show that, in general, few seeds are needed to saturate a distributional model, and that performance is very sensitive to seed composition, resulting in tremendous variance in expansion performance. We study the latter further, show that untrained editors are terrible at choosing the right seeds, and propose an algorithm for helping editors choose better seeds.
Abstract: Most of the query understanding and representation research done today is either very general ("one size fits all") or highly specialized. The "one size fits all" approaches are general but tend to fail for certain classes of queries. At the other end of the spectrum are the specialized approaches that often require a great deal of domain knowledge and search expertise. In this talk, I will discuss the pros and cons of these two competing approaches and propose a challenge to develop a robust, fully automatic approach to specialized query understanding.
Abstract: Traditionally, queries in information retrieval applications are represented as bags-of-words, and query terms are assumed to be independent. In this talk, we formulate a retrieval framework that represents queries as structures. We demonstrate that such a formulation allows us to relax the independence assumption made in previous work and to create richer and more realistic query representations. Finally, we show how the structural query representation can serve as a basis for both existing and novel retrieval models.
Abstract: Bags of words have been regarded as a basis of information retrieval for years. However, words are often too simple to convey clear semantic meanings, and this becomes an important cause of mismatch problems. In this talk, we introduce a different way of looking at the problem of query representation. Query understanding can be conducted at different levels or granularities of semantics, i.e., word level, sense level, topic level, and structure level. The outputs of query understanding can be attached to queries as enriched representations and help to answer queries of different levels of difficulty. We will also talk about some published work that can be considered as small steps in this direction.
We solicit short position and research papers that would be presented as posters during the workshop. Relevant topics include, but are not limited to:
We solicit research papers, position papers or papers that describe research in progress to be presented as posters. Submitted papers should be in the ACM Conference style (for LaTeX, use the "Option 2" style) and not exceed 4 pages in 9 point font. Papers must be submitted in PDF electronically via the submission page (https://cmt.research.microsoft.com/QRU2010/). Submissions of papers should not substantially duplicate work that any of the authors have published elsewhere or have submitted in parallel to any other conferences or journals. All submissions must be in English and will be reviewed by at least three members of the program committee. At least one author of each accepted paper will be expected to attend and prepare a poster as well as a short presentation at the workshop.
Deadlines for workshop poster submissions are (note the extension of the submission deadline):