Main.IRseminar08-0411 r1.11 - 11 May 2008 - 05:23 - JangwonSeo


IR seminar, April 11, 2008

Distributed IR, Jangwon Seo

Background:

Distributed IR (or federated search) is the task of finding relevant documents, or ranking databases, in an environment where the searchable databases are distributed across a network. Our goal is to understand its three sub-problems (resource description, resource selection, and results merging) and the scenarios that distributed IR should address: centralized vs. decentralized, and cooperative vs. uncooperative. We will also think about real-world situations for distributed IR.
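
To make the three sub-problems concrete, here is a minimal sketch in Python. It is not CORI or GlOSS: it assumes each database is described by plain term counts from a document sample, that selection uses a smoothed query log-likelihood, and that merging uses per-database min-max score normalization. The function names (`describe`, `select`, `merge`) and the smoothing weight `lam` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def describe(docs):
    """Resource description: term counts over a sample of one database."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def select(query, descriptions, k=2, lam=0.5):
    """Resource selection: rank databases by smoothed query log-likelihood.

    `descriptions` maps a database name to its term counts. `lam` mixes each
    database model with a background model built from all descriptions.
    """
    background = Counter()
    for counts in descriptions.values():
        background.update(counts)
    bg_total = sum(background.values())
    vocab = len(background)

    scores = {}
    for name, counts in descriptions.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            p_db = counts[term] / total if total else 0.0
            # Add-one smoothing keeps the background probability above zero.
            p_bg = (background[term] + 1) / (bg_total + vocab + 1)
            score += math.log((1 - lam) * p_db + lam * p_bg)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


def merge(result_lists):
    """Results merging: min-max normalize each list, then sort globally.

    `result_lists` maps a database name to a list of (doc_id, score) pairs
    whose raw scores are not comparable across databases.
    """
    merged = []
    for name, results in result_lists.items():
        if not results:
            continue
        raw = [score for _, score in results]
        lo, span = min(raw), max(raw) - min(raw)
        for doc_id, score in results:
            norm = (score - lo) / span if span else 1.0
            merged.append((name, doc_id, norm))
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged


if __name__ == "__main__":
    descriptions = {
        "medical": describe(["apple a day keeps the doctor away",
                             "health benefits of fruit apple"]),
        "computer": describe(["apple laptop review",
                              "apple computer keyboard apple"]),
    }
    print(select("apple health", descriptions, k=1))
    print(merge({"a": [("d1", 10.0), ("d2", 5.0)],
                 "b": [("d3", 0.9), ("d4", 0.1)]}))
```

In an uncooperative setting the sample passed to `describe` would have to come from query-based sampling rather than from the database itself; the sketch leaves that step out.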

Required papers:

The CORI and GlOSS algorithms introduced in the following two papers are 13 years old, but they are still considered baselines for distributed IR. Many of the basic ideas for solving distributed IR problems appear there. Although distributed IR research is still largely confined to testbeds, some papers are based on real-world experience. The following is one of them.

Recommended:

The following recent research papers relax some of the strict assumptions made in traditional distributed IR.

If you are interested:

Related Resources

The following workshops are good sources of further information about distributed IR.

Your Questions

Please add your questions and thoughts about this topic directly to the wiki (or email them to me) by Friday 1pm at the latest. When you save your edit, please select "Release edit lock" so that other people can edit the page immediately.

  • Distributed IR seems to somewhat resemble IR from clusters. What are the differences? Clusters contain homogeneous data, whereas I assume DBs are rather heterogeneous. Could an algorithm be designed that gathers data to be updated and makes occasional updates to the distributed DBs to make them more homogeneous? This would happen by rearranging documents across the DBs and adding new documents to the "right" DBs. Could this, if minimally employed, help the resource selection and results merging phases? (Elif)

  • In the research presented, resource ranking is based on the fit of the resource (DB) to the query. The fit discussed uses only textual features of the DB and assumes a uniform prior distribution over the DBs. It would be interesting to examine how effective the non-textual features used for document ranking in large-scale systems would be for resource ranking: click-through data, link analysis, DB quality, size, personalization bias, etc. How could this be done? (Michael)

  • Judging by the degree of decentralization, federated search lies between centralized search and P2P search. Today, a centralized search system like Google can successfully handle large-scale search by using a carefully designed, well-organized distributed index and search system, so in many cases is federated search not really needed? Federated search suits scenarios where sites do not want to share their data. Thus, I doubt that Elif's suggestion of rearranging documents across DBs could happen. I think deploying a search engine for each DB owner is more feasible: a federated search system like FedLemur could then work more efficiently through a standard API instead of a Perl wrapper, and the deployed search engine could use very effective search methods, reducing the harm done by ineffective individual search algorithms. My question is whether we really need to merge results from very different DBs, or whether we should let the user choose the DB to search: if a user is interested only in the impact of "apple" on people's health, and only in the medical database, it is not very useful that a computer DB contains many occurrences of the term "apple". (Xing)

  • In the conclusion of "Distributed information retrieval," Jamie Callan mentions the potential of applying techniques such as relevance feedback to federated search. If we were to integrate such techniques into federated search, how would we go about doing so? For example, would it be reasonable to fetch documents, merge them, do pseudo-relevance feedback on the final results list, and then re-fetch and re-merge documents based on the expanded query (this seems like it would be the most accurate, but could be very costly)? Or could we rely on the centralized sample document database for the pseudo-relevance feedback and then apply the expanded query to the distributed databases? -- HenryFeild

  • In an environment where a FIR agent takes over the role of a search engine (i.e. search engines compete to just get chosen as the "best" source of data), how will that affect the adversarial landscape as we know it today? Currently sites try to boost their rankings through non-content-based means - should we expect whole collections to pop up that contain nothing but banner ad information and other spam? At what level would spam be handled? You can't regulate what kind of anti-adversarial measures are in place on search engines that you have no control over, but how well can you identify and filter out spam from little more than a resource description and/or some query-based samples? -- MarcCartright

  • @Elif: I think exchanging information at the document level may be too expensive. Still, I thought a unigram LM was too simple as a representation of collection content. Maybe the local clustering suggested in [Liu&Croft] would be better, as it enables a multiple-topic representation for each cluster. @Xing: I think one of the values of FIR is that it allows each server to use a specialized ranking algorithm, although a centralized algorithm can simulate this. Also, it seems to me that letting users choose resources (servers) is possible only when users have a good idea of the individual resources, which is not true in many cases. Personalization based on query context could be a solution for resource selection. My question is how we can do this context-sensitive FIR. If we can model query context as an LM, this won't be much different from pseudo-relevance feedback. - JinyoungKim
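
As a rough illustration of the second option in Henry's question above (running pseudo-relevance feedback against the centralized sample and then sending the expanded query to the distributed databases), the following Python sketch may help. It assumes a toy ranking by query-term overlap and expansion by raw term frequency, with no stopword removal or term weighting; the function name `expand_query` and its parameters are hypothetical.

```python
from collections import Counter


def expand_query(query, sample_docs, top_docs=2, extra_terms=2):
    """Expand a query using the top-ranked documents of a central sample.

    Documents are ranked by how many of their tokens are query terms. The
    most frequent non-query terms in the top documents are appended.
    """
    q_terms = query.lower().split()
    q_set = set(q_terms)

    def overlap(doc):
        return sum(1 for term in doc.lower().split() if term in q_set)

    ranked = sorted(sample_docs, key=overlap, reverse=True)

    feedback = Counter()
    for doc in ranked[:top_docs]:
        if overlap(doc) == 0:
            continue  # skip documents that do not match the query at all
        feedback.update(t for t in doc.lower().split() if t not in q_set)

    return q_terms + [term for term, _ in feedback.most_common(extra_terms)]


if __name__ == "__main__":
    sample = [
        "apple nutrition fiber",
        "apple nutrition vitamin nutrition",
        "laptop keyboard review",
    ]
    # The expanded query would then be sent to each selected database.
    print(expand_query("apple", sample, top_docs=2, extra_terms=1))
```

Only the expansion step touches the sample; the cost of fetching and merging from the remote databases is paid once, which is the trade-off the question raises against the fetch, merge, expand, re-fetch alternative.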



Attachment: DistributedIRseminar.pdf (339.3 K, 11 May 2008 - 05:22, JangwonSeo): Slides for Lab meeting


Copyright © 1999-2008 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.