<<O>>  Difference Topic IRseminar08-0404 (r1.16 - 05 Apr 2008 - MichaelBendersky)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008 (this week at 9:30am in Room 142)

Topic Evaluation, XingYi and XiaobingXue leading

Line: 32 to 32

Your Questions

Please add your questions and thoughts about this topic directly to the wiki (or email them to us) by Friday 6am at the latest. When you save your edits, please select "Release edit lock" so that other people can edit the page immediately. :-)
Changed:
<
<
  • In Buttcher et al., when an SVM classifier is applied to predict relevance, it is interesting to note that precision is quite high (~75%) while recall is low (~35%). Since the classifier is based on textual similarity, this seems to support the "cluster hypothesis" to some extent. This suggests the following procedure. First, we could use this text-based classifier to rank the documents in the collection by their probability of belonging to the "relevant" class. Then, we could use this ranking to select more documents to judge (according to the results, a high percentage of these documents will be relevant). We could iterate this process, potentially accumulating a fuller sample of the "relevant" class after each step, which could improve recall as more and more relevant documents are discovered.
>
>
  • In Buttcher et al., when an SVM classifier is applied to predict relevance, it is interesting to note that precision is quite high (~75%) while recall is low (~35%). Since the classifier is based on textual similarity, this seems to support the "cluster hypothesis" to some extent. This suggests the following procedure. First, we could use this text-based classifier to rank the documents in the collection by their probability of belonging to the "relevant" class. Then, we could use this ranking to select more documents to judge (according to the results, a high percentage of these documents will be relevant). We could iterate this process, potentially accumulating a fuller sample of the "relevant" class after each step, which could improve recall as more and more relevant documents are discovered. -- Michael

  • The "Minimal Test Collections" and "Reliable Information Retrieval Evaluation" papers each present a method for assigning relevance judgments to unjudged documents. However, this makes the resulting qrels unique, which makes experiments difficult to reproduce. How could these evaluation techniques be integrated into the current IR research framework (e.g. should we have multiple versions of qrels floating around for each corpus, etc.)? -- HenryFeild
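
The iterative judging idea raised in the first question above can be sketched roughly as follows. This is only an illustration under stated assumptions: it swaps the SVM from Buttcher et al. for a simple centroid/cosine-similarity ranker so that it needs no external libraries, and the names `iterative_judging` and `oracle`, as well as the toy documents, are hypothetical and not taken from any of the papers.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of term counts over the documents judged relevant so far."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def iterative_judging(docs, seed_relevant, oracle, rounds=3, batch=1):
    """Rank unjudged docs by similarity to the judged-relevant set,
    send the top `batch` to the assessor (`oracle`), and repeat.

    `oracle` stands in for a human assessor: it returns True if a
    document id is relevant. This is an assumption of the sketch.
    """
    vecs = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    judged_rel = set(seed_relevant)
    judged_nonrel = set()
    for _ in range(rounds):
        unjudged = [d for d in docs
                    if d not in judged_rel and d not in judged_nonrel]
        if not unjudged:
            break
        proto = centroid(vecs[d] for d in judged_rel)
        # Highest similarity first; ties broken by document id.
        ranked = sorted(unjudged, key=lambda d: (-cosine(vecs[d], proto), d))
        for d in ranked[:batch]:
            (judged_rel if oracle(d) else judged_nonrel).add(d)
    return judged_rel, judged_nonrel


if __name__ == "__main__":
    docs = {
        "d1": "svm classifier relevance judgments",
        "d2": "relevance judgments pooling trec",
        "d3": "trec pooling evaluation measures",
        "d4": "cooking pasta recipe",
        "d5": "football match results",
    }
    truth = {"d1", "d2", "d3"}
    rel, nonrel = iterative_judging(docs, {"d1"}, lambda d: d in truth)
    print(sorted(rel), sorted(nonrel))
```

In this toy run, d3 shares no terms with the seed document d1 and is only ranked highly after d2 has been judged relevant, which is the kind of recall gain the question speculates about. Whether this holds on a real collection would have to be tested.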
 <<O>>  Difference Topic IRseminar08-0404 (r1.15 - 04 Apr 2008 - ElifAktolga)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008 (this week at 9:30am in Room 142)

Topic Evaluation, XingYi and XiaobingXue leading

Line: 39 to 39

  • I wonder how much domain knowledge can be put to use in this task? Is it easier to determine relevance/nonrelevance when you have a much narrower scope of topicality, or do you start splitting hairs? I would like to see what the effects are of using some of the automatic judging techniques (particularly the ones from "Reliable Information Retrieval Evaluation") when the KL divergence will most likely be dampened due to narrower topicality. I suspect SVMs would perform better here. -- MarcCartright

  • How can this traditional evaluation method be extended to evaluate personalized (or contextual) retrieval, where the notion of relevance may depend on the user (or context)? While some current research uses click-through data on a mixed ranked list from the baseline and the improved method, would TREC-style evaluation be possible? One way seems to be to provide a textual representation of the user (or context) so that each participant can make the best use of it, and to have the relevance judgments created by the user (or by someone who understands the context). -- JinyoungKim
Added:
>
>

  • The papers discuss evaluation issues for document retrieval. What about other tasks, such as passage retrieval? To what extent can judgments for passages be generated from document-level judgments for the documents that contain these passages? -- Elif

 <<O>>  Difference Topic IRseminar08-0404 (r1.14 - 04 Apr 2008 - JinyoungKim)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008 (this week at 9:30am in Room 142)

Topic Evaluation, XingYi and XiaobingXue leading

Line: 37 to 37

  • The "Minimal Test Collections" and "Reliable Information Retrieval Evaluation" papers each present a method for assigning relevance judgments to unjudged documents. However, this makes the resulting qrels unique, which makes experiments difficult to reproduce. How could these evaluation techniques be integrated into the current IR research framework (e.g. should we have multiple versions of qrels floating around for each corpus, etc.)? -- HenryFeild

  • I wonder how much domain knowledge can be put to use in this task? Is it easier to determine relevance/nonrelevance when you have a much narrower scope of topicality, or do you start splitting hairs? I would like to see what the effects are of using some of the automatic judging techniques (particularly the ones from "Reliable Information Retrieval Evaluation") when the KL divergence will most likely be dampened due to narrower topicality. I suspect SVMs would perform better here. -- MarcCartright
Added:
>
>

  • How can this traditional evaluation method be extended to evaluate personalized (or contextual) retrieval, where the notion of relevance may depend on the user (or context)? While some current research uses click-through data on a mixed ranked list from the baseline and the improved method, would TREC-style evaluation be possible? One way seems to be to provide a textual representation of the user (or context) so that each participant can make the best use of it, and to have the relevance judgments created by the user (or by someone who understands the context). -- JinyoungKim

 <<O>>  Difference Topic IRseminar08-0404 (r1.13 - 04 Apr 2008 - MarcCartright)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008 (this week at 9:30am in Room 142)

Topic Evaluation, XingYi and XiaobingXue leading

Line: 35 to 35

  • In Buttcher et al., when an SVM classifier is applied to predict relevance, it is interesting to note that precision is quite high (~75%) while recall is low (~35%). Since the classifier is based on textual similarity, this seems to support the "cluster hypothesis" to some extent. This suggests the following procedure. First, we could use this text-based classifier to rank the documents in the collection by their probability of belonging to the "relevant" class. Then, we could use this ranking to select more documents to judge (according to the results, a high percentage of these documents will be relevant). We could iterate this process, potentially accumulating a fuller sample of the "relevant" class after each step, which could improve recall as more and more relevant documents are discovered.

  • The "Minimal Test Collections" and "Reliable Information Retrieval Evaluation" papers each present a method for assigning relevance judgments to unjudged documents. However, this makes the resulting qrels unique, which makes experiments difficult to reproduce. How could these evaluation techniques be integrated into the current IR research framework (e.g. should we have multiple versions of qrels floating around for each corpus, etc.)? -- HenryFeild
Added:
>
>

  • I wonder how much domain knowledge can be put to use in this task? Is it easier to determine relevance/nonrelevance when you have a much narrower scope of topicality, or do you start splitting hairs? I would like to see what the effects are of using some of the automatic judging techniques (particularly the ones from "Reliable Information Retrieval Evaluation") when the KL divergence will most likely be dampened due to narrower topicality. I suspect SVMs would perform better here. -- MarcCartright

 <<O>>  Difference Topic IRseminar08-0404 (r1.12 - 03 Apr 2008 - HenryFeild)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008 (this week at 9:30am in Room 142)

Topic Evaluation, XingYi and XiaobingXue leading

Line: 32 to 32

Your Questions

Please add your questions and thoughts about this topic directly to the wiki (or email them to us) by Friday 6am at the latest. When you save your edits, please select "Release edit lock" so that other people can edit the page immediately. :-)
Changed:
<
<
  • In Buttcher et al., when an SVM classifier is applied to predict relevance, it is interesting to note that precision is quite high (~75%) while recall is low (~35%). Since the classifier is based on textual similarity, this seems to support the "cluster hypothesis" to some extent. This suggests the following procedure. First, we could use this text-based classifier to rank the documents in the collection by their probability of belonging to the "relevant" class. Then, we could use this ranking to select more documents to judge (according to the results, a high percentage of these documents will be relevant). We could iterate this process, potentially accumulating a fuller sample of the "relevant" class after each step, which could improve recall as more and more relevant documents are discovered. [Michael]
>
>
  • In Buttcher et al., when an SVM classifier is applied to predict relevance, it is interesting to note that precision is quite high (~75%) while recall is low (~35%). Since the classifier is based on textual similarity, this seems to support the "cluster hypothesis" to some extent. This suggests the following procedure. First, we could use this text-based classifier to rank the documents in the collection by their probability of belonging to the "relevant" class. Then, we could use this ranking to select more documents to judge (according to the results, a high percentage of these documents will be relevant). We could iterate this process, potentially accumulating a fuller sample of the "relevant" class after each step, which could improve recall as more and more relevant documents are discovered.

  • The "Minimal Test Collections" and "Reliable Information Retrieval Evaluation" papers each present a method for assigning relevance judgments to unjudged documents. However, this makes the resulting qrels unique, which makes experiments difficult to reproduce. How could these evaluation techniques be integrated into the current IR research framework (e.g. should we have multiple versions of qrels floating around for each corpus, etc.)? -- HenryFeild

 <<O>>  Difference Topic IRseminar08-0404 (r1.11 - 03 Apr 2008 - MichaelBendersky)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008 (this week at 9:30am in Room 142)

Topic Evaluation, XingYi and XiaobingXue leading

Line: 31 to 31

Your Questions

Please add your questions and thoughts about this topic directly to the wiki (or email them to us) by Friday 6am at the latest. When you save your edits, please select "Release edit lock" so that other people can edit the page immediately. :-)
Added:
>
>

  • In Buttcher et al., when an SVM classifier is applied to predict relevance, it is interesting to note that precision is quite high (~75%) while recall is low (~35%). Since the classifier is based on textual similarity, this seems to support the "cluster hypothesis" to some extent. This suggests the following procedure. First, we could use this text-based classifier to rank the documents in the collection by their probability of belonging to the "relevant" class. Then, we could use this ranking to select more documents to judge (according to the results, a high percentage of these documents will be relevant). We could iterate this process, potentially accumulating a fuller sample of the "relevant" class after each step, which could improve recall as more and more relevant documents are discovered. [Michael]

 <<O>>  Difference Topic IRseminar08-0404 (r1.10 - 02 Apr 2008 - ElifAktolga)

META TOPICPARENT IRseminarS08
Changed:
<
<

IR seminar, April 4, 2008

Topic Evaluation, XingYi and XiaobingXue leading.

>
>

IR seminar, April 4, 2008 (this week at 9:30am in Room 142)

Topic Evaluation, XingYi and XiaobingXue leading


Background:

This week’s session will discuss the challenges of reliably evaluating different IR systems when relevance judgments are incomplete, biased, or noisy. The goal is to understand previous work on this topic and some state-of-the-art techniques and results; we may then broadly discuss any interesting ideas related to this topic or suggest plausible solutions. Previous research can be roughly divided into three groups: 1) analyzing the problems caused by incomplete relevance judgments and their impact on existing IR measures; 2) designing new IR measures that are more reliable/robust when judgments are incomplete/noisy; 3) designing new techniques that judge as few documents as possible without sacrificing effectiveness and reliability.
 <<O>>  Difference Topic IRseminar08-0404 (r1.9 - 02 Apr 2008 - XingYi)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Topic Evaluation, XingYi and XiaobingXue leading.

Line: 28 to 28

Background

Added:
>
>

Your Questions

Please add your questions and thoughts about this topic directly to the wiki (or email them to us) by Friday 6am at the latest. When you save your edits, please select "Release edit lock" so that other people can edit the page immediately. :-)

 <<O>>  Difference Topic IRseminar08-0404 (r1.8 - 01 Apr 2008 - XingYi)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Topic Evaluation, XingYi and XiaobingXue leading.

Line: 8 to 8

Required papers:

We list four required papers here. Since most people are already familiar with Ben's work, the effective number of required papers is still three.
Changed:
<
<
>
>

Line: 17 to 17

Changed:
<
<
>
>

Tangential but interesting:

Changed:
<
<
>
>

Background

  • The significance of the Cranfield tests on index languages. This is a very old paper that describes the Cranfield methodology for evaluating different systems. Basically, it discusses how to create queries and gather relevance judgments, and the role of precision/recall in such an evaluation.
 <<O>>  Difference Topic IRseminar08-0404 (r1.7 - 01 Apr 2008 - XiaobingXue)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Topic Evaluation, XingYi and XiaobingXue leading.

Line: 8 to 8

Required papers:

We list four required papers here. Since most people are already familiar with Ben's work, the effective number of required papers is still three.
Added:
>
>

Deleted:
<
<

Recommended:

Changed:
<
<
>
>

Tangential but interesting:

Added:
>
>

Background

  • The significance of the Cranfield tests on index languages. This is a very old paper that describes the Cranfield methodology for evaluating different systems. Basically, it discusses how to create queries and gather relevance judgments, and the role of precision/recall in such an evaluation.
 <<O>>  Difference Topic IRseminar08-0404 (r1.6 - 01 Apr 2008 - XiaobingXue)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Topic Evaluation, XingYi and XiaobingXue leading.

 <<O>>  Difference Topic IRseminar08-0404 (r1.5 - 01 Apr 2008 - XingYi)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Topic Evaluation, XingYi and XiaobingXue leading.

Line: 8 to 8

Required papers:

We list four required papers here. Since most people are already familiar with Ben's work, the effective number of required papers is still three.
Changed:
<
<
>
>

Line: 17 to 17

Added:
>
>

Tangential but interesting:

Line: 24 to 25

Background

Changed:
<
<
>
>

 <<O>>  Difference Topic IRseminar08-0404 (r1.4 - 31 Mar 2008 - XiaobingXue)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Topic Evaluation, XingYi and XiaobingXue leading.

Line: 7 to 7

This week’s session will discuss the challenges of reliably evaluating different IR systems when relevance judgments are incomplete, biased, or noisy. The goal is to understand previous work on this topic and some state-of-the-art techniques and results; we may then broadly discuss any interesting ideas related to this topic or suggest plausible solutions. Previous research can be roughly divided into three groups: 1) analyzing the problems caused by incomplete relevance judgments and their impact on existing IR measures; 2) designing new IR measures that are more reliable/robust when judgments are incomplete/noisy; 3) designing new techniques that judge as few documents as possible without sacrificing effectiveness and reliability.

Required papers:

Changed:
<
<
We list four required papers here, since most people are familiar with Ben's work.
>
>
We list four required papers here. Since most people are already familiar with Ben's work, the effective number of required papers is still three.

Line: 23 to 23

Changed:
<
<

Related Resources

>
>

Background


 <<O>>  Difference Topic IRseminar08-0404 (r1.3 - 31 Mar 2008 - XiaobingXue)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Changed:
<
<

Topic Evaluation, Xing and Xiaobing leading.

>
>

Topic Evaluation, XingYi and XiaobingXue leading.


Background:

Changed:
<
<
>
>
This week’s session will discuss the challenges of reliably evaluating different IR systems when relevance judgments are incomplete, biased, or noisy. The goal is to understand previous work on this topic and some state-of-the-art techniques and results; we may then broadly discuss any interesting ideas related to this topic or suggest plausible solutions. Previous research can be roughly divided into three groups: 1) analyzing the problems caused by incomplete relevance judgments and their impact on existing IR measures; 2) designing new IR measures that are more reliable/robust when judgments are incomplete/noisy; 3) designing new techniques that judge as few documents as possible without sacrificing effectiveness and reliability.

Required papers:

Changed:
<
<
>
>
We list four required papers here, since most people are familiar with Ben's work.

Recommended:

Changed:
<
<
>
>

Tangential but interesting:

Changed:
<
<
>
>

Related Resources

 <<O>>  Difference Topic IRseminar08-0404 (r1.2 - 24 Mar 2008 - XiaobingXue)

META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Changed:
<
<

Topic, people leading.

>
>

Topic Evaluation, Xing and Xiaobing leading.


Background:

 <<O>>  Difference Topic IRseminar08-0404 (r1.1 - 25 Feb 2008 - JamesAllan)
Line: 1 to 1
Added:
>
>
META TOPICPARENT IRseminarS08

IR seminar, April 4, 2008

Topic, people leading.

Background:

Required papers:

Recommended:

Tangential but interesting:

Related Resources

Revision r1.1 - 25 Feb 2008 - 19:21 - JamesAllan
Revision r1.16 - 05 Apr 2008 - 23:18 - MichaelBendersky