Web Answer Passages (WebAP) Dataset
===================================

The currently available dataset has been patched from the originally released data due to inconsistent judgments over duplicate sentences. This has been corrected in this version; the correction process is described in the "Patched Version Notes" section at the end of this README.

The dataset includes a data file and two query files. The size of the annotated data file is 275M. 82 queries were selected from the original 150 queries because these queries are more likely to have passage-level answers. The annotations cover the top 50 documents retrieved by SDM from the Gov2 collection, and all annotations are stored in a single file.

The documents are divided into sentences, each of which is enclosed in its own tag. Sentences that are answers are enclosed in extra tags that indicate their level of relevance. Sentences or groups of sentences that had no answer-passage relevance judgments associated with them are enclosed in their own tags as well.

Each document has a new ID that is the concatenation of the original Gov2 document ID and the ID of the query it is annotated for.

For details on the annotation process, refer to sections 3 and 5 of http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1155 and http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1156. Both papers describe this data in more detail.

Notice To Users
---------------

It has been noted that some Gov2 source document text is missing from the WebAP dataset. This is likely due to the use of Lynx during the annotation creation process. Lynx is a text-based HTML browser that focuses on producing readable text from media-rich pages, which may include tool tips, images, links, and other kinds of HTML markup. Lynx can therefore omit text, or reorder it in its output relative to the original source.
Users of this data should be aware of occasional differences in content between Gov2 source documents and the corresponding annotated documents in this collection. As an example, note the text missing from WebAP document GX262-28-10569245-701 compared with the original Gov2 source document GX262-28-10569245.

Patched Version Notes
---------------------

A known problem with the data was that judgments over duplicate sentences were not made consistent: some sentences were "reused" multiple times across different documents and were annotated differently each time. One cause was that near-duplicate documents were not entirely filtered out prior to annotation, but simply removing duplicate documents did not fix the problem, since roughly three-fourths of these sentences came from reused content or excerpts. In the patch released in early 2015, we developed the following two-stage procedure to rectify the issue.

1) We removed 1,100 near-duplicate documents from the original dataset using difflib in Python. Specifically, we used the SequenceMatcher API and removed any document with ratio() > 0.9 against a previously seen document. This left a total of 6,399 documents, which amounts to 1,959,777 sentences.

2) We then fixed the judgments at the sentence level by a majority vote within each pool of duplicate sentences, breaking ties in favor of the more relevant label. For example, a set of duplicate sentences labeled <0, 0, 3> would receive the new annotation <0, 0, 0>, whereas a set labeled <0, 3> would yield <3, 3>.
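The near-duplicate filtering in step 1 can be sketched as follows. The 0.9 threshold and the use of difflib's SequenceMatcher come from the procedure described above; the function name and the representation of documents as (doc_id, text) pairs are assumptions for illustration.

```python
from difflib import SequenceMatcher

def filter_near_duplicates(documents, threshold=0.9):
    """Keep a document only if its difflib ratio() against every
    previously kept document is at or below the threshold.

    `documents` is assumed to be an iterable of (doc_id, text) pairs.
    """
    kept = []
    for doc_id, text in documents:
        is_duplicate = False
        for _, kept_text in kept:
            # ratio() > threshold marks the pair as near-duplicates,
            # mirroring the ratio() > 0.9 rule used in the patch.
            if SequenceMatcher(None, text, kept_text).ratio() > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append((doc_id, text))
    return kept
```

Note that this pairwise comparison is quadratic in the number of documents, which is tolerable at the scale of this collection but would need blocking or hashing for much larger corpora.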
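The sentence-level relabeling in step 2 can be sketched as a majority vote with ties broken toward the more relevant label. The vote and tie-breaking rule follow the description above; the function name and the use of integer relevance labels are assumptions for illustration.

```python
from collections import Counter

def resolve_duplicate_labels(labels):
    """Given the labels of a pool of duplicate sentences, return the
    single label to assign to every sentence in the pool: the most
    frequent label, with ties broken in favor of the more relevant
    (numerically higher) label."""
    counts = Counter(labels)
    # Sort key is (frequency, label value): the most frequent label
    # wins, and among equally frequent labels the higher one wins.
    return max(counts, key=lambda label: (counts[label], label))
```

For instance, `resolve_duplicate_labels([0, 0, 3])` returns 0 and `resolve_duplicate_labels([0, 3])` returns 3, matching the <0, 0, 0> and <3, 3> outcomes in the examples above.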