WebAP Dataset

Web Answer Passages (WebAP) Dataset


 


Background


This dataset is the patched (Oct 2015) version, which eliminates inconsistent relevance judgments across duplicate sentences. See the README.txt for details.

The dataset is derived from the 2004 TREC Terabyte Track Gov2 collection and its corresponding description-length queries. Annotators highlighted answer passages in the documents using an annotation system, with a second person performing quality control. Each answer passage was graded for quality on a four-level scale: "perfect", "excellent", "good" and "fair".

The result was 8,027 answer passages for 82 TREC queries, averaging about 97 passages per query. Of all annotated passages, 43% were labeled perfect, 44% excellent, 10% good and the remaining 3% fair.

Among the 82 selected queries, only 2 had no annotated passages (queries 715 and 752).

The average passage length was 45 words.


For further information on the development of this dataset and its research use, see:


For papers on the use of the WebAP data, see:


Email Downloads for questions or comments concerning the dataset or this web page.


Dataset


Queries


Documents


Notice To Users

It has been noted that some Gov2 source document texts are missing in the WebAP dataset. This is possibly due to the use of Lynx during the annotation creation process.

Lynx is a text-based web browser that renders media-rich HTML pages, which may include tooltips, images, links and other types of markup, as readable text. Its output can therefore omit text or reorder it relative to the original source.

The user of this data should be aware of occasional differences in content between Gov2 source documents and the corresponding annotated documents in this collection.

As an example, note the missing text from WebAP document GX262-28-10569245-701 compared with the original Gov2 source document GX262-28-10569245.
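
To look up the Gov2 source for a WebAP document, the ID pair in the example above suggests that a WebAP document ID is the Gov2 document number followed by a hyphen and a numeric suffix. The following is a minimal sketch under that assumption; the function name gov2_docno is hypothetical, and the ID pattern is inferred from this single example rather than documented here, so check it against README.txt before relying on it.

```shell
# Hypothetical helper: recover the Gov2 document number from a WebAP document ID.
# ASSUMPTION: every WebAP ID has the form <Gov2 docno>-<suffix>, as in the
# example GX262-28-10569245-701 -> GX262-28-10569245.
gov2_docno() {
    # ${1%-*} removes the shortest trailing match of "-*", i.e. the final "-<suffix>".
    printf '%s\n' "${1%-*}"
}

gov2_docno GX262-28-10569245-701   # prints GX262-28-10569245
```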


Download


The README.txt file is included in both archives, but is available here individually so one may obtain an overview of dataset characteristics and content.

Also available for download is the original ranking file used for document annotations. The file lists the top 50 documents in rank order for each query in the query set. Although no scores are presented, the ranking order may be of use for some researchers.
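
As a quick sanity check on the ranking file, one can count how many documents are listed per query. The sketch below is not based on the documented file layout: it assumes each non-empty line begins with the query number followed by whitespace, and the function name docs_per_query is hypothetical. Adjust the field number if the actual layout differs.

```shell
# Hypothetical helper: count ranked documents per query in a ranking file.
# ASSUMPTION: the first whitespace-separated field on each line is the query number.
docs_per_query() {
    # Skip empty lines (NF == 0), tally lines per query, then sort by query number.
    awk 'NF { count[$1]++ } END { for (q in count) print q, count[q] }' "$1" | sort -n
}

# Usage (after downloading the file):
#   docs_per_query gov2_top50_doclist.txt
```

If the assumption holds, each query should report at most 50 documents.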

Extract the rar archive using 7-Zip or rar/unrar on Windows machines. Both of these utilities can also be installed on Unix machines. Note that the 7-Zip command-line executable is named 7z.
     rar x -r WebAP.rar
     7z x -r WebAP.rar

On Unix machines, extract the gzipped tar archive using tar.
     tar xvzf WebAP.tar.gz


File Name                 Compressed Size   Uncompressed Size
README.txt                ---               2.5K
gov2_top50_doclist.txt    ---               222K
WebAP rar archive         52M               269M
WebAP tar gzip archive    73M               269M


Acknowledgements


This work was supported in part by the Center for Intelligent Information Retrieval and in part by the National Science Foundation grant #IIS-1419693. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

 

Thanks also to Mostafa Keikha, Jae Hyun Park and Liu Yang for their efforts in developing this dataset.