Web Answer Passages (WebAP) Dataset
Background
This dataset is the patched (Oct 2015) version, which eliminates inconsistent relevance judgments across duplicate sentences. See the README.txt for details.
The dataset is derived from the 2004 TREC Terabyte Track Gov2 collection and its corresponding description-length queries. Annotators were tasked with highlighting answer passages in the data using an annotation system, with a second person performing quality control. Each answer passage was graded on a quality scale of "perfect", "excellent", "good", and "fair".
The result was 8,027 answer passages for 82 TREC queries, an average of about 97 passages per query. Of the annotated passages, 43% were labeled perfect, 44% excellent, 10% good, and the remaining 3% fair.
Of the 82 selected queries, only 2 (queries 715 and 752) had no annotated passages.
The average passage length was 45 words.
For papers on the development of this dataset and its research use, see:
- Daniel Cohen, W. Bruce Croft. 2018. "A Hybrid Embedding Approach to Noisy Answer Passage Retrieval". In Proceedings of the 40th European Conference on Information Retrieval (ECIR 2018).
- Liu Yang, Qingyao Ai, Damiano Spina, Ruey-Cheng Chen, Liang Pang, W. Bruce Croft, Jiafeng Guo and Falk Scholer. 2016. "Beyond Factoid QA: Effective Methods for Non-factoid Answer Sentence Retrieval". In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016).
- Evi Yulianti, Ruey-Cheng Chen, Falk Scholer, and Mark Sanderson. 2016. "Using Semantic and Context Features for Answer Summary Extraction". In Proceedings of the 21st Australasian Document Computing Symposium (ADCS '16).
- Ruey-Cheng Chen, Damiano Spina, W. Bruce Croft, Mark Sanderson and Falk Scholer. 2015. "Harnessing Semantics for Answer Sentence Retrieval". In Proceedings of the Eighth Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR '15), a CIKM 2015 workshop.
For questions or comments concerning the dataset or this web page, email the Downloads contact.
Queries
- The gov2.queriesAllRawFile.txt file contains the original set of 150 TREC Gov2 queries.
- The gov2.query.json file contains the 82 queries selected from the original 150 TREC Gov2 queries.
- Some of the original queries were dropped because they were judged unlikely to have passage-level answers. (A minimal sketch for loading the query file follows this list.)
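As a starting point, the query file can be inspected with a few lines of Python. This is only a sketch: the exact JSON schema of gov2.query.json is not described on this page, so the code makes no assumption beyond the file being valid JSON and simply reports how many entries it holds.

```python
import json

# Minimal sketch for inspecting gov2.query.json. The field names and top-level
# structure are not documented here, so adapt the access code after looking at
# one entry.
with open("gov2.query.json", encoding="utf-8") as f:
    queries = json.load(f)

# The top-level object may be a list of query records or a dict keyed by
# query ID (assumption); handle both so the count is reported either way.
entries = queries if isinstance(queries, list) else list(queries.items())
print(f"Loaded {len(entries)} query entries")
print(entries[0])  # inspect one entry to learn the actual field names
```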
Documents
- The documents are divided into sentences, each enclosed in sentence tags.
- Sentences that are answers to queries are enclosed in tags indicating their level of relevance.
- Relevance tags include <perfect>, <excel>, <good> and <fair>.
- Each document has an ID representing the concatenation of the original Gov2 document ID and the corresponding query ID for which it is annotated.
- There are a total of 6,399 documents in the dataset.
- Annotations cover the top 50 documents retrieved for each query from the TREC Gov2 collection using the Sequential Dependence Model.
- All annotations reside in the single grade.trectext_patched file; a minimal parsing sketch follows this list.
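The sketch below scans the annotation file for graded answer sentences. It rests on two assumptions: that the relevance grades appear as the four tags listed above, and that each annotated document ID ends with the query ID after the final hyphen (as in GX262-28-10569245-701); the exact markup should be checked against README.txt.

```python
import re
from collections import Counter

# Minimal sketch: count graded answer sentences per relevance level in the
# patched annotation file. Assumes the four relevance tags listed above are
# used as open/close tags; verify the actual markup against README.txt.
GRADE_TAGS = ("perfect", "excel", "good", "fair")
pattern = re.compile(
    r"<(%s)>(.*?)</\1>" % "|".join(GRADE_TAGS), re.IGNORECASE | re.DOTALL
)

with open("grade.trectext_patched", encoding="utf-8", errors="replace") as f:
    text = f.read()

counts = Counter(grade.lower() for grade, _sentence in pattern.findall(text))
print(counts)

# A document ID such as "GX262-28-10569245-701" concatenates the Gov2 document
# ID and the query ID; splitting on the last hyphen recovers both parts.
gov2_id, query_id = "GX262-28-10569245-701".rsplit("-", 1)
print(gov2_id, query_id)  # GX262-28-10569245 701
```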
Notice To Users
It has been noted that text from some Gov2 source documents is missing from the WebAP dataset. This is possibly due to the use of Lynx during the annotation process.
Lynx is a text-based HTML browser that produces readable text from media-rich pages containing tool tips, images, links and other HTML markup. As a result, Lynx may omit text or reorder it in its output relative to the original source.
Users of this data should be aware of occasional differences in content between Gov2 source documents and the corresponding annotated documents in this collection.
As an example, note the missing text from WebAP document GX262-28-10569245-701 compared with the original Gov2 source document GX262-28-10569245.
Download
The README.txt file is included in both archives, but it is also available here individually so one can get an overview of the dataset's characteristics and content.
Also available for download is the original ranking file used for document annotations. The file lists the top 50 documents in rank order for each query in the query set. Although no scores are presented, the ranking order may be of use to some researchers.
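The ranked list could be grouped by query with a few lines of Python. This sketch assumes a simple whitespace-separated layout with a query ID followed by a Gov2 document ID on each line, already in rank order; the actual column layout should be verified against the file itself.

```python
from collections import defaultdict

# Minimal sketch: group the top-50 ranked documents by query ID.
# Assumes each line holds a query ID and a Gov2 document ID, whitespace-
# separated and in rank order; verify the layout against the file.
ranked = defaultdict(list)
with open("gov2_top50_doclist.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 2:
            continue
        query_id, doc_id = parts[0], parts[1]
        ranked[query_id].append(doc_id)

# Print the first few queries with the head of their ranked lists.
for query_id, docs in list(ranked.items())[:3]:
    print(query_id, docs[:5])
```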
Uncompress the rar archive using 7-Zip or rar/unrar on Windows machines. Both utilities can also be installed on Unix machines.
rar x WebAP.rar
7z x WebAP.rar
On Unix machines, extract the gzipped tar archive using tar.
tar xvzf WebAP.tar.gz
Files available for download:
- README.txt
- gov2_top50_doclist.txt
- WebAP rar archive
- WebAP tar gzip archive
This work was supported in part by the Center for
Intelligent Information Retrieval and in part by the
National Science Foundation grant #IIS-1419693. Any opinions, findings and
conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect those of the sponsor.
Thanks also to Mostafa Keikha, Jae Hyun Park and Liu Yang for their efforts in developing this dataset.