Web Answer Passages (WebAP) Dataset
===================================

The currently available dataset has been patched from the originally released data due to inconsistent judgments over duplicate sentences. This has been corrected in this version; the correction process is described in the "Patched Version Notes" section at the end of this README.

The dataset includes a data file and two query files. The size of the annotated data file is 275M. 82 queries were selected from the original 150 queries because these queries are more likely to have passage-level answers. The annotations cover the top 50 documents retrieved by SDM from the Gov2 collection, and all annotations are stored in a single file.

The documents are divided into sentences, each of which is enclosed in its own tag. Sentences that are answers are enclosed in extra tags that indicate their level of relevance. Sentences or groups of sentences that had no answer-passage relevance judgments associated with them are enclosed in their own tags as well.

Each document has a new ID that is the concatenation of the original Gov2 document ID and the ID of the query it is annotated for.

For details on the annotation process, refer to sections 3 and 5 of http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1155 and http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1156. Both papers describe this data in more detail.

Notice To Users
---------------

It has been noted that some Gov2 source document text is missing from the WebAP dataset. This is likely due to the use of Lynx during the annotation creation process. Lynx is a text-based HTML browser that focuses on producing readable text from media-rich pages, which may include tool tips, images, links, and other kinds of HTML markup. Lynx can therefore omit text, or reorder it in its output relative to the original source.
Users of this data should be aware of occasional differences in content between Gov2 source documents and the corresponding annotated documents in this collection. As an example, note the text missing from WebAP document GX262-28-10569245-701 compared with the original Gov2 source document GX262-28-10569245.

Patched Version Notes
---------------------

A known problem with the data was that judgments over duplicate sentences were not made consistent: some sentences were "reused" multiple times across different documents and were annotated differently each time. One cause was that near-duplicate documents were not entirely filtered out prior to annotation, but simply removing duplicate documents did not fix the problem, since roughly three-fourths of these sentences came from reused content or excerpts. In the patch released in early 2015, we developed the following two-stage procedure to rectify the issue.

1) We removed 1,100 near-duplicate documents from the original dataset using difflib in Python. Specifically, we used the SequenceMatcher API and removed any document with ratio() > 0.9 against a previously seen document. This left a total of 6,399 documents, which amounts to 1,959,777 sentences.

2) We then fixed the judgments at the sentence level by a majority vote within each pool of duplicate sentences, breaking ties in favor of the more relevant label. For example, a set of duplicate sentences labeled <0, 0, 3> would receive the new annotation <0, 0, 0>, whereas a set labeled <0, 3> would yield <3, 3>.
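The near-duplicate filtering in step 1 can be sketched as follows. The 0.9 threshold and the use of difflib's SequenceMatcher come from the procedure described above; the function name and the representation of documents as (doc_id, text) pairs are assumptions for illustration.

```python
from difflib import SequenceMatcher

def filter_near_duplicates(documents, threshold=0.9):
    """Keep a document only if its difflib ratio() against every
    previously kept document is at or below the threshold.

    `documents` is assumed to be an iterable of (doc_id, text) pairs.
    """
    kept = []
    for doc_id, text in documents:
        is_duplicate = False
        for _, kept_text in kept:
            # ratio() > threshold marks the pair as near-duplicates,
            # mirroring the ratio() > 0.9 rule used in the patch.
            if SequenceMatcher(None, text, kept_text).ratio() > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append((doc_id, text))
    return kept
```

Note that this pairwise comparison is quadratic in the number of documents, which is tolerable at the scale of this collection but would need blocking or hashing for much larger corpora.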
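The sentence-level relabeling in step 2 can be sketched as a majority vote with ties broken toward the more relevant label. The vote and tie-breaking rule follow the description above; the function name and the use of integer relevance labels are assumptions for illustration.

```python
from collections import Counter

def resolve_duplicate_labels(labels):
    """Given the labels of a pool of duplicate sentences, return the
    single label to assign to every sentence in the pool: the most
    frequent label, with ties broken in favor of the more relevant
    (numerically higher) label."""
    counts = Counter(labels)
    # Sort key is (frequency, label value): the most frequent label
    # wins, and among equally frequent labels the higher one wins.
    return max(counts, key=lambda label: (counts[label], label))
```

For instance, `resolve_duplicate_labels([0, 0, 3])` returns 0 and `resolve_duplicate_labels([0, 3])` returns 3, matching the <0, 0, 0> and <3, 3> outcomes in the examples above.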