Background
    PsgRobust is an answer passage collection built on top of the Robust04 collection without manual annotation.
    We built this collection for our research on iterative relevance feedback for the following reasons.
    Relevance feedback experiments require collections in which each query has multiple relevant answer passages.
    However, few existing collections contain such queries: most popular question-answering datasets have only one relevant answer per query.
    For detailed information, please see our paper:
        Keping Bi, Qingyao Ai, and W. Bruce Croft. "Iterative Relevance Feedback for Answer Passage Retrieval with Passage-level Semantic Match." In Proceedings of the 41st European Conference on Information Retrieval, 14 pages, Springer, 2019.

    This collection was built based on two assumptions:
        1. A passage ranked at a top position by a strong ranker is relevant if it comes from a relevant document.
        2. All passages in non-relevant documents are non-relevant.

    First, the top 100 documents were retrieved for each title query in Robust04 with the Sequential Dependence Model (SDM) [1].
    Then a sliding window of 2 or 3 sentences was used to split each document into non-overlapping passages, with the window size (2 or 3 sentences) chosen at random for each passage.
    Each passage was assigned an ID consisting of the DOCNO of its source document and the passage number within that document, joined by '_', i.e., DOCNO_passageNO.
    After that, the top 100 passages were retrieved with SDM for each title query, and the retrieved passages that come from relevant documents were treated as relevant.
    The recall of the top 100 documents is 0.43, i.e., on average 43% of the relevant documents for each query are included in the passage collection. Overall, 246 queries in PsgRobust have relevant passages: query 672 has no relevant documents in Robust04, and queries 309, 314, and 412 have no relevant passages in PsgRobust.
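    The construction steps above can be sketched in Python. This is an illustrative sketch, not the released code: the function and variable names are our own, and sentence segmentation as well as the SDM retrieval runs are assumed to happen elsewhere.

    ```python
    import random

    def split_into_passages(docno, sentences, rng=random):
        """Split a document (already segmented into sentences) into
        non-overlapping passages of 2 or 3 sentences, with the window
        size chosen at random for each passage. Passage IDs follow the
        DOCNO_passageNO convention described above."""
        passages = []
        start, pid = 0, 0
        while start < len(sentences):
            size = rng.choice([2, 3])
            passages.append((f"{docno}_{pid}", " ".join(sentences[start:start + size])))
            start += size
            pid += 1
        return passages

    def label_passages(ranked_passage_ids, relevant_docnos):
        """Apply the two assumptions above to a top-100 passage ranking:
        a retrieved passage is relevant iff its source document is
        relevant. The source DOCNO is everything before the last '_'
        in the passage ID."""
        return {pid: int(pid.rsplit("_", 1)[0] in relevant_docnos)
                for pid in ranked_passage_ids}
    ```

    Note that the last passage of a document may end up with a single sentence when the remaining text is shorter than the chosen window.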


Dataset
    There are 22,403 unique documents and 383,036 passages in total, and 6,589 relevant passages (drawn from 3,544 documents) for the 246 queries.
    The compressed folder contains:
    Queries:
        cv_query/
            The training and test queries of the 5 folds used for cross-validation in our experiments.
        PsgRobust.descs.tsv
            Questions (description queries) with relevant answer passages in the PsgRobust collection.
        robust04.descs.tsv
            The description queries in Robust04.
        robust04.titles.tsv
            The title queries in Robust04.
    Labels:
        PsgRobust.qrels
            Labels for relevant passages in PsgRobust.
        robust04.qrels
            Labels for relevant documents in Robust04.
    Passages:
        doc.trectext
            All the candidate passages in PsgRobust, in TREC format.
            These passages were generated by applying the sliding window of 2 or 3 sentences to the top 100 retrieved documents.
    readme.txt
        The file you are currently reading.
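    The qrels files and doc.trectext follow standard TREC conventions. A minimal reader for both, assuming the usual four-column qrels layout (query ID, iteration, document/passage ID, relevance) and <DOC>/<DOCNO>/<TEXT> tags in the trectext file, might look like:

    ```python
    import re
    from collections import defaultdict

    def parse_qrels(lines):
        """Parse TREC qrels lines ("qid iteration docid relevance") into
        {qid: {docid: relevance}}. Works for both PsgRobust.qrels
        (passage IDs) and robust04.qrels (document IDs)."""
        qrels = defaultdict(dict)
        for line in lines:
            parts = line.split()
            if len(parts) == 4:
                qid, _, docid, rel = parts
                qrels[qid][docid] = int(rel)
        return dict(qrels)

    def parse_trectext(text):
        """Extract (DOCNO, TEXT) pairs from a TREC-format string.
        The exact tag set in doc.trectext may differ slightly from
        this assumed <DOC>/<DOCNO>/<TEXT> layout."""
        docs = []
        for block in re.findall(r"<DOC>(.*?)</DOC>", text, re.S):
            docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", block, re.S).group(1)
            body = re.search(r"<TEXT>\s*(.*?)\s*</TEXT>", block, re.S)
            docs.append((docno, body.group(1) if body else ""))
        return docs
    ```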



Acknowledgements
    This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF IIS-1715095. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References
[1] Donald Metzler and W Bruce Croft. 2005. A Markov random field model for term dependencies. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 472–479.
[2] Keping Bi, Qingyao Ai, and W. Bruce Croft. 2019. Iterative Relevance Feedback for Answer Passage Retrieval with Passage-level Semantic Match. In Proceedings of the 41st European Conference on Information Retrieval. Springer, 14 pages.

