Index of /downloads/eqfe

Name                                 Last modified       Size  Description

Parent Directory                                         -
revised-clueweb09-annotations.tsv    22-Jul-2014 14:45   17K
robust04/                            22-Jul-2014 14:45   -
runs/                                22-Jul-2014 14:45   -

Online appendix for Entity Query Feature Expansion with Knowledge Base Links

This site provides runs on Robust04 and Clueweb and entity features produced with the Entity-Query Feature Expansion method.

Cite as Jeffrey Dalton, Laura Dietz, and James Allan. "Entity Query Feature Expansion with Knowledge Base Links". SIGIR, 2014.

Download paper from Download files from

For questions, email Laura Dietz

Test Sets

Robust: Title queries of the Robust04 test set

ClueWeb09: Title queries of ClueWeb09 subset b, evaluated with qrels on the whole corpus (subset a).

ClueWeb12: Title queries of ClueWeb12 subset b, evaluated with qrels on the whole corpus (subset a).

Entity Annotations

Robust: entity links generated with our entity linking toolkit KB Bridge on pools of documents. A NIL cutoff of 0.5 was used to determine non-linkable entity mentions.

Clueweb09 / Clueweb12: we use the FACC1 annotations (Facc1-09 and Facc1-12).

Extended Entity Annotations for Queries

The FACC1 annotations for Clueweb09 include entity annotations for queries. We found that these annotations were missing many entities and, in some cases, annotated the wrong entities. We therefore manually re-annotated the set of queries with Freebase IDs and Wikipedia titles. Annotation results are stored in tab-separated format in


Produced Document Rankings

Document rankings in TREC format, produced by the different methods on the different test sets:

robust04/ EQFE method (with all features)
robust04/ output of the baseline (using sdm retrieval model only)

Files for different test sets are in runs/*run
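
As a rough illustration of how the run files might be read, here is a minimal sketch that assumes the conventional six-column TREC run layout (query id, the literal Q0, document id, rank, score, run tag). The document ids and run tag in the sample are hypothetical and not taken from the actual files.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Group (docid, rank, score) tuples by query id.

    Assumes the conventional TREC run layout:
    query_id Q0 doc_id rank score run_tag
    """
    rankings = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, rank, score, _tag = parts
        rankings[qid].append((docid, int(rank), float(score)))
    # Order each query's documents by their reported rank.
    for docs in rankings.values():
        docs.sort(key=lambda d: d[1])
    return dict(rankings)


# Hypothetical sample lines, for illustration only.
sample = [
    "313 Q0 DOC-B 2 -5.9 example-run",
    "313 Q0 DOC-A 1 -5.2 example-run",
]
run = parse_trec_run(sample)
```

To read a real file, the same function could be called on an open file handle, since it only iterates over lines.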


The standard way to evaluate this benchmark is to run the trec_eval tool with the qrels file. The default metric for Robust is mean average precision (MAP). For Clueweb we evaluate with NDCG.
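
For readers who want to see what MAP measures, the following is a simplified sketch of average precision with binary relevance. It is not a replacement for trec_eval, whose results may differ in details such as tie handling and which queries are averaged over. The function names and the toy document ids are assumptions made for illustration.

```python
def average_precision(ranked_docids, relevant):
    """Average precision of one ranking given a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of average precision over every query listed in qrels."""
    scores = [average_precision(run.get(qid, []), rel)
              for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: d1 and d3 are retrieved relevant documents, d9 is missed.
toy_run = {"313": ["d1", "d2", "d3", "d4"]}
toy_qrels = {"313": {"d1", "d3", "d9"}}
toy_map = mean_average_precision(toy_run, toy_qrels)
```

In the toy case the relevant documents appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 3.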

Outputs of the evaluation tool are in files matching *.eval

Files for different test sets are in respective folders.

Spreadsheets combining statistics of explicit entities in the query with the scores of the different methods are available in tab-separated format.


Files for different test sets are in runs/*eval

Distributions over entities, types, etc. per query

For each query, we kept the different distributions over entities, types, categories, and words. These are in the subdirectory distributions-per-query of the test set folders.


Example for query 313:

313 top1-numEnts20 #combine:0=0.13153163581559393:1=0.09405689562633537:2=0.08859091597736103:3=0.08689525200551367:4=0.07180973343334467:5=0.06461367740452842:6=0.0511868942308846:7=0.05081374448780364:8=0.04715336915833395:9=0.04615250932713316:10=0.04061632381095417:11=0.038589454587693174:12=0.02573501402455092:13=0.025734627270149347:14=0.025038009950167073:15=0.023508427862664178:16=0.023402352922477086:17=0.022592570012126088:18=0.022139447871698078:19=0.01983914422068693(germany yamanashi vegas las hamburg california german prefecture transportation department miyazaki nevada orange hayao county level noise channel japan group)

This line holds the names of the 20 most frequent entity links (only the most confident link, top1, is kept). #combine represents a distribution with exponential weights under a language model; "0=", "1=", etc. denote the index into the term list that follows in parentheses.

#wsum represents a multinomial distribution; #sdm is the sequential dependence retrieval model (combining unigrams, bigrams, and skip-bigrams of the terms).

Tokens with underscores are typically entity IDs, and tokens with slashes are categories or types. If a token has neither a slash nor an underscore, it is used as a word.
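
The line format described above can be unpacked with a short script. This is a minimal sketch that only handles the indexed-weight form shown in the query 313 example; the function names are hypothetical, the sample is a truncated copy of that example, and the token classification is just the rule of thumb stated above, with slashes checked first as an arbitrary tie-break.

```python
import re


def parse_distribution(line):
    """Split 'qid name #op:0=w0:1=w1(term0 term1)' into its parts."""
    query_id, name, expr = line.split(None, 2)
    match = re.fullmatch(r"#(\w+):([^()]*)\((.*)\)", expr.strip())
    if match is None:
        raise ValueError("unrecognized distribution expression")
    operator, weight_part, term_part = match.groups()
    terms = term_part.split()
    weights = {}
    for item in weight_part.split(":"):
        index, value = item.split("=")
        weights[int(index)] = float(value)
    # Pair each term with the weight stored under its index.
    pairs = [(terms[i], weights[i]) for i in sorted(weights)]
    return query_id, name, operator, pairs


def token_kind(token):
    """Heuristic only: slash -> category/type, underscore -> entity id."""
    if "/" in token:
        return "category-or-type"
    if "_" in token:
        return "entity-id"
    return "word"


# Truncated version of the query 313 example (first three entries only).
sample = (
    "313 top1-numEnts20 #combine:0=0.13153163581559393:"
    "1=0.09405689562633537:2=0.08859091597736103"
    "(germany yamanashi vegas)"
)
query_id, name, operator, pairs = parse_distribution(sample)
```

Lines using other operators, or lacking the index=weight prefix, would need their own handling, since their exact layout is not shown here.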

Features per document (and per query)

Separated by query, we kept different features for each document. These are in the zipped directory

robust04/$queryId$ features from matched entities, types, etc. (in SVMlight format)
robust04/ translating feature numbers to feature names
robust04/generated-feature-names-to-paper.tsv translating feature names to friendly names used in the paper
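
To illustrate how a feature line might be read, here is a minimal sketch of a parser for the general SVMlight line layout (a target value, an optional qid:, index:value pairs, and an optional trailing # comment). Whether the feature files here use the qid field or a comment is an assumption, and the sample line is invented for illustration.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight-style line into (target, qid, features, comment)."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    target = float(tokens[0])
    qid = None
    features = {}
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        if key == "qid":
            qid = value  # optional query id field
        else:
            features[int(key)] = float(value)  # feature number -> value
    return target, qid, features, comment.strip()


# Hypothetical line, not taken from the actual feature files.
sample_line = "1 qid:313 1:0.5 7:0.25 # DOC-A"
target, qid, features, comment = parse_svmlight_line(sample_line)
```

The feature numbers in the resulting dictionary could then be looked up in the number-to-name and name-to-paper mapping files listed above, once their column layout is known.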