CIIR Releases Swahili and Somali Query Translations of CLEF Bilingual Dataset

Researchers at the Center for Intelligent Information Retrieval (CIIR) in the University of Massachusetts Amherst College of Information and Computer Sciences have released a dataset of Swahili and Somali queries translated from the CLEF 2000-2003 Campaign for Bilingual Ad-Hoc Retrieval Tracks (http://catalog.elra.info/en-us/repository/browse/ELRA-E0008/).

To support research on low-resource languages, the CIIR has produced an extension of 200 queries by translating all four years of bilingual queries (2000-2003) into Swahili and Somali. The topic IDs, C001-C200, correspond to those of the other languages in the CLEF data. A translation organization translated the title and description of the English queries in that topic set into both languages. Somali belongs to the Afro-Asiatic language family, and Swahili belongs to the Niger-Congo language family. Both are spoken mainly in Africa.

More information can be found in the paper “Simulating CLIR Translation Resource Scarcity using High-resource Languages,” by Hamed Bonab, James Allan, and Ramesh Sitaraman, in the Proceedings of the ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019).

The dataset and paper can be downloaded at: https://ciir.cs.umass.edu/ictir19_simulate_low_resource.