QA@CLEF-2004

Resources



Judged Submissions of the CLEF-2004 QA Track

Eighteen groups participated in the CLEF-2004 QA evaluation exercise, submitting 48 runs in 19 different tasks.
Submissions have been judged by human assessors and grouped according to the target language of the tasks. Here you can download them (zip file).



Test Sets at CLEF-2003:

Three monolingual tasks (with Dutch, Italian and Spanish questions) and five bilingual tasks (where Dutch, French, German, Italian and Spanish queries searched for an answer in an English target corpus) were proposed at CLEF-2003.
Here are the original test sets that were distributed to participants. Each test collection is a plain text file. Please visit last year's web site for further information about the format.
Correct answers were manually retrieved and are included in the "DISEQuA" and "Multisix" corpora (see below).

   Monolingual Tasks:
Dutch
Italian
Spanish
   Cross-language Tasks:
Dutch
French
German
Italian
Spanish



DISEQuA corpus:

The Dutch, Italian, Spanish and English collection of Questions and Answers was developed by three research groups: ITC-irst (Centro per la Ricerca Scientifica e Tecnologica, Trento - Italy), UNED (Spanish Distance Learning University, Madrid - Spain) and ILLC (Language and Inference Technology Group, University of Amsterdam - The Netherlands).
It is composed of 450 questions formulated in four languages. The answers have been manually searched in three document collections, which makes it possible to test and train cross-language QA systems in twelve different combinations. The corpora in which the answers were retrieved are those licensed by the CLEF consortium in 2002: La Stampa and SDA newspaper/wire articles (year 1994) for Italian, EFE (year 1994) for Spanish and Algemeen Dagblad and NRC Handelsblad (years 1994 and 1995) for Dutch. Questions also appear in English, but they were not verified in an English document collection.
Reference publication (to be acknowledged whenever you use DISEQuA) is B. Magnini, S. Romagnoli, A. Vallin, J. Herrera, A. Peñas, V. Peinado, F. Verdejo, M. de Rijke, Creating the DISEQuA Corpus: a Test Set for Multilingual Question Answering, in Carol Peters, editor, Working Notes for the CLEF 2003 Workshop, 21-22 August, Trondheim, Norway, 2003.
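
One plausible reading of the "twelve different combinations" mentioned above is that each of the four question languages can be paired with each of the three document collections in which answers were verified. The short Python sketch below enumerates those pairs under that assumption; the variable and function names are illustrative only and are not part of the DISEQuA distribution.

```python
from itertools import product

# Languages in which each DISEQuA question is formulated (from the description above).
QUESTION_LANGUAGES = ["Dutch", "Italian", "Spanish", "English"]
# Languages of the document collections in which answers were manually verified.
COLLECTION_LANGUAGES = ["Dutch", "Italian", "Spanish"]


def language_pairs():
    """Return every (question language, collection language) pair."""
    return list(product(QUESTION_LANGUAGES, COLLECTION_LANGUAGES))


pairs = language_pairs()
# A pair is monolingual when questions and documents share a language.
monolingual = [(q, c) for q, c in pairs if q == c]
cross_language = [(q, c) for q, c in pairs if q != c]

for question_lang, collection_lang in pairs:
    print(f"{question_lang} questions -> {collection_lang} documents")

print(len(pairs), len(monolingual), len(cross_language))  # 12 3 9
```

Under this reading, three of the twelve pairs are monolingual and nine are cross-language; no pair has English documents, since the questions were not verified in an English collection.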

For further information, read a short description of the corpus.
Here you can download version 1.0 of DISEQuA (zip file).



Multisix corpus:

The test sets we used for the cross-language tasks at CLEF QA-2003 are collected in the Multisix corpus, a collection of 200 English questions whose answers have been manually searched in the Los Angeles Times corpus (year 1994) licensed last year by CLEF. Each question has been translated into five languages: Dutch, French, German, Italian and Spanish, but no manual processing was conducted in other document collections.
Some typos were recently found and corrected in the German questions, so some entries in the "Multisix corpus" are slightly different from those in the original test sets (which can be downloaded above).
Reference publication (to be acknowledged whenever you use Multisix) is B. Magnini, S. Romagnoli, A. Vallin, J. Herrera, A. Peñas, V. Peinado, F. Verdejo, M. de Rijke, The Multiple Language Question Answering Track at CLEF 2003 (see chapter "Gold Standard for the Cross-Language Tasks"), in Carol Peters, editor, Working Notes for the CLEF 2003 Workshop, 21-22 August, Trondheim, Norway, 2003.

For further information, read a short description.
Here you can download the revised version (v2) of the Multisix corpus (zip file).



Check input utilities:

Before submitting their results, participants should run this checking routine in order to detect format inconsistencies (invalid document numbers, missing data, etc.) in their runs. Submissions that do not comply with the required format will not be assessed. For a detailed description of the answer format, please refer to the track guidelines.

Download the checking routine for CLEF-2003 QA track.
Download the checking routine for CLEF-2004 QA track.



Italian Translation of the TREC Questions:

ITC-irst has translated 1000 questions released for the QA track at TREC-2002 and TREC-2003 into Italian. They are a good example of what CLEF questions for this year's tasks may look like, and they can be used for training.
Similarly to the DISEQuA corpus (see above), the translation of the two TREC question sets is given in two XML files, where queries are numbered and described according to the category they belong to (either FACTOID, LIST or DEFINITION) and their answer type, i.e. the instance they refer to.
Several kinds of answer types have been taken into account: LOCATION (a place), PERSON (someone's name or role), TIME (the date of an event), MEASURE (the amount of something), MATERIAL (a particular substance), HOW (questions like "How did something happen?"), TITLE (the title of a song, movie, book, etc.), ACRONYM (the meaning of an abbreviation) and OTHER (plants, animals, inanimate objects, etc.). In most cases, the right answer is provided.
This translation represents a growing resource, and you are all encouraged to add other languages and other useful descriptive tags.
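
If you add languages or tags to this resource, it may help to check that every question still carries one of the categories and answer types listed above. The Python sketch below shows one way to do that. It works on plain (category, answer type) label pairs because the XML element and attribute names are not described on this page; the function name, the sample data and the assumption that the nine listed answer types are exhaustive are all hypothetical.

```python
# Categories and answer types as listed in the description above.
# Assumption: these lists are exhaustive for the distributed files.
CATEGORIES = {"FACTOID", "LIST", "DEFINITION"}
ANSWER_TYPES = {
    "LOCATION", "PERSON", "TIME", "MEASURE", "MATERIAL",
    "HOW", "TITLE", "ACRONYM", "OTHER",
}


def check_labels(category, answer_type):
    """Return a list of problems found in one question's labels (empty if none)."""
    problems = []
    if category not in CATEGORIES:
        problems.append(f"unknown category: {category!r}")
    if answer_type not in ANSWER_TYPES:
        problems.append(f"unknown answer type: {answer_type!r}")
    return problems


# Hypothetical label pairs, as they might be read from the XML files.
sample = [
    ("FACTOID", "LOCATION"),
    ("DEFINITION", "PERSON"),
    ("FACTOID", "COLOUR"),  # deliberately invalid answer type
]

for number, (category, answer_type) in enumerate(sample, start=1):
    for problem in check_labels(category, answer_type):
        print(f"question {number}: {problem}")
```

New descriptive tags can be accommodated by extending the two sets rather than changing the check itself.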

Download the translation of the TREC-2002 questions. (zip file)
Download the translation of the TREC-2003 questions. (zip file)



Test Set for Italian Named-Entities Recognition:

Annotated text represents another useful resource you may use to test and improve your system. ITC-irst provides the transcribed text of Italian broadcasts, in which the entities LOCATION, PERSON and ORGANIZATION have been marked with tags, according to the NIST guidelines.

Download the test set. (tar.gz file)



French Translation of the TREC Questions:

The RALI group (Laboratoire de Recherche Appliquée en Linguistique Informatique) at the University of Montreal, Canada, has translated into French 1893 questions drawn from the TREC QA evaluation exercises.

The file is available at the RALI website.



Spanish Resources:

QA resources for Spanish (including the translation of the TREC questions) are available on the website of the NLP and IR Group at UNED (Madrid, Spain).

URL: http://terral.lsi.uned.es/QA/resources/



Finnish Resources:

The DOREMI research group at the University of Helsinki has posted some QA resources for Finnish, including translations of the CLEF 2003 and 2004 test sets.

URL: http://www.cs.helsinki.fi/research/doremi/interests/QAResources.shtml