
National projects

ELDA actively participates in many projects funded by the French government. The resources produced in the framework of those projects are listed below.

-   Modern French Corpus - funded by the French ministry of research

  • ELRA-W0032 Modern French Corpus including Anaphors tagging
    - In ELRA’s Catalogue -
  • Modern French Corpus - Hermès Corpus
    This corpus is stored on one CD and consists of about 170 articles from the Hermès periodical, available in different file formats : HTML, SGML and Word. Morpho-syntactic tagging (performed with the WinBrill morpho-syntactic tagger) is provided with the SGML version.
  • Modern French Corpus - Syntsem Corpus
    The Syntsem project provides a corpus that is partially tagged at the syntactic and semantic levels. The source corpora were selected to represent a wide sampling of the French language and are divided into 5 topics : newspapers, human sciences, periodicals, literary texts and European institutional texts.

-   LRs produced within projects from the Technolangue programme funded by the French ministry of research

  • ELRA-E0018 ARCADE II Evaluation Package (Evaluation of multilingual corpora alignment systems)
    - In ELRA’s Catalogue -
    • Aligned Corpus : Aligned corpus of written text from Le Monde Diplomatique in Arabic-French (150 articles per language), Chinese-French (59 articles per language), Greek-French (50 articles per language), Japanese-French (52 articles per language), Persian-French (53 articles per language), and Russian-French (50 articles per language).
    • JOC Corpus : Written text from the MULTEXT JOC Corpus in English, French, German, Italian and Spanish, reformatted into XML and UTF-8 encoded, consisting of 1 million words per language.
    • Sub-set of the Le Monde Diplomatique corpus : 30 aligned Arabic-French texts with tagged named entities.
  • ELRA-E0020 CESTA Evaluation Package (Evaluation of machine translation systems)
    - In ELRA’s Catalogue -
    • Development data for restricted-domain machine translation : Medical data consisting of about 20,000 words in Arabic and French for Arabic-to-French translation.
    • Development data for restricted-domain machine translation : Medical data consisting of about 20,000 words in English and French for English-to-French translation.
    • Test data for general language machine translation : General vocabulary consisting of about 20,000 words in English and French for English-to-French translation, and about 20,000 words in Arabic and French for Arabic-to-French translation.
    • Test data for restricted-domain machine translation : Medical data consisting of about 20,000 words and 200,000 masking words for the English-to-French direction and similar figures for the Arabic-to-French direction.
  • ELRA-E0022 EQueR Evaluation Package (Evaluation of question answering systems)
    - In ELRA’s Catalogue -
    • Open-domain corpus of questions : This corpus contains 500 questions that were grouped as follows : 407 “Simple Factual” questions (“Who is the President of Chile ?”), 32 “Definition” questions (“What is NATO ?”), 31 “List” questions (“Which are the four main religions practiced in Hungary ?”) and 30 “Yes/No” questions (“Is there a TGV railway line from Paris to Valencia ?”).
    • Restricted-domain corpus of questions (“Medical domain”) : This corpus contains 200 questions that were divided as follows : 81 “Factual” questions (“What is the gene involved in aniridia ?”), 70 “Definition” questions (“What is a mental illness ?”), 25 “List” questions (“What are the four major symptoms of ovarian cancer ?”), and 24 “Yes/No” questions (“Is it possible for a child to be schizophrenic ?”).
    • News Corpus : This corpus (1.5 GB) contains many years' worth of newspaper articles from Le Monde (from 1992 to 2000) and Le Monde Diplomatique (from 1992 to 2000), French-language releases from the Swiss news agency SDA (Schweizerische Depeschenagentur) and the French Senate’s reports on various issues. The whole corpus contains about 560,000 documents : about 460,000 documents from Le Monde, 7,800 from Le Monde Diplomatique, 65,800 from SDA 1994-1995, and 570 documents from the French Senate’s reports.
    • Medical Corpus : The corpus of medical texts (approx. 140 MB) is composed of scientific articles and various references to “good medical practice”. The original formats of the medical data are PDF and HTML files. The data is provided as a single file with simple tags (document identifier, title and paragraph).
  • EASY (Evaluation of parsers)
    • Corpus of questions (TREC, Amaryllis) containing 137,000 words, 5,000 of which are syntactically annotated.
    • Corpus of 150 emails, with 7,000 syntactically annotated words.
    • Le Monde and French Senate annotated corpus consisting of about 235,000 words, 9,000 of which are syntactically annotated.
  • ELRA-E0021 ESTER Evaluation Package (Evaluation of automatic broadcast news transcription systems)
  • ELRA-S0241 ESTER Corpus
    - In ELRA’s Catalogue -
    • 60 hours of orthographically transcribed broadcast news, including annotations of named entities.
    • 1,700 hours of non-transcribed radio broadcast news recordings.
    • The textual resources distributed within the ESTER campaign are mainly based on the archives of the Le Monde newspaper 1987-2003 (ELRA-W0015) and the debates of the European Parliament (ELRA-W0023).
    • The evaluation tools make it possible to evaluate each task of the campaign.
    • Two guides and manuals were produced and are provided in the package distributed by ELDA :
      • Guide for the annotation of named entities
      • Specifications and evaluation protocol
  • ELRA-E0023 EVASY Evaluation Package (Evaluation of speech synthesis systems)
    - In ELRA’s Catalogue -
    • Evaluation of the grapheme-to-phoneme module : Scripts, scoring tools and corpus of scientific articles/documents from the Action de Recherche Concertée (ARC) B3 Evaluation campaign (evaluation of speech synthesis systems).
    • Corpus for prosody evaluation : Corpus of proper names with about 8,000 entries.
    • Corpus for global evaluation : Corpus dedicated to MOS (Mean Opinion Score) and ACR (Absolute Category Rating) tests.

-   Other French projects



