Issue #4

Issue #4 | January 2023

Content

- Language Resources

- Legal Issues

- ELRA/ELDA Projects

- Evaluation Campaigns

- Dissemination

Language Resources

LRs @ELRA

LRs in the ELRA Catalogue this month

Since October 2022, 2 new Written Corpora and 67 new Speech Resources have been added to our catalogue.

Written Corpora

German Political Speeches Corpus

ISLRN: 381-445-879-769-5

This corpus consists of a collection of political speeches in German crawled from the online archives of the German Presidency (Bundespräsident) and the Chancellery (Bundesregierung). For the German Presidency, the speeches cover the period from July 1, 1984, to February 17, 2012, and the corpus contains a total of 1442 texts comprising 2 392 074 tokens. For the German Chancellery, the corpus contains a total of 1831 texts comprising 3 891 588 tokens, covering the period from December 11, 1998, to December 6, 2011. This part of the corpus contains speeches from the Chancellor but also from other politicians.

Learner Corpus of Portuguese L2 – COPLE2

ISLRN: 936-320-703-366-7

The Learner Corpus of Portuguese as Second/Foreign Language (COPLE2) is a corpus of written and oral texts produced by students of Portuguese as Foreign/Second Language courses in the Instituto de Cultura e Língua Portuguesa (the Institute of Portuguese Language and Culture) (ICLP – FLUL) and by applicants for examinations in the Centro de Avaliação de Português Língua Estrangeira (Center for Evaluation of Portuguese as a Foreign Language) (CAPLE – FLUL). The corpus contains texts from learners with 15 different native languages (L1s) and proficiencies from A1 to C1, and covers different topics and types of tasks. It is encoded in TEI format through the TEITOK environment. The corpus includes at the moment a total of 182,474 tokens and 978 texts, classified according to the CEFR scales. The corpus contains annotations for part of speech, lemma and learner errors. All the information encoded is searchable through the CQP query language.

Speech Resources

LR Agreement with Datatang for 67 Speech Resources

ELRA and Datatang signed a Language Resources distribution agreement to release a total of 67 Speech Resources distributed by ELRA. With this agreement, ELRA is strengthening its position as the leading worldwide distribution centre and Datatang is getting more visibility on the European market.

These resources were designed and collected to boost Speech Recognition in particular. They cover the following languages: Cantonese; Chinese Mandarin; various dialects from China (Changsha, Kunming, Shanghai, Sichuan, Wuhan); several variants of English (English from Australia, Canada, China, France, Germany, India, Italy, Japan, Korea, Latin America, Portugal, Russia, Singapore, Spain, United Kingdom, USA); French; German; Hindi; Indonesian; Italian; Japanese; Korean; Malay; Portuguese (Brazilian); Russian; Spanish (including non-hispanic Spanish); Thai; Uyghur; Vietnamese.

See the detailed list of all 67 Language Resources from Datatang

ISLRN submissions

The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, properly referenced when they are used in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
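As a rough illustration of what such an identifier looks like, the sketch below checks whether a string has the same layout as the two ISLRNs quoted earlier in this issue (13 digits grouped 3-3-3-3-1). The pattern and the function name `looks_like_islrn` are assumptions made for this example; the check covers the layout only and says nothing about whether a number has actually been assigned.

```python
import re

# Assumed layout, inferred from the identifiers quoted in this issue:
# 13 digits written as four groups of three and one final digit.
ISLRN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{3}-[0-9]{3}-[0-9]")


def looks_like_islrn(value: str) -> bool:
    """Return True if value matches the assumed ISLRN layout (format check only)."""
    return ISLRN_PATTERN.fullmatch(value.strip()) is not None


if __name__ == "__main__":
    # The first two strings are the ISLRNs listed above; the third is malformed.
    for candidate in ("381-445-879-769-5", "936-320-703-366-7", "381445879769-5"):
        print(candidate, looks_like_islrn(candidate))
```

A check like this can catch copy-and-paste mistakes in a reference list before a paper is submitted, but only the ISLRN portal can confirm that an identifier is registered.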

Latest figures

82 new ISLRN numbers assigned between October and December 2022
A total of 3342 ISLRN numbers assigned since January 2014
A total of 270 distinct languages.

The latest LRs for which an ISLRN number was requested and accepted are as follows:

More about ISLRN.

Legal Issues

Publication of the EU-US Draft Adequacy Decision

Following the Schrems II ruling, which invalidated the Privacy Shield framework for transatlantic transfers of personal data, the US and EU authorities have been working to put a new framework in place.

On December 13, 2022, the European Commission published a draft adequacy decision detailing the new redress mechanisms and other developments introduced by the American authorities in order to comply with the GDPR.

The next steps are the submission to the European Data Protection Board for its opinion and the approval by a committee of Member States representatives.

It is also possible that individuals, the Parliament, or the European Council may challenge the validity of this new framework before the European Court of Justice.

Full draft adequacy decision available here.

Out with the old Standard Contractual Clauses for data transfers between EU and non-EU Countries

After issuing new Standard Contractual Clauses (SCCs) for data transfers between EU and non-EU countries on 4 June 2021, the European Commission allowed controllers and processors to keep relying on the earlier version of the SCCs until December 27, 2022, but only for contracts concluded before September 27, 2021.

Now that the deadline has passed, transfers between EU and non-EU countries can only be made pursuant to the new SCCs.

The new SCCs are available here.

Over €300 million fines against Meta group announced by the Irish Data Protection Commission

Following inquiries, the Irish Data Protection Commission (DPC) announced sanctions against the Meta group, to which Facebook, Instagram and WhatsApp belong.

On January 4, 2023, the Irish DPC imposed a €210 million fine on the Facebook service and a €180 million fine on the Instagram service. During this inquiry, it was found that Meta could not rely on the “contract” legal basis to process the personal data of its users and therefore was in breach of its transparency obligations.

On January 19, 2023, the Irish DPC imposed a fine of €5.5 million on the WhatsApp service operated by the Meta group. During this inquiry, it was likewise found that Meta could not rely on the “contract” legal basis to process the personal data of its users and therefore was in breach of its transparency obligations.

Reports of the decisions are available here for the decision published on January 4, 2023 and here for the decision published on January 19, 2023.

Swedish presidency circulates option papers on the Data Act

On January 10, 2023, the Swedish Presidency sought the Member States’ opinions on the most crucial aspects of the upcoming Data Act in order to resolve some of the most pressing outstanding issues.

The paper addresses the following questions:

Exclusion of SMEs from the Act

Business to Government data sharing

Protection of trade secrets

The full report is available here.

Berlin provides its position on the Data Act

Germany submitted its position paper on the adoption of the upcoming Data Act to the Swedish Presidency. The paper covers the following points:

Clarification of the scope of the regulation, especially the products covered by the Act

Overlap and inconsistencies between the Data Act and the GDPR

Differentiation between data sharing conditions of Business to Business (B2B) and Business to Consumer (B2C) use cases

Protection of Trade Secrets

Expansion of the protection against unfair contractual terms to all companies

Contractual freedom regarding cloud switching

The full report is available here.

Event Review - CLARIN Café on the Text and Data Mining Exception

On November 8, 2022, CLARIN organised a CLARIN Café dedicated to the implementation of the Text and Data Mining Exception provided by the new Directive on Copyright in the Digital Single Market.

The event featured presentations by Thomas Margoni from KU Leuven, Toby Bond from Bird & Bird, and Jan Hajic from Prague University.

Thomas Margoni gave an overview of the legislative framework for Text and Data Mining. He argued that the Text and Data Mining Exception, as it is articulated today, does not make the EU market attractive for Text and Data Mining because of legal uncertainties, while it creates a market for right-holders in the downstream markets (AI developments).

Toby Bond presented the state of the legislation in the post-Brexit United Kingdom. He also provided an outlook on the future of the Text and Data Mining legislation. He forecast that the UK government would aim to implement a broad Text and Data Mining exception allowing these operations for both commercial and non-commercial organisations, with no “opt-out” provision.

Jan Hajic presented the High-Performance Language Technology Project (HPLT), whose goal is to gather large amounts of data in 30 languages, create Large Language Models (LLMs) and make them available openly and for free on large repositories for the language community.

The full recording of the event can be found here, and the slides are available here.

ELRA/ELDA Projects

Information on the on-going projects

Conclusion of the ELRC initiative

The European Language Resource Coordination (ELRC) initiative was officially concluded on January 16, 2023. The initial purpose of ELRC was to collect language data within the CEF-AT countries (EU Member States plus Iceland and Norway) to train eTranslation, the European Commission's MT service.

Since the beginning of the initiative in 2014, substantial results have been achieved, as the figures below show:

3,306 LRs available on the ELRC-SHARE repository, 80% of which are freely re-usable.

86 workshops and 6 conferences organized throughout Europe to highlight the importance of language data and language technologies and to promote the collection of multilingual language data.

White Paper

Development of LTs: NER, Speech-to-Text, Social Media Translation, etc. See the CEF AT services page for more information on the available services.

Common European Language Data Space (LDS)

The Common European Language Data Space (LDS) project was launched on January 19, 2023. The 3-year project aims to establish a European platform and marketplace for the collection, creation, sharing and re-use of multilingual and multimodal language data.

The service contract has been established between the European Commission and a four-partner consortium composed of:

German Research Center for Artificial Intelligence (DFKI) (coordinator),
Evaluations and Language Resources Distribution Agency (ELDA),
Athena Research and Innovation Center in Information, Communication and Knowledge Technologies (ILSP),
SIA Tilde

ELRA, through its operational body ELDA, will be involved in several work packages.

More details will be provided soon on the dedicated website. In the meantime, you can subscribe to the @LangDataSpace Twitter account.

Language Technology Solutions - CNECT/LUX/2022/OP/0030

This call for tenders from the European Commission was published within the Digital Europe programme (DIGITAL). It aims to achieve three specific goals: 1. facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites; 2. support the creation of open-source European language speech recognition solutions; 3. carry out market studies on language technologies and widely disseminate their results to foster the take-up of language technologies in Europe.

ELRA, through its operational body ELDA, is involved in two of the funded projects which are described below.

LOT 1 - Solutions Supporting the Use of Automated Translations on Websites

The project was officially launched on December 12, 2022 under the name “European Multilingual Web (EMW)”. The EMW consortium is coordinated by Tilde (Latvia) with the participation of ELDA (Evaluations and Language Resources Distribution Agency, France), IDC (International Data Corporation), Ogilvy (SIA Guilty, Latvia) and Rīga Stradiņš University (Latvia).

It involves four major tasks:

Task 1: carrying out a comprehensive and evidence-based market study on the multilingualism of websites.

Task 2: delivering a set of ready-to-use open-source automated website translation solutions, together with their subsequent maintenance and support (including helpdesk) and regular updating of the relevant documentation, as required.

Task 3: publishing the set of open-source automated website translation solutions developed during Task 2 on a dedicated solutions website, achieving widespread use of the solutions through promotional activities, and building awareness of EU actions to support and nurture multilingualism.

Task 4: developing and implementing the strategy to ensure the sustainability of the set of ready-to-use open-source automated website translation solutions developed or supported under Task 2 after the end of the contract.

LOT 2 – Language Technologies Solutions

This call for tenders implementing the Digital Europe programme (DIGITAL) in the field of language technologies is to achieve three specific goals:

1. facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites;

2. support the creation of open source European language speech recognition solutions;

3. carry out market studies on language technologies and widely disseminate their results in order to foster the take-up of language technologies in Europe.

This call for tenders covers:

the creation and promotion of a set of ready-to-use open-source automated website translation solutions,
the creation of an open-source basic speech recognition prototype solution,
the conduct of market research on language technologies and the wide dissemination of its results.

Evaluation Campaigns

Current campaigns

IWSLT 2023 Evaluation Campaign - https://iwslt.org/2023/

SemEval-2023 - The 17th International Workshop on Semantic Evaluation - https://semeval.github.io/SemEval2023/

VarDial Evaluation Campaign 2023 - https://sites.google.com/view/vardial-2023/shared-tasks

HaSpeeDe 3 (Hate Speech Detection) shared task within Evalita 2023 - http://www.di.unito.it/~tutreeb/haspeede-evalita23/

Dissemination

News from ELRA

Six Board members reached the end of their term in 2022, and elections have been organized to replace them. The 2-step process started in November 2022 with the nomination of 7 candidates by the ELRA members and the ELRA Board and continues in early 2023 with online voting.

The 7 nominees were:

Election results will be shared shortly on ELRA's usual channels, including the @ELRANews Twitter account.

Language Resources and Evaluation Journal

The 4 Regular Issues were published in 2022 in Volume 56:

News from the community

SMaLL-100 Model

SMaLL-100, a Shallow Multilingual MT Model for Low-Resource Languages, is a compact and fast massively multilingual machine translation model covering more than 10K language pairs. It is a distilled version of the large 12B M2M-100 model released by Meta.

Scientists working on NMT for low-resource languages may be interested in SMaLL-100. Pre-trained models are provided as a starting point for developing MT for low-resource language pairs. A demo platform is also available to try MT for those 10,000 language pairs.

Models: https://huggingface.co/alirezamsh/small100

Online MT demo: https://huggingface.co/spaces/alirezamsh/small100

The SMaLL-100 paper, accepted at EMNLP 2022, can be found here: https://arxiv.org/abs/2210.11621

Call for proposals

The European Commission recently published a €20 million call for proposals on Natural Language Understanding and Interaction in Advanced Language Technologies under the HORIZON EUROPE research programme (topic ID: HORIZON-CL4-2023-HUMAN-01-03).

The call will close on 29 March 2023 and the evaluation is expected to happen during April and May 2023.

The call covers the following topics:

Improve context-aware human-machine interaction to increase understanding and exploitation of the interaction context and content in multimodal settings, thus increasing responsiveness of interactive AI solutions, such as smart assistants, conversational and dialogue systems, content generation models, etc.

Support and enhance seamless human-to-human communication across languages e.g., by means of automatic translation or interpretation (incl. automatic subtitling) in real time with a greater understanding of the communication context and the meaning involved in it.

Call for submissions

ERCIM News #132 has just been published.

Featuring the special theme "Cognitive AI and Cobots" and showcasing remarkable achievements from research teams in Europe, this issue was coordinated by guest editors Theodore Patkos (ICS-FORTH) and Zsolt Viharos (SZTAKI).

Submissions to the next issue, #133 (April 2023), on the special theme "Data infrastructures and management", are open until February 28, 2023. See the call for contributions for more details.

ERCIM Twitter: @ercim_news and other social media.
