W0037: The EMILLE/CIIL Corpus
The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora.
The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu.
The annotated component includes the Urdu monolingual and parallel corpora annotated for parts-of-speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
References: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. ‘Developing Asian language corpora: standards and practice’ in Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya.
For more information on the EMILLE project: http://bowland-files.lancs.ac.uk/corplang/emille/
This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 The EMILLE Lancaster Corpus.