W0028: Wolverhampton Business English Corpus
The WBE was created by the Computational Linguistics Group at the University of Wolverhampton with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335). A survey of electronic language resources in the business domain, carried out at Wolverhampton, revealed that very few business corpora exist and that almost none of them are widely accessible. There is significant demand for a business corpus from both the NLP and pedagogic communities (teachers and students of language, business communication, and linguistics). The Wolverhampton Corpus of Written Business English is:
The corpus consists of 10,186,259 words from 23 different Web sites. The data can contribute to a wide range of NLP tasks, including information retrieval, information extraction, and summarisation. The WBE was built solely from Web materials. This does not mean, however, that the corpus covers only a restricted range of text categories. On the contrary, the amount of information available online allowed us to select from a wide variety of categories, ranging from product descriptions, company press releases, and annual financial reports to business journalism, academic research papers, political speeches and government reports. The texts have been grouped according to the source site. The corpus is distributed in three formats.
All the available files were converted to an 8-bit format using ISO 8859-1. Characters with codes from 127 to 255 (also known as Extended ASCII) were manually checked to ensure that they were represented correctly. The corpus was checked for spelling errors, with special care taken that variant spellings specific to the business domain were not wrongly corrected. Validation was carried out by an external validator and consisted of checking the text files, tools, tagging and documentation.
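The character check described above could be prepared along the following lines. This is only a minimal sketch, not the procedure used to build the corpus: the function name `high_byte_report` and the sample sentence are illustrative assumptions. It decodes bytes as ISO 8859-1 and counts every character whose code falls in the range 127 to 255, so that a person can then review each one by hand.

```python
from collections import Counter


def high_byte_report(data: bytes) -> Counter:
    """Count characters with codes 127..255 in ISO 8859-1 encoded bytes.

    Hypothetical helper for illustration; not part of the corpus tools.
    """
    # In ISO 8859-1 every byte maps to exactly one code point,
    # so decoding cannot fail and ord(ch) equals the byte value.
    text = data.decode("iso-8859-1")
    return Counter(ch for ch in text if 127 <= ord(ch) <= 255)


# Made-up sample text standing in for a corpus file.
sample = "Société Générale reported a £2m profit.".encode("iso-8859-1")
report = high_byte_report(sample)

# List each flagged character with its code point and frequency.
for ch, count in sorted(report.items()):
    print(f"U+{ord(ch):04X} {ch!r} x{count}")
```

For a real file, the bytes would be read with `open(path, "rb").read()` and passed to the same function; the report only lists candidates, and the judgement on whether each character is correct remains manual.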