Summary of the paper

Title MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization
Authors Nizar Habash, Owen Rambow and Ryan Roth
Abstract We describe the MADA+TOKAN toolkit, a versatile and freely available system that can derive extensive morphological and contextual information from raw Arabic text, and then use this information for a multitude of crucial NLP tasks. Applications include high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing. MADA operates by examining a list of all possible analyses for each word, and then selecting the analysis that matches the current context best by means of support vector machine models that use 19 distinct, weighted morphological features. The selected analyses carry complete diacritic, lexemic, glossary and morphological information; thus all disambiguation decisions are made in one step. TOKAN takes the information provided by MADA to generate tokenized output in a wide variety of customizable formats. MADA, TOKAN and their support utilities are highly configurable, allowing users to extract and manipulate the exact information that they require. In this article we describe the features and capabilities of MADA+TOKAN, detail recent improvements, and provide examples of the toolkit's use.
Topics Availability and use of generic vs. task/domain specific LRs,
Monolingual and multilingual LRs,
Guidelines, standards, specifications, models and best practices for Arabic LRs
Full paper MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization
Bibtex @InProceedings{HABASH09.24,
  author = {Nizar Habash and Owen Rambow and Ryan Roth},
  title = {MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization},
  booktitle = {Proceedings of the Second International Conference on Arabic Language Resources and Tools},
  year = {2009},
  month = {April},
  date = {22-23},
  address = {Cairo, Egypt},
  editor = {Khalid Choukri and Bente Maegaard},
  publisher = {The MEDAR Consortium},
  isbn = {2-9517408-5-9},
  language = {english}
  }
