Proceedings of the Second International Conference on Arabic Language Resources and Tools

Summary of the paper

Title	Linguistic Resources for Arabic Handwriting Recognition
Authors	Stephanie Strassel
Abstract	MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) is a five-year DARPA program that will produce systems to automatically convert foreign-language text images into English transcripts for use by humans and by downstream processes, including summarization and information extraction. The first two phases of MADCAT focus on handwritten Arabic. The Linguistic Data Consortium (LDC) creates and distributes linguistic resources for MADCAT, including data, annotations, specifications and tools for system training and evaluation. To date, LDC has recruited over 300 scribes from around the Arabic-speaking world to produce handwritten text for MADCAT. A web-based collection toolkit supports scribe recruitment, registration, data assignment and tracking, progress reporting, quality control and compensation, both at LDC and at remote collection sites. Handwritten pages are scanned at high resolution and manually annotated with information including bounding boxes for each line and word on the page. Corresponding digital text and English translations are generated, and the multiple data layers are unified into a single XML output file containing: a text layer consisting of source text, tokenization and sentence segmentation; an image layer consisting of bounding boxes; a scribe demographic layer consisting of scribe ID and partition (train/dev/test); and a document metadata layer. LDC has collected, annotated and distributed over 30,000 handwritten pages thus far, and collection continues at a rapid pace. Most linguistic resources developed for the program will also be published in LDC's catalog, making them generally available to the larger research community; the MADCAT Phase 1 Training Corpus is expected to be published in late 2009.
Topics	Methods, tools and procedures for acquisition, creation, management, access, distribution and use of Arabic LRs; National and international activities and projects on Arabic; Guidelines, standards, specifications, models and best practices for Arabic LRs
Full paper	Linguistic Resources for Arabic Handwriting Recognition
Bibtex	@InProceedings{STRASSEL09.70, author = {Stephanie Strassel}, title = {Linguistic Resources for Arabic Handwriting Recognition}, booktitle = {Proceedings of the Second International Conference on Arabic Language Resources and Tools}, year = {2009}, month = {April}, date = {22-23}, address = {Cairo, Egypt}, editor = {Khalid Choukri and Bente Maegaard}, publisher = {The MEDAR Consortium}, isbn = {2-9517408-5-9}, language = {english} }