talnarchives

A French-language digital archive of research articles in Natural Language Processing.

TArC. A Tunisian arabish corpus

Elisa Gugliotta, Marco Dinarelli

Abstract: TArC: Incrementally and Semi-Automatically Collecting a Tunisian arabish Corpus. This article describes the collection process of the first morpho-syntactically annotated Tunisian arabish Corpus (TArC). Arabish is a spontaneous encoding of Arabic dialects (AD) in Latin characters and arithmographs (numbers used as letters). This code-system was developed by Arabic-speaking social media users to facilitate communication on digital devices. Arabish differs from one Arabic dialect to another, and each arabish code-system is under-resourced. In the last few years, attention to AD in NLP has increased considerably. TArC will thus be a useful resource for different types of analyses, as well as for training NLP tools. In this article we describe preliminary work on the semi-automatic construction process of TArC and some of the first analyses of the corpus. To provide a complete overview of the challenges faced during the building process, we present the main characteristics of the Tunisian dialect and its encoding in Tunisian arabish.

Keywords: Tunisian arabish Corpus, Arabic Dialect, Arabizi.

Volume 2: Traitement Automatique des Langues Naturelles, pages 232–240.