talnarchives

A French-language digital archive of research articles in Natural Language Processing.

Maninka Reference Corpus: A Presentation

Valentin Vydrin, Andrij Rovenchak, Kirill Maslinsky

Abstract: An annotated corpus of Guinean Maninka, the Corpus Maninka de Référence (CMR), was published in April 2016. It comprises two subcorpora: one contains texts originally written in a Latin-based script (792,778 words), and the other consists of texts in the N'ko alphabet (3,105,879 words). Both subcorpora are searchable in both the Latin-based script and N'ko. The CMR was built with the Daba software package, developed earlier for the Corpus Bambara de Référence. The search tool is NoSketchEngine, which was adapted to the right-to-left direction of N'ko writing. All N'ko texts were obtained in electronic format, and most of them were converted from pre-Unicode fonts. The morphological annotation is based on the Malidaba electronic dictionary, which is at an intermediate stage of compilation; much effort is still needed to bring it to a minimally acceptable state.

Keywords: Corpus Maninka de Référence, N'ko, Malidaba, corpus building.

Actes de la conférence conjointe JEP-TALN-RECITAL 2016, volume 11 : TALAF
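The abstract mentions that most N'ko texts were converted from pre-Unicode fonts. The sketch below illustrates the general idea of such a conversion as a character-by-character table lookup. It is only an illustration under stated assumptions: the mapping table `LEGACY_TO_UNICODE` and the helper names are hypothetical, since the abstract does not describe the actual legacy fonts, their layouts, or how the conversion was implemented in the CMR.

```python
# Hypothetical legacy-font table: the real pre-Unicode N'ko fonts and their
# character layouts are not described in the abstract, so these pairs are
# invented purely for illustration.
LEGACY_TO_UNICODE = {
    "N": "\u07D2",  # N'ko letter N
    "k": "\u07DE",  # N'ko letter KA
    "o": "\u07CF",  # N'ko letter OO
}


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Map each legacy character to its Unicode N'ko equivalent.

    Characters missing from the table pass through unchanged, so spaces,
    punctuation, and unmapped glyphs stay visible for manual review.
    """
    return "".join(table.get(ch, ch) for ch in text)


def is_nko(ch):
    """Return True if ch lies in the Unicode N'ko block (U+07C0..U+07FF)."""
    return 0x07C0 <= ord(ch) <= 0x07FF


converted = legacy_to_unicode("Nko")
```

Because Unicode stores text in logical order, the converted string needs no reversal; right-to-left display is left to the rendering layer. A real converter would also have to handle tone diacritics and any font-specific ordering quirks, which this sketch does not attempt.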