talnarchives

A French-language digital archive of research articles in Natural Language Processing.

Language Independent Automatic Acquisition of Rigid Multiword Units from Unrestricted Text Corpora

Gaël Dias, Sylvie Guilloré, Gabriel Pereira Lopes

Abstract: Multiword units are groups of words that occur together more often than expected by chance in sub-languages. Président de la république, coupe du monde and Traité de Maastricht are examples of multiword units. Unfortunately, most machine-readable dictionaries contain clearly insufficient information about multiword units. Their automatic extraction from corpora is therefore an important issue not only for natural language processing but also for applications in Information Retrieval, Information Extraction and Machine Translation. In this paper, we propose a new extraction system based on a new association measure, the Mutual Expectation, and a new acquisition process based on a local-maxima algorithm, the LocalMax algorithm.