talnarchives

A French-language digital archive of research articles in Natural Language Processing.

Modeling infant segmentation of two morphologically diverse languages

Georgia Rengina Loukatou, Sabine Stoll, Damian Blasi, Alejandrina Cristia

Abstract: A rich literature explores unsupervised segmentation algorithms that infants could use to parse their input, focusing mainly on English, an analytic language where word, morpheme, and syllable boundaries often coincide. Synthetic languages, where words are multi-morphemic, may present unique difficulties for segmentation. Our study tests corpora of two languages selected to differ in the complexity of their morphological structure, Chintang and Japanese. We use three conceptually diverse word segmentation algorithms and evaluate them on both word- and morpheme-level representations. As predicted, results for the morphologically simpler Japanese are better than those for the more complex Chintang. However, the difference is small compared to the effect of the algorithm (with the lexical algorithm outperforming sub-lexical ones) and of the evaluation level (scores were lower when evaluating on words than on morphemes). There are also important interactions between language, model, and evaluation level, which ought to be considered in future work.

Keywords: cross-linguistic variation, statistical learning, word segmentation, language acquisition.
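
To make the idea of evaluating one segmentation against word-level and morpheme-level references concrete, here is a minimal sketch. The abstract does not say which metric the authors use, so boundary precision, recall, and F-score are an assumption here. The function names, the use of characters as basic symbols, and the English toy utterance are likewise hypothetical illustrations, not the paper's data or code.

```python
def boundary_positions(units):
    """Return the set of internal boundary offsets for a list of units."""
    positions = set()
    offset = 0
    for unit in units[:-1]:  # the utterance-final edge is given, not learned
        offset += len(unit)
        positions.add(offset)
    return positions


def boundary_scores(predicted, gold):
    """Boundary precision, recall, and F-score for one utterance.

    Assumption: both segmentations cover the same symbol string, and
    symbols are single characters (a real study might use phonemes).
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("segmentations must cover the same symbol string")
    pred_b = boundary_positions(predicted)
    gold_b = boundary_positions(gold)
    hits = len(pred_b & gold_b)
    precision = hits / len(pred_b) if pred_b else 0.0
    recall = hits / len(gold_b) if gold_b else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f_score


# Hypothetical toy utterance, in English for readability only.
gold_words = ["the", "dogs", "barked"]
gold_morphemes = ["the", "dog", "s", "bark", "ed"]
predicted = ["the", "dog", "s", "barked"]  # output of some imagined algorithm

word_scores = boundary_scores(predicted, gold_words)
morpheme_scores = boundary_scores(predicted, gold_morphemes)
print("word level:", word_scores)
print("morpheme level:", morpheme_scores)
```

In this toy case the same predicted segmentation is penalized for over-segmenting at the word level (precision 2/3) and for missing one boundary at the morpheme level (recall 3/4). This only illustrates how the choice of evaluation level can change the score of a single output; it says nothing about the results reported for Chintang or Japanese.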