talnarchives

A French-language digital archive of research articles in Natural Language Processing.

Developing Word Embedding Models for Scottish Gaelic

William Lamb, Mark Sinclair

Abstract: We detail a preliminary project on encoding and evaluating word embeddings for Scottish Gaelic. Word embedding methodologies show promise for diverse natural language processing (NLP) tasks and can be built from raw, unstructured text. Accordingly, they are attractive for under-resourced languages like Gaelic. We instantiated three embedding models on two versions of a 5.8 million token corpus: 1) tokenised and 2) tokenised and lemmatised. Using a simple POS tagger, we quantitatively measured the syntactic similarity between nearest neighbours for each model's vector-space representations of words. We also queried the models to assess their semantic specificity and breadth. Models built from the tokenised corpus exhibited the effects of data sparsity for semantically constrained queries. The lemmatised versions were more semantically robust, but at the expense of inflectional sensitivity. We note divergences between the models and an apparent inverse relationship between their semantic and syntactic capacities. Finally, we highlight the promise of word embeddings for a range of future work and downstream applications.

Published in: Proceedings of the joint conference JEP-TALN-RECITAL 2016, volume 6: CLTW, p. 31.

Keywords: Scottish Gaelic, word embeddings, neural networks, natural language processing, word2vec, part-of-speech tagging.
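
The syntactic evaluation mentioned in the abstract (comparing the POS tags of each word with those of its nearest neighbours in the vector space) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure or the scoring formula, so cosine similarity and a simple tag-agreement rate are assumed here, and the vectors, tags, and function names are hypothetical toy values rather than anything taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k):
    """Return the k words whose vectors are most similar to that of `word`."""
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def pos_agreement(vectors, pos_tags, k=1):
    """Share of (word, neighbour) pairs that carry the same POS tag."""
    matches = 0
    total = 0
    for word in vectors:
        for neighbour in nearest_neighbours(word, vectors, k):
            total += 1
            if pos_tags[word] == pos_tags[neighbour]:
                matches += 1
    return matches / total if total else 0.0


# Hypothetical toy embeddings and tags, for illustration only.
TOY_VECTORS = {
    "taigh": [0.9, 0.1, 0.0],  # noun: 'house'
    "cat": [0.8, 0.2, 0.1],    # noun: 'cat'
    "ruith": [0.1, 0.9, 0.2],  # verb: 'run'
    "leum": [0.2, 0.8, 0.1],   # verb: 'jump'
}
TOY_TAGS = {"taigh": "N", "cat": "N", "ruith": "V", "leum": "V"}

if __name__ == "__main__":
    print(pos_agreement(TOY_VECTORS, TOY_TAGS, k=1))
```

A higher agreement rate would suggest that a model groups words by syntactic category. Running the same function over embeddings trained on the tokenised and the lemmatised corpus would give one way to compare the two, although the paper's actual procedure may differ.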