talnarchives

A French-language digital archive of research articles in Natural Language Processing.

Topic Identification Challenge Based on Short Word History

Armelle Brun, Kamel Smaïli, Jean-Paul Haton

Abstract: This paper presents several methods for topic detection on newspaper articles, based on either a general vocabulary or a set of topic vocabularies. These topic detection methods are intended for use within a speech recognition framework. The originality and the difficulty of our work lie in the fact that both training and test corpora contain few words (fewer than 200 words for test corpora). Test corpora are very small because our objective is to identify the topic and adapt the language model after only a few words have been uttered. Experiments show that below 60 words, topic detection methods are not reliable. From 80 words onward, the topic detection rate reaches 82% for the first two hypotheses, which is promising given our experimental conditions.

Keywords: language models, topic detection, TF-IDF, speech recognition, modèles de langage, détection de thèmes, reconnaissance de la parole