talnarchives

A French-language digital archive of research articles in Natural Language Processing.

Automating the Measurement of Linguistic Features to Help Classify Texts as Technical

Terry Copeck, Ken Barker, Sylvain Delisle, Stan Szpakowicz

Abstract: Text classification plays a central role in software systems that perform automatic information classification and retrieval. Any mechanism that classifies or characterizes natural language text by topic, style, genre or, in our case, by the degree to which a text is technical must count occurrences of linguistic feature values. We discuss the methodology and key details of the feature value extraction process, paying attention to fast and reliable implementation. Our results are mixed but support continued investigation: while a significant level of automation has been achieved, the successfully extracted feature counts do not always correlate with technicality as strongly as anticipated.
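As a rough illustration of the two steps the abstract mentions, counting feature values in a text and checking how well a count correlates with technicality, the following is a minimal sketch. The features chosen here (long words, second-person pronouns) and the function names `count_features` and `pearson` are assumptions made for the example; they are not the feature set or the implementation described in the paper.

```python
import re
from math import sqrt


def count_features(text):
    """Count a few illustrative surface features of a text.

    These features are hypothetical stand-ins, not the paper's inventory.
    """
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive tokenization: runs of ASCII letters only.
    words = re.findall(r"[A-Za-z]+", text)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "long_words": sum(1 for w in words if len(w) >= 10),
        "second_person": sum(1 for w in words if w.lower() in {"you", "your"}),
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)
```

In use, one would compute a feature count for each text in a collection, pair it with a human-assigned technicality rating for the same text, and pass the two lists to `pearson`; a coefficient near zero would correspond to the weak correlations the abstract reports for some features.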