Text Tokenization for Knowledge-free Automatic Extraction of Lexical Similarities

Aristomenis Thanopoulos, Nikos Fakotakis, George Kokkinakis

Abstract : Previous studies on the automatic extraction of lexical similarities have treated the word as the semantic unit of text. However, the theory of contextual lexical semantics implies that larger segments of text, namely non-compositional multiwords, are more appropriate for this role. We tested the applicability of this notion experimentally by applying automatic collocation extraction to identify and merge such multiwords prior to the similarity estimation process. Using an automatic WordNet-based comparative evaluation scheme along with a manual evaluation procedure, we find that the extracted similarity relations improve.

Keywords : Automatic methods, lexical similarity extraction, collocation extraction, automatic evaluation