How I Met Your Snowclone: Unsupervised Discovery of Snowclone Patterns in Large Datasets
Julien Bezançon, Gaël Lejeune, Marceau Hernandez
Abstract: Snowclones are a type of Multiword Expression (MWE) pattern that includes open slots, i.e. positions that can be filled with various words. A key feature of snowclones is that the original MWE remains recognizable and carries its meaning into the new form. However, previous work has not shown whether such substitutions are limited to fixed positions. In this paper, we propose to use Locality Sensitive Hashing to automatically extract snowclone patterns from the non-commercial IMDb dataset. This process results in the creation of the FROST lexicon, comprising 30,826 pattern candidates and 1,059,824 snowclone candidates across 30 languages. We then annotate 1,500 discovered patterns and 1,000 snowclones from the FROST lexicon to assess its quality. Our findings suggest that most substitutions in snowclones occur at consistent positions. This work provides the first large-scale lexicon of snowclone-based MWEs and a method that can support future research on MWE and snowclone discovery.
Keywords: snowclone, multiword expressions, locality sensitive hashing, lexicon
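
The abstract names Locality Sensitive Hashing as the extraction technique but does not give its configuration. As a rough illustration only, the sketch below shows one common LSH variant (MinHash with banding) grouping near-duplicate titles, then deriving a pattern with an open slot from a group. The token-set features, the MD5-based hash family, the `num_perm` and `bands` values, and every function name are assumptions made for this example, not the authors' pipeline.

```python
import hashlib
from collections import defaultdict


def shingles(title):
    """Token set of a lowercased title (assumed feature choice)."""
    return set(title.lower().split())


def minhash(features, num_perm=16):
    """MinHash signature: per seeded hash, keep the minimum over all features.

    Assumes `features` is non-empty.
    """
    return tuple(
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features
        )
        for seed in range(num_perm)
    )


def lsh_buckets(titles, num_perm=16, bands=8):
    """Group titles whose signatures agree on at least one band."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for title in titles:
        sig = minhash(shingles(title), num_perm)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(title)
    # Keep only buckets that actually pair up titles; deduplicate across bands.
    return {frozenset(group) for group in buckets.values() if len(group) > 1}


def pattern(group):
    """Replace token positions that vary across same-length titles with '*'."""
    token_lists = [t.lower().split() for t in group]
    if len({len(tokens) for tokens in token_lists}) != 1:
        return None  # this toy alignment only handles equal-length titles
    return " ".join(
        col[0] if len(set(col)) == 1 else "*" for col in zip(*token_lists)
    )


if __name__ == "__main__":
    titles = ["How I Met Your Mother", "How I Met Your Father", "Breaking Bad"]
    for group in lsh_buckets(titles):
        print(sorted(group), "->", pattern(group))
```

Two titles with Jaccard similarity s share a given band with probability s^rows, so raising `bands` (fewer rows per band) makes the grouping more permissive. The positional alignment in `pattern` is deliberately naive: it only marks a slot where same-length titles differ at the same token index.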