talnarchives

A French-language digital archive of research papers in Natural Language Processing.

Exploring Data-Centric Strategies for French Patent Classification: A Baseline and Comparisons

You Zuo, Benoît Sagot, Kim Gerdes, Houda Mouzoun, Samir Ghamri Doudane

Abstract: This paper proposes a novel approach to French patent classification that leverages data-centric strategies. We compare different approaches at the two deepest levels of the International Patent Classification (IPC) hierarchy: groups and subgroups. Our experiments show that while simple ensemble strategies work for shallower levels, deeper levels require more sophisticated techniques such as data augmentation, clustering, and negative sampling. These results highlight the importance of language-specific features and data-centric strategies for accurate and reliable French patent classification, and they offer practical insights for researchers and practitioners in the field.

Keywords: Patent Classification, Extreme Multilabel Text Classification, Deep Learning