dc.contributor.author Kuzman, Taja
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2024-09-25T15:19:44Z
dc.date.available 2024-09-25T15:19:44Z
dc.date.issued 2024-09-25
dc.identifier.uri http://hdl.handle.net/11356/1961
dc.description The X-GENRE classifier is a text classification model for automatic genre identification. It assigns each text one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (see the provided README file for details on the labels). The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian, and zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on the other languages supported by the XLM-RoBERTa model (for the list of covered languages, see the Appendix of https://arxiv.org/abs/1911.02116).
The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of an English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising around 1,800 Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) using the following hyperparameters:
Train batch size: 8
Learning rate: 1e-5
Max. sequence length: 512
Number of epochs: 15
For optimum performance, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words), predictions of the label "Other" should be disregarded, and only predictions made with a confidence higher than 0.8 should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on the English and Slovenian test sets, respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek and Icelandic (zero-shot cross-lingual scenario).
Refer to the provided README file for instructions and code examples on how to use the model.
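The three post-processing steps described above (minimum length, dropping "Other", confidence threshold) could be sketched roughly as follows. This is only an illustration, not the code from the README: it assumes that a predicted label and a confidence score are already available for each text, and the names `keep_prediction`, `MIN_WORDS`, `MIN_CONFIDENCE` and `DISCARDED_LABEL` are hypothetical. Word counting by whitespace splitting is likewise an assumption.

```python
from typing import Optional

# Thresholds taken from the description; the constant names are hypothetical.
MIN_WORDS = 75           # rule-of-thumb minimum document length
MIN_CONFIDENCE = 0.8     # keep only predictions with confidence higher than this
DISCARDED_LABEL = "Other"


def keep_prediction(text: str, label: str, confidence: float) -> Optional[str]:
    """Return the label if the prediction passes all three filters, else None."""
    # Assumption: words are approximated by whitespace-separated tokens.
    if len(text.split()) < MIN_WORDS:
        return None
    # Predictions of the catch-all label are disregarded.
    if label == DISCARDED_LABEL:
        return None
    # "Higher than 0.8" is read as a strict inequality.
    if confidence <= MIN_CONFIDENCE:
        return None
    return label
```

Texts for which the function returns None would simply be left without a genre label.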
dc.language.iso eng
dc.language.iso slv
dc.language.iso hrv
dc.language.iso mkd
dc.language.iso cat
dc.language.iso ell
dc.language.iso isl
dc.language.iso sqi
dc.language.iso tur
dc.language.iso ukr
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://doi.org/10.3390/make5030059
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://macocu.eu/
dc.subject text classification
dc.subject automatic genre identification
dc.subject genre classification
dc.subject genre
dc.subject fine-tuned massively multilingual pretrained language model
dc.title Multilingual text genre classification model X-GENRE
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier
contact.person Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) N06-0099 and FWO-G070619N Linguistic landscape of hate speech on social media nationalFunds
files.count 1
files.size 817815064


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name X-GENRE-classifier.zip
Size 779.93 MB
Format application/zip
Description The files for the X-GENRE classifier and the README file
MD5 bb31cd768a69b3e217e151db68b9effc
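The MD5 value listed above can be used to check that the archive was downloaded intact. A minimal sketch using only the Python standard library follows; the helper names `md5_of_file` and `matches_expected` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum as listed in this record for X-GENRE-classifier.zip.
EXPECTED_MD5 = "bb31cd768a69b3e217e151db68b9effc"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the ~780 MB archive is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's digest with the expected one, ignoring letter case."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_expected("X-GENRE-classifier.zip")` on the downloaded file should return True if the download is complete and unmodified.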
 File Preview  
  • X-GENRE-classifier
    • README.md 5 kB
    • sentencepiece.bpe.model 4 MB
    • pytorch_model.bin 1 GB
    • tokenizer_config.json 477 B
    • config.json 1 kB
    • training_args.bin 3 kB
    • tokenizer.json 16 MB
    • special_tokens_map.json 280 B
    • model_args.json 2 kB