dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2024-09-25T15:19:44Z |
dc.date.available | 2024-09-25T15:19:44Z |
dc.date.issued | 2024-09-25 |
dc.identifier.uri | http://hdl.handle.net/11356/1961 |
dc.description | The X-GENRE classifier is a text classification model for automatic genre identification. The model assigns each text one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the provided README file for details on the labels). The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian, and zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on all other languages supported by the XLM-RoBERTa model (see the Appendix of the following paper for the list of covered languages: https://arxiv.org/abs/1911.02116). The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of the English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising around 1,800 instances of Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) with the following hyperparameters: train batch size: 8; learning rate: 1e-5; max. sequence length: 512; number of epochs: 15. For optimum performance, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words), predictions of the label "Other" should be disregarded, and only predictions made with confidence higher than 0.8 should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on the English and Slovenian test sets, respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek and Icelandic (zero-shot cross-lingual scenario). 
Refer to the provided README file for instructions with code examples on how to use the model. |
dc.language.iso | eng |
dc.language.iso | slv |
dc.language.iso | hrv |
dc.language.iso | mkd |
dc.language.iso | cat |
dc.language.iso | ell |
dc.language.iso | isl |
dc.language.iso | sqi |
dc.language.iso | tur |
dc.language.iso | ukr |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://doi.org/10.3390/make5030059 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://macocu.eu/ |
dc.subject | text classification |
dc.subject | automatic genre identification |
dc.subject | genre classification |
dc.subject | genre |
dc.subject | fine-tuned massively multilingual pretrained language model |
dc.title | Multilingual text genre classification model X-GENRE |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier |
contact.person | Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute |
sponsor | Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) N06-0099 and FWO-G070619N Linguistic landscape of hate speech on social media nationalFunds |
files.count | 1 |
files.size | 817815064 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: X-GENRE-classifier.zip
- Size: 779.93 MB
- Format: application/zip
- Description: The files for the X-GENRE classifier and the README file
- MD5: bb31cd768a69b3e217e151db68b9effc
- X-GENRE-classifier
  - README.md (5 kB)
  - sentencepiece.bpe.model (4 MB)
  - pytorch_model.bin (1 GB)
  - tokenizer_config.json (477 B)
  - config.json (1 kB)
  - training_args.bin (3 kB)
  - tokenizer.json (16 MB)
  - special_tokens_map.json (280 B)
  - model_args.json (2 kB)
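
The post-processing recommended in the description (apply the classifier only to texts of at least 75 words, disregard the label "Other", and keep only predictions with confidence above 0.8) could be sketched roughly as follows. This is a minimal illustration, not the code from the README: the `Prediction` record, the `filter_predictions` function and the whitespace-based word count are assumptions made for this example, and the predictions are assumed to have already been produced by the model.

```python
from typing import Iterable, List, NamedTuple


class Prediction(NamedTuple):
    """Hypothetical record: the input text, its predicted genre and the confidence."""
    text: str
    label: str
    confidence: float


def filter_predictions(
    predictions: Iterable[Prediction],
    min_words: int = 75,            # rule of thumb from the description
    min_confidence: float = 0.8,    # keep only confidence strictly above this
    ignored_label: str = "Other",   # label whose predictions are disregarded
) -> List[Prediction]:
    """Keep only predictions that pass all three recommended checks."""
    kept = []
    for pred in predictions:
        # Assumption: words are counted by splitting on whitespace.
        if len(pred.text.split()) < min_words:
            continue
        if pred.label == ignored_label:
            continue
        if pred.confidence <= min_confidence:
            continue
        kept.append(pred)
    return kept


if __name__ == "__main__":
    long_text = " ".join(["word"] * 80)  # placeholder text of 80 words
    sample = [
        Prediction(long_text, "News", 0.95),
        Prediction("too short", "News", 0.99),
        Prediction(long_text, "Other", 0.99),
        Prediction(long_text, "Forum", 0.60),
    ]
    for pred in filter_predictions(sample):
        print(pred.label, pred.confidence)
```

The thresholds are exposed as parameters so that they can be adjusted; the defaults mirror the values stated in the description.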