Multilingual text genre classification model X-GENRE

Name: Multilingual text genre classification model X-GENRE
License: https://creativecommons.org/licenses/by-sa/4.0/

Kuzman, Taja; Ljubešić, Nikola

dc.contributor.author	Kuzman, Taja
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2024-09-25T15:19:44Z
dc.date.available	2024-09-25T15:19:44Z
dc.date.issued	2024-09-25
dc.identifier.uri	http://hdl.handle.net/11356/1961
dc.description	The X-GENRE classifier is a text classification model for automatic genre identification. It assigns each text one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (see the provided README file for details on the labels).
The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian. Zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on all other languages supported by the XLM-RoBERTa model (see the Appendix of the following paper for the list of covered languages: https://arxiv.org/abs/1911.02116).
The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of the English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising around 1,800 instances of Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) using the following hyperparameters:
- Train batch size: 8
- Learning rate: 1e-5
- Max. sequence length: 512
- Number of epochs: 15
For optimum performance, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words), predictions of the label "Other" should be disregarded, and only predictions made with confidence higher than 0.8 should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on the English and Slovenian test sets respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek and Icelandic (zero-shot cross-lingual scenario).
Refer to the provided README file for instructions with code examples on how to use the model.
dc.language.iso	eng
dc.language.iso	slv
dc.language.iso	hrv
dc.language.iso	mkd
dc.language.iso	cat
dc.language.iso	ell
dc.language.iso	isl
dc.language.iso	sqi
dc.language.iso	tur
dc.language.iso	ukr
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://doi.org/10.3390/make5030059
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://macocu.eu/
dc.subject	text classification
dc.subject	automatic genre identification
dc.subject	genre classification
dc.subject	genre
dc.subject	fine-tuned massively multilingual pretrained language model
dc.title	Multilingual text genre classification model X-GENRE
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier
contact.person	Taja Kuzman taja.kuzman@ijs.si Jožef Stefan Institute
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) N06-0099 and FWO-G070619N Linguistic landscape of hate speech on social media nationalFunds
files.count	1
files.size	817815064
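
The three post-processing steps recommended in the description (minimum document length, discarding "Other", and a confidence threshold) can be sketched as a small filter. This is only an illustrative sketch, not the code from the README: the function name `postprocess` and the constant names are hypothetical, word counting by whitespace splitting is an assumption, and the label and confidence are assumed to come from whatever inference code is used to run the model.

```python
from typing import Optional

# Thresholds taken from the record's description; the names are hypothetical.
MIN_WORDS = 75          # rule of thumb: at least 75 words
MIN_CONFIDENCE = 0.8    # keep only predictions with confidence higher than 0.8
DISCARDED_LABEL = "Other"


def postprocess(text: str, label: str, confidence: float) -> Optional[str]:
    """Return the predicted label if it passes all three checks, else None."""
    # Assumption: words are approximated by whitespace-separated tokens.
    if len(text.split()) < MIN_WORDS:
        return None
    # Predictions of the label "Other" are disregarded.
    if label == DISCARDED_LABEL:
        return None
    # The confidence must be strictly higher than the threshold.
    if confidence <= MIN_CONFIDENCE:
        return None
    return label


if __name__ == "__main__":
    # Made-up input purely for illustration.
    sample_text = " ".join(["word"] * 100)
    print(postprocess(sample_text, "News", 0.93))
```

Returning `None` for rejected predictions keeps the filter easy to combine with any prediction loop; the thresholds are module-level constants so they can be adjusted for other data.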

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: X-GENRE-classifier.zip
Size: 779.93 MB
Format: application/zip
Description: The files for the X-GENRE classifier and the README file
MD5: bb31cd768a69b3e217e151db68b9effc
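
The MD5 value listed above can be used to check that the archive was downloaded intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_expected` are hypothetical, and the local file path is assumed to be wherever the archive was saved.

```python
import hashlib

# Checksum as listed in this record for X-GENRE-classifier.zip.
EXPECTED_MD5 = "bb31cd768a69b3e217e151db68b9effc"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so the large archive is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("X-GENRE-classifier.zip")` would return `True` for an uncorrupted download, assuming the archive sits in the current directory.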

File Preview

X-GENRE-classifier
- README.md (5 kB)
- sentencepiece.bpe.model (4 MB)
- pytorch_model.bin (1 GB)
- tokenizer_config.json (477 B)
- config.json (1 kB)
- training_args.bin (3 kB)
- tokenizer.json (16 MB)
- special_tokens_map.json (280 B)
- model_args.json (2 kB)
