dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Kosem, Iztok |
dc.contributor.author | Ponikvar, Primož |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Krek, Simon |
dc.date.accessioned | 2022-11-15T13:59:39Z |
dc.date.available | 2022-11-15T13:59:39Z |
dc.date.issued | 2022-10-28 |
dc.identifier.uri | http://hdl.handle.net/11356/1709 |
dc.description | The SloBERTa-Trendi-Topics model is a text classification model that categorizes news texts with one of 13 topic labels. It was trained on approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. their URLs). The label set was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf. The model was trained on the labeled texts starting from the SloBERTa 2.0 contextual embeddings model (http://hdl.handle.net/11356/1397; also available at HuggingFace: https://huggingface.co/EMBEDDIA/sloberta) and validated on a development set of 1,293 texts using the simpletransformers library with the following hyperparameters: train batch size 8, learning rate 1e-5, max. sequence length 512, number of epochs 2. The model achieves a macro F1-score of 0.94 on a test set of 1,295 texts (best for "črna kronika", "politika", "šport", and "vreme" at 0.98; worst for "prosti čas" at 0.83).
Please note that the fastText-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1710) is also available; it is faster and computationally less demanding, but achieves lower classification accuracy. |
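The description above lists the 13 topic labels and the model's distribution via HuggingFace. A minimal sketch of inference follows; the label order is copied from the description and is an assumption — the model's own config.json id2label mapping is authoritative and may order them differently. The commented-out lines assume the standard `transformers` sequence-classification API for the demo repository cjvt/sloberta-trendi-topics.

```python
# Sketch: topic labels from the record's description; the helper picks the
# highest-scoring label from a list of classification logits.

LABELS = [
    "črna kronika", "gospodarstvo, posel, finance", "izobraževanje",
    "okolje", "prosti čas", "šport", "umetnost, kultura", "vreme",
    "zabava", "zdravje", "znanost in tehnologija", "politika", "družba",
]

def top_label(logits):
    """Return the label whose logit is largest (simple argmax)."""
    return LABELS[max(range(len(logits)), key=lambda i: logits[i])]

# Actual inference (assumption: requires `pip install transformers torch`
# and downloads the ~400 MB model from the demo URI):
# from transformers import pipeline
# clf = pipeline("text-classification", model="cjvt/sloberta-trendi-topics")
# clf("Slovenska nogometna reprezentanca je zmagala s 3:0.")
```

Note that the description states a maximum sequence length of 512 tokens; longer texts should be truncated before classification.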
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf |
dc.rights | Apache License 2.0 |
dc.rights.uri | https://opensource.org/licenses/Apache-2.0 |
dc.rights.label | PUB |
dc.source.uri | https://sled.ijs.si/ |
dc.subject | text classification |
dc.subject | topic classification |
dc.subject | SloBERTa |
dc.subject | Slovenian news articles |
dc.title | Text classification model SloBERTa-Trendi-Topics 1.0 |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/cjvt/sloberta-trendi-topics |
contact.person | Jaka Čibej, jaka.cibej@ijs.si, Jožef Stefan Institute |
sponsor | Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 SLED - Monitor corpus of Slovene and related resources nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds |
files.count | 1 |
files.size | 408050141 |
Files in this item

- Name: sloberta-trendi-topics_1.0.zip
- Size: 389.15 MB
- Format: application/zip
- Description: sloberta-trendi-topics_1.0
- MD5: 5215f6c747bea240e9a8212fadef5730
- sloberta-trendi-topics_1.0
  - sentencepiece.bpe.model (781 kB)
  - pytorch_model.bin (422 MB)
  - tokenizer_config.json (576 B)
  - config.json (1 kB)
  - training_args.bin (3 kB)
  - tokenizer.json (2 MB)
  - special_tokens_map.json (353 B)
  - 00README.txt (2 kB)
  - model_args.json (2 kB)