dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Kosem, Iztok |
dc.contributor.author | Ponikvar, Primož |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Krek, Simon |
dc.date.accessioned | 2022-11-15T13:59:39Z |
dc.date.available | 2022-11-15T13:59:39Z |
dc.date.issued | 2022-10-28 |
dc.identifier.uri | http://hdl.handle.net/11356/1709 |
dc.description | The SloBERTa-Trendi-Topics model is a text classification model that categorizes news texts with one of 13 topic labels. It was trained on approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. their URLs). The label set was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf. The model was trained on the labeled texts starting from the SloBERTa 2.0 contextual embeddings model (http://hdl.handle.net/11356/1397; also available at HuggingFace: https://huggingface.co/EMBEDDIA/sloberta) and validated on a development set of 1,293 texts using the simpletransformers library with the following hyperparameters: train batch size 8, learning rate 1e-5, max. sequence length 512, number of epochs 2. The model achieves a macro F1-score of 0.94 on a test set of 1,295 texts (best for "črna kronika", "politika", "šport", and "vreme" at 0.98; worst for "prosti čas" at 0.83).
Please note that the fastText-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1710) is also available; it is faster and computationally less demanding, but achieves lower classification accuracy. |
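The description above lists the 13 topic labels and the model's distribution via HuggingFace. A minimal sketch of inference follows; the label order is copied from the description and is an assumption — the model's own config.json id2label mapping is authoritative and may order them differently. The commented-out lines assume the standard `transformers` sequence-classification API for the demo repository cjvt/sloberta-trendi-topics.

```python
# Sketch: topic labels from the record's description; the helper picks the
# highest-scoring label from a list of classification logits.

LABELS = [
    "črna kronika", "gospodarstvo, posel, finance", "izobraževanje",
    "okolje", "prosti čas", "šport", "umetnost, kultura", "vreme",
    "zabava", "zdravje", "znanost in tehnologija", "politika", "družba",
]

def top_label(logits):
    """Return the label whose logit is largest (simple argmax)."""
    return LABELS[max(range(len(logits)), key=lambda i: logits[i])]

# Actual inference (assumption: requires `pip install transformers torch`
# and downloads the ~400 MB model from the demo URI):
# from transformers import pipeline
# clf = pipeline("text-classification", model="cjvt/sloberta-trendi-topics")
# clf("Slovenska nogometna reprezentanca je zmagala s 3:0.")
```

Note that the description states a maximum sequence length of 512 tokens; longer texts should be truncated before classification.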
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf |
dc.rights | Apache License 2.0 |
dc.rights.uri | https://opensource.org/licenses/Apache-2.0 |
dc.rights.label | PUB |
dc.source.uri | https://sled.ijs.si/ |
dc.subject | text classification |
dc.subject | topic classification |
dc.subject | SloBERTa |
dc.subject | Slovenian news articles |
dc.title | Text classification model SloBERTa-Trendi-Topics 1.0 |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/cjvt/sloberta-trendi-topics |
contact.person | Jaka Čibej, jaka.cibej@ijs.si, Jožef Stefan Institute |
sponsor | Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 SLED - Monitor corpus of Slovene and related resources nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds |
files.count | 1 |
files.size | 408050141 |
Files in this item

- Name: sloberta-trendi-topics_1.0.zip
- Size: 389.15 MB
- Format: application/zip
- Description: sloberta-trendi-topics_1.0
- MD5: 5215f6c747bea240e9a8212fadef5730
- sloberta-trendi-topics_1.0
  - sentencepiece.bpe.model (781 kB)
  - pytorch_model.bin (422 MB)
  - tokenizer_config.json (576 B)
  - config.json (1 kB)
  - training_args.bin (3 kB)
  - tokenizer.json (2 MB)
  - special_tokens_map.json (353 B)
  - 00README.txt (2 kB)
  - model_args.json (2 kB)