Show simple item record

 
dc.contributor.author Kuzman, Taja
dc.contributor.author Čibej, Jaka
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Kosem, Iztok
dc.contributor.author Ponikvar, Primož
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Krek, Simon
dc.date.accessioned 2022-11-15T16:28:16Z
dc.date.available 2022-11-15T16:28:16Z
dc.date.issued 2022-10-28
dc.identifier.uri http://hdl.handle.net/11356/1710
dc.description The fastText-Trendi-Topics model is a text classification model that assigns one of 13 topic labels to news texts. It was trained on a set of approximately 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. their URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022, https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf). The model was trained on the labeled texts with the fastText library, using the word embeddings CLARIN.SI-embed.sl 1.0 (http://hdl.handle.net/11356/1204), 1000 epochs, and default values for the remaining hyperparameters, and was validated on a development set of 1,293 texts (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code). The model achieves a macro-F1 score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67). Note that the SloBERTa-Trendi-Topics 1.0 text classification model (http://hdl.handle.net/11356/1709) is also available; it achieves higher classification accuracy but is slower and more computationally demanding.
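The description above implies the usual fastText inference workflow: load the trained model and ask it for the most probable topic labels of a text. The following is a minimal sketch of that workflow, not the authors' code (which is linked above). It assumes the `fasttext` Python package is installed and that the model stores its labels with fastText's default `__label__` prefix. The helper names `load_topic_model`, `clean_label` and `classify` are hypothetical, and the exact label strings stored in the model are not stated in this record.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix (assumed unchanged here)


def load_topic_model(model_path: str):
    """Load a fastText model; model_path is wherever the unpacked model file lives."""
    import fasttext  # imported lazily so the helpers below work without the package

    return fasttext.load_model(model_path)


def clean_label(raw_label: str) -> str:
    """Strip the fastText label prefix, e.g. '__label__vreme' -> 'vreme'."""
    if raw_label.startswith(LABEL_PREFIX):
        return raw_label[len(LABEL_PREFIX):]
    return raw_label


def classify(model, text: str, k: int = 1) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs for one text."""
    # fastText's predict() processes one line at a time and rejects newlines,
    # so all whitespace runs are collapsed into single spaces first.
    single_line = " ".join(text.split())
    labels, probabilities = model.predict(single_line, k=k)
    return [
        (clean_label(label), float(probability))
        for label, probability in zip(labels, probabilities)
    ]
```

With the archive unpacked, `classify(load_topic_model(path), "Jutri bo sončno.", k=3)` would return the three most probable topics for a short weather sentence. The name of the model file inside the archive is not given in this record, so `path` has to be taken from the unpacked contents.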
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
dc.rights Apache License 2.0
dc.rights.uri https://opensource.org/licenses/Apache-2.0
dc.rights.label PUB
dc.source.uri https://sled.ijs.si/
dc.subject text classification
dc.subject Slovenian news articles
dc.subject fastText
dc.subject topic classification
dc.title Text classification model fastText-Trendi-Topics 1.0
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN.SI data & tools
contact.person Jaka Čibej jaka.cibej@ijs.si Jožef Stefan Institute
sponsor Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 SLED - Monitor corpus of Slovene and related resources nationalFunds
sponsor University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
files.count 1
files.size 933399627


 Files in this item

This item is Publicly Available and licensed under: Apache License 2.0
Name: fasttext-trendi-topics_1.0.zip
Size: 890.16 MB
Format: application/zip
Description: fasttext-trendi-topics_1.0
MD5: 242d1aec69e009acdabc64119f628da9
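
The MD5 checksum listed above can be used to confirm that the roughly 890 MB archive was downloaded completely. The following is a minimal sketch of such a check. The function names are hypothetical, and the expected byte size assumes that the `files.size` value in this record (933399627) refers to this single archive.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "242d1aec69e009acdabc64119f628da9"  # MD5 listed in this record
EXPECTED_SIZE = 933399627  # files.size from this record, assumed to be bytes


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5: str = EXPECTED_MD5,
              expected_size: int = EXPECTED_SIZE) -> bool:
    """Check the cheap size comparison first, then the checksum."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5
```

Calling `is_intact("fasttext-trendi-topics_1.0.zip")` on the downloaded file should return `True` before the archive is unpacked. MD5 is adequate here for detecting a truncated or corrupted download, not for security guarantees.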
