Show simple item record

 
dc.contributor.author Kosem, Iztok
dc.contributor.author Čibej, Jaka
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Ponikvar, Primož
dc.contributor.author Šinkec, Mihael
dc.contributor.author Krek, Simon
dc.date.accessioned 2023-12-09T17:51:13Z
dc.date.available 2023-12-09T17:51:13Z
dc.date.issued 2023-12-09
dc.identifier.uri http://hdl.handle.net/11356/1904
dc.description The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2023-11 covers the period from January 2019 to November 2023, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: the text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), the text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers. This version adds texts from October and November 2023.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby http://euralex.org/wp-content/themes/euralex/proceedings/Euralex%202022/EURALEX2022_Pr_p230-239_Kosem.pdf
dc.relation.isreferencedby https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
dc.relation.isreferencedby https://doi.org/10.4312/slo2.0.2023.1.161-188
dc.relation.replaces http://hdl.handle.net/11356/1879
dc.relation.isreplacedby http://hdl.handle.net/11356/1906
dc.source.uri https://sled.ijs.si/
dc.subject monitor corpus
dc.subject news corpus
dc.subject universal dependencies
dc.subject temporal trends
dc.subject topic attribution
dc.title Monitor corpus of Slovene Trendi 2023-11
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files no
branding CLARIN.SI data & tools
contact.person Iztok Kosem iztok.kosem@ijs.si Jožef Stefan Institute
sponsor Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 SLED - Monitor corpus of Slovene and related resources nationalFunds
sponsor University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 831598256 tokens
size.info 696711581 words
size.info 2114122 texts
files.count 0
files.size 0