Monitor corpus of Slovene Trendi 2023-02

Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon

dc.contributor.author	Kosem, Iztok
dc.contributor.author	Čibej, Jaka
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Erjavec, Tomaž
dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Ponikvar, Primož
dc.contributor.author	Šinkec, Mihael
dc.contributor.author	Krek, Simon
dc.date.accessioned	2023-04-03T10:54:32Z
dc.date.available	2023-04-03T10:54:32Z
dc.date.issued	2023-03-27
dc.identifier.uri	http://hdl.handle.net/11356/1782
dc.description	The Trendi corpus is a monitor corpus of Slovene. It contains news from 107 different media websites, published by 72 different publishers. Trendi 2023-02 covers the period from January 2019 to February 2023, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). All content in the Trendi corpus is currently obtained through the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the topic, or thematic category, which has been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. Text classification models are available at http://hdl.handle.net/11356/1709 (Text classification model SloBERTa-Trendi-Topics 1.0), http://hdl.handle.net/11356/1710 (Text classification model fastText-Trendi-Topics 1.0), and https://huggingface.co/cjvt/sloberta-trendi-topics (SloBERTa model). At the moment, the corpus is not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers.
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby	http://euralex.org/wp-content/themes/euralex/proceedings/Euralex%202022/EURALEX2022_Pr_p230-239_Kosem.pdf
dc.relation.isreferencedby	https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
dc.relation.replaces	http://hdl.handle.net/11356/1681
dc.relation.isreplacedby	http://hdl.handle.net/11356/1879
dc.source.uri	https://sled.ijs.si/
dc.subject	monitor corpus
dc.subject	news corpus
dc.subject	universal dependencies
dc.subject	temporal trends
dc.subject	topic attribution
dc.title	Monitor corpus of Slovene Trendi 2023-02
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
hidden	hidden
has.files	no
branding	CLARIN.SI data & tools
demo.uri	https://www.clarin.si/kontext/query?corpname=trendi
contact.person	Iztok Kosem iztok.kosem@ijs.si Jožef Stefan Institute
sponsor	Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 SLED - Monitor corpus of Slovene and related resources nationalFunds
sponsor	University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	700529632 tokens
size.info	586576992 words
size.info	31141264 sentences
files.count	0
files.size	0
