Monitor corpus of Slovene Trendi 2024-09

 
CLARIN.SI data & tools
  Authors
Kosem, Iztok ; Čibej, Jaka ; Dobrovoljc, Kaja ; Erjavec, Tomaž ; Ljubešić, Nikola ; Ponikvar, Primož ; Šinkec, Mihael ; Krek, Simon
  Item identifier
http://hdl.handle.net/11356/1976
 Project URL
https://sled.ijs.si/
 Referenced by
http://euralex.org/wp-content/themes/euralex/proceedings/Euralex%202022/EURALEX2022_Pr_p230-239_Kosem.pdf
https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
https://doi.org/10.4312/slo2.0.2023.1.161-188
 Date issued
2024-10-04
 Type
corpus, text
 Size
980476572 tokens, 821587340 words, 2499003 texts
 Language(s)
Slovenian
 Description
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-09 covers the period from January 2019 to September 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si). This version adds texts from September 2024.
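Because every text carries one of the 13 topic labels, a monitor corpus like this lends itself to tracking how topics shift over time. The following is a minimal sketch of such a tally, not part of the corpus tooling: the function name `topic_counts_by_month`, the `(iso_date, topic)` record shape, and the sample records are all hypothetical, and the exact label strings in the real data may differ from the English names used here.

```python
from collections import Counter, defaultdict

# The 13 topic categories named in the description; the exact label
# strings used in the corpus itself are an assumption.
TOPICS = (
    "Arts and culture", "Crime and accidents", "Economy", "Environment",
    "Health", "Leisure", "Politics and Law", "Science and Technology",
    "Society", "Sports", "Weather", "Entertainment", "Education",
)


def topic_counts_by_month(records):
    """Count texts per topic for each month.

    `records` is an iterable of (iso_date, topic) pairs such as
    ("2024-09-17", "Sports"). This input shape is hypothetical.
    """
    counts = defaultdict(Counter)
    for iso_date, topic in records:
        if topic not in TOPICS:
            raise ValueError(f"unknown topic: {topic!r}")
        month = iso_date[:7]  # keep the "YYYY-MM" prefix of the date
        counts[month][topic] += 1
    return {month: dict(c) for month, c in counts.items()}


# Made-up sample records, for illustration only.
sample = [
    ("2024-08-30", "Sports"),
    ("2024-09-02", "Sports"),
    ("2024-09-02", "Weather"),
    ("2024-09-15", "Sports"),
]
monthly = topic_counts_by_month(sample)
```

Grouping by the `YYYY-MM` prefix keeps the sketch free of date parsing; a real analysis would likely normalise counts by the number of texts per month before comparing periods.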
 Publisher
Jožef Stefan Institute
 
Centre for Language Resources and Technologies, University of Ljubljana
 Acknowledgement
Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 "SLED - Monitor corpus of Slovene and related resources"
University of Ljubljana I0-0022 "Network of Research Infrastructure Centres (MRIC)"
ARRS (Slovenian Research Agency) P6-0411 "Language Resources and Technologies for Slovene"
 Subject(s)
monitor corpus, news corpus, universal dependencies, temporal trends, topic attribution
 Collection(s)
CLARIN.SI data & tools
 
This item is replaced by a newer submission:
http://hdl.handle.net/11356/1981

This platform runs under the software developed for the LINDAT/CLARIAH-CZ repository for linguistics, available on GitHub

CLARIN.SI is supported by the Ministry of Education, Science and Sport of the Republic of Slovenia
under the Programme of "Research Infrastructures".