
Monitor corpus of Slovene Trendi 2023-09

 
CLARIN.SI data & tools
  Authors
Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
  Item identifier
http://hdl.handle.net/11356/1879
 Project URL
https://sled.ijs.si/
 Documented in
http://euralex.org/wp-content/themes/euralex/proceedings/Euralex%202022/EURALEX2022_Pr_p230-239_Kosem.pdf
https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf
https://doi.org/10.4312/slo2.0.2023.1.161-188
 Date of publication
2023-10-23
 Type
corpus, text
 Size
801222976 tokens, 671201830 words, 35579111 sentences, 2040615 texts
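
The size statement can be turned into per-text and per-sentence averages with a few lines of Python. This is only an illustrative sketch: the helper name `parse_size` is hypothetical and not part of any CLARIN.SI tooling, and the figures are copied from the line above.

```python
import re

# Size statement copied verbatim from the record.
SIZE = "801222976 tokens, 671201830 words, 35579111 sentences, 2040615 texts"


def parse_size(statement: str) -> dict[str, int]:
    """Turn '<count> <unit>, ...' into a {unit: count} mapping (hypothetical helper)."""
    return {unit: int(count) for count, unit in re.findall(r"(\d+)\s+(\w+)", statement)}


counts = parse_size(SIZE)

# Derived averages; these are simple ratios of the published totals.
tokens_per_text = counts["tokens"] / counts["texts"]
words_per_sentence = counts["words"] / counts["sentences"]

print(f"tokens per text: {tokens_per_text:.1f}")
print(f"words per sentence: {words_per_sentence:.1f}")
```

The ratios are averages over the whole corpus and say nothing about how text length varies across sources or years.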
 Language(s)
Slovenian
 Description
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2023-09 covers the period from January 2019 to September 2023, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers. Compared with the previous version of the corpus, this version adds texts from March to September 2023, adds topic classification to files that previously lacked it by mistake, and corrects some other minor errors.
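
As a rough illustration of working with the 13 topic categories, the sketch below counts texts per topic and rejects labels outside the listed set. The function name `topic_distribution` and the sample labels are hypothetical; how topics are actually encoded in the corpus metadata is not stated in this record.

```python
from collections import Counter

# The 13 topic labels listed in the description.
TOPICS = (
    "Arts and culture", "Crime and accidents", "Economy", "Environment",
    "Health", "Leisure", "Politics and Law", "Science and Technology",
    "Society", "Sports", "Weather", "Entertainment", "Education",
)


def topic_distribution(labels):
    """Count texts per topic, rejecting labels outside the known set (hypothetical helper)."""
    labels = list(labels)
    unknown = set(labels) - set(TOPICS)
    if unknown:
        raise ValueError(f"unknown topic labels: {sorted(unknown)}")
    counts = Counter(labels)
    # Report every topic, including those with zero texts.
    return {topic: counts.get(topic, 0) for topic in TOPICS}


# Invented labels for five texts, for illustration only.
sample = ["Sports", "Economy", "Sports", "Weather", "Health"]
dist = topic_distribution(sample)
print(dist)
```

Reporting zero counts explicitly keeps distributions from different time slices comparable, which matters when tracking trends in a monitor corpus.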
 Publisher
Jožef Stefan Institute
Centre for Language Resources and Technologies, University of Ljubljana
 Acknowledgement
Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2021-2022 "SLED - Monitor corpus of Slovene and related resources"
University of Ljubljana I0-0022 "Network of Research Infrastructure Centres (MRIC)"
ARRS (Slovenian Research Agency) P6-0411 "Language Resources and Technologies for Slovene"
 Keywords
monitor corpus; news corpus; universal dependencies; temporal trends; topic attribution
 Collections
CLARIN.SI data & tools
 
This item has been superseded by a newer version.
http://hdl.handle.net/11356/1904
