Latest
corpus

Description:
Hand-labeled training (50,000 tweets, each labeled twice) and evaluation (10,000 tweets, each labeled twice) sets for hate speech on Slovenian Twitter. The data files contain tweet IDs, hate speech type, hate speech target, and ...
This entry contains 4 files (5.19 MB).
Publicly Available



corpus

Description:
The COPA-HR dataset (Choice of plausible alternatives in Croatian) is a translation of the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) that follows the XCOPA dataset translation methodology ...
This entry contains 3 files (194.2 KB).
Publicly Available



toolService

Description:
The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for ...
This entry contains 2 files (1.29 GB).
Publicly Available



Most viewed
In the past week
corpus

Description:
The corpus contains 256,567 documents from the Slovenian news portals 24ur, Dnevnik, Finance, Rtvslo, and Žurnal24. These portals contain political, business, economic and financial content. The submission contains 7 files: ...
This entry contains 8 files (616.88 MB).
Publicly Available



corpus

Description:
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.7 billion tokens) written between 2000 and 2018 and gathered from the digital ...
This entry contains 6 files (42.11 GB).
Academic Use



lexicalConceptualResource

Description:
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the ...
This entry contains 12 files (16.27 MB).
Publicly Available
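
The line format described for the MULTEXT-East lexicons (one entry per line, three tab-separated fields) can be read with a few lines of Python. The following is a minimal sketch, not the resource's official tooling: the names `LexiconEntry` and `parse_lexicon` are hypothetical, the sample lines are made-up placeholders, and the third field is kept as an opaque string because the description above is truncated before it names that field.

```python
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    word_form: str    # field 1: the inflected form of the word
    lemma: str        # field 2: the base form of the word
    third_field: str  # field 3: left uninterpreted (assumption: description is truncated)


def parse_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield one entry per non-empty line that has exactly three tab-separated fields."""
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        yield LexiconEntry(*fields)


# Placeholder lines for illustration only; they are not taken from the lexicons.
sample = ["walks\twalk\tX1\n", "walked\twalk\tX2\n"]
entries = list(parse_lexicon(sample))
```

To read a real lexicon file, an open file handle can be passed in place of `sample`, for example `parse_lexicon(open(path, encoding="utf-8"))`, where `path` and the UTF-8 encoding are assumptions to check against the downloaded files.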


