2024-03-29T06:23:19Z
http://www.clarin.si/repository/oai/request
oai:www.clarin.si:11356/1025
2023-07-05T16:58:40Z
hdl_11356_1023
hdl_11356_1024
Reference corpus of historical Slovene goo300k 1.2
2015-05-07T18:23:33Z
http://hdl.handle.net/11356/1025
Erjavec, Tomaž
2015-05-07T18:23:33Z
goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899.
Each text contains extensive meta-data and per-page links to facsimiles, while the word tokens in the texts are annotated with their modernised word-form, lemma, part-of-speech, and, for archaic words, their nearest modern synonyms or short explanation.
The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers. Note that the vertical format does not contain all the information from the source TEI.
http://hdl.handle.net/11356/1025
Jožef Stefan Institute
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
historical language
word modernisation
lemmatisation
part-of-speech tagging
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1026
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Semantic lexicon of Slovene sloWNet 3.1
2015-05-12T09:45:05Z
http://hdl.handle.net/11356/1026
Fišer, Darja
2015-05-12T09:45:05Z
sloWNet is the Slovene WordNet developed using the expand approach: it contains the complete Princeton WordNet 3.0 and over 70,000 Slovene literals. These literals were added automatically using different types of existing resources, such as bilingual dictionaries, parallel corpora and Wikipedia. 33,000 literals were subsequently hand-validated.
http://hdl.handle.net/11356/1026
Faculty of Arts, University of Ljubljana
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
semantic lexicon
synsets
synonyms
wordnet
semantic description
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1038
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
List of Slovenian headwords 1.1
2015-06-12T13:21:48Z
http://hdl.handle.net/11356/1038
Jakopin, Primož
2015-06-12T13:21:48Z
A list of headwords from the collection "Besede slovenskega jezika" (Words of Slovenian Language).
http://hdl.handle.net/11356/1038
ZRC SAZU
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
headwords
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1029
2023-06-20T12:39:18Z
hdl_11356_1023
hdl_11356_1024
Training corpus ssj500k 1.3
2015-05-17T19:14:37Z
http://hdl.handle.net/11356/1029
Krek, Simon
Erjavec, Tomaž
Dobrovoljc, Kaja
Može, Sara
Ledinek, Nina
Holz, Nanika
2015-05-17T19:14:37Z
The ssj500k training corpus is based on two training corpora built within the JOS project (https://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from the jos1M corpus, forming a training corpus of 500,000 words, manually checked and annotated on the levels of tokenization, segmentation, morphosyntactic tagging, syntactic dependency parsing and named entities. The ssj500k corpus uses the JOS morphosyntactic tagset with 1,902 tags and dependencies with 10 labels. The part of the corpus annotated with dependency relations contains 11,411 sentences; named entities are annotated in the original jos100k part of the corpus.
http://hdl.handle.net/11356/1029
http://hdl.handle.net/11356/1052
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
tagging
dependency treebank
parsing
named entities
tokenisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1033
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Morphological lexicon Sloleks 1.0
2015-05-26T21:06:33Z
http://hdl.handle.net/11356/1033
Dobrovoljc, Kaja
Krek, Simon
Holozan, Peter
Erjavec, Tomaž
Romih, Miro
2015-05-26T21:06:33Z
Sloleks is the reference morphological lexicon for the Slovenian language, developed for use in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains approx. 100,000 of the most frequent Slovenian lemmas, their inflected or derivative word forms and the corresponding grammatical description. Lemmatization rules, part-of-speech categorization and the set of feature-value pairs follow the JOS morphosyntactic specifications. In addition to grammatical information, each word form is also given information on its absolute corpus frequency and its compliance with the reference language standard.
http://hdl.handle.net/11356/1033
http://hdl.handle.net/11356/1039
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
morphology
lexicon
inflection
word forms
derivation
language standardization
LMF
lemmatisation
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1031
2023-07-05T16:58:40Z
hdl_11356_1023
hdl_11356_1024
Digital library and corpus of historical Slovene IMP 1.1
2015-05-22T14:09:23Z
http://hdl.handle.net/11356/1031
Erjavec, Tomaž
2015-05-22T14:09:23Z
The IMP digital library contains historical Slovene books and other publications, 658 texts in total with over 45,000 pages from the period 1584-1919. Each text contains extensive meta-data, per-page links to facsimiles, and hand-corrected transcriptions with structural and editorial annotations.
These texts were annotated to be used as a language corpus. In the corpus each word is marked up with its modernised form, lemma, and morphosyntactic description (fine-grained PoS tag). Note that the annotations are automatic, so they contain a fair number of errors.
The digital library is available in source TEI P5 XML and derived HTML. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers, e.g. CWB and Sketch Engine. Note that the vertical format does not contain all the information from the source TEI.
http://hdl.handle.net/11356/1031
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
historical language
word modernisation
lemmatisation
digital library
TEI
corpus
Text
oai:www.clarin.si:11356/1043
2023-01-28T14:15:27Z
hdl_11356_1023
hdl_11356_1024
MULTEXT-East "1984" annotated corpus 4.0
2015-06-15T08:51:55Z
http://hdl.handle.net/11356/1043
Erjavec, Tomaž
Barbu, Ana-Maria
Derzhanski, Ivan
Dimitrova, Ludmila
Garabík, Radovan
Ide, Nancy
Kaalep, Heiki-Jaan
Kotsyba, Natalia
Krstev, Cvetana
Oravecz, Csaba
Petkevič, Vladimír
Priest-Dorman, Greg
QasemiZadeh, Behrang
Radziszewski, Adam
Simov, Kiril
Tufiş, Dan
Zdravkova, Katerina
2015-06-15T08:51:55Z
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.
This version of the corpus contains the linguistically annotated texts, with each word tagged by its lemma and its MULTEXT(-East) morphosyntactic description (MSD, i.e., a fine-grained feature-structure based PoS tag).
The structurally annotated texts are a separate submission (http://hdl.handle.net/11356/1044), which covers a somewhat different set of languages.
http://hdl.handle.net/11356/1043
Jožef Stefan Institute
MULTEXT-East licence
https://nl.ijs.si/ME/mte-licence.txt
parallel corpus
part-of-speech tagging
multilingual
Slavic languages
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1030
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Slovene lexical database 1.0
2015-05-20T19:45:40Z
http://hdl.handle.net/11356/1030
Gantar, Polona
Krek, Simon
Kosem, Iztok
Šorli, Mojca
Kocjančič, Polonca
Grabnar, Katja
Yerošina, Olga
Zaranšek, Petra
Drstvenšek, Nina
2015-05-20T19:45:40Z
The Slovene Lexical Database was created between 2008 and 2012 and represents a comprehensive syntactic and semantic description of a selected set of Slovene words. The description was based exclusively on the analysis of reference corpora of Slovene.
The database is structured as a network of interrelated semantic and syntactic information about a particular word. The semantic level is the top level of the hierarchy, with the lexical unit as its core element. This includes all senses of the headword, multi-word expressions and phraseological units. Each sense is described with a short semantic indicator and/or a whole-sentence definition, which includes the typical syntactic environment of the headword with the relevant number, form and semantic types in a valency frame (semantic frame). These are also reflected in a number of syntactic structures and corresponding collocations. All the higher types of information are confirmed by a selection of corpus examples. Multi-word expressions and phraseological units are treated independently of particular senses of the headword and have their own internal structure, which requires the same types of information as single-word entries or senses.
http://hdl.handle.net/11356/1030
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
lexical database
semantic description
syntactic description
collocations
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1250
2024-03-27T16:48:01Z
hdl_11356_1023
hdl_11356_1024
Collocations Dictionary of Modern Slovene KSSS 1.0
2019-09-21T08:59:51Z
http://hdl.handle.net/11356/1250
Kosem, Iztok
Gantar, Polona
Krek, Simon
Arhar Holdt, Špela
Čibej, Jaka
Laskowski, Cyprian
Pori, Eva
Klemenc, Bojan
Dobrovoljc, Kaja
Gorjanc, Vojko
Ljubešić, Nikola
2019-09-21T08:59:51Z
The database of the Collocations Dictionary of Modern Slovene 1.0 contains entries for 35,862 headwords (18,043 nouns, 5,148 verbs, 10,259 adjectives and 2,412 adverbs) and 7,310,983 collocations that were automatically extracted from the Gigafida 1.0 corpus. For the automatic extraction via the Sketch Engine API we used a specially adapted Sketch grammar for Slovene, and, based on manual evaluation, a set of parameters that determined: maximum number of collocates per grammatical relation, minimum frequency of a collocate, minimum frequency of a grammatical relation, minimum salience (logDice) score of a collocate, and minimum salience of a grammatical relation.
The procedure of automatic extraction, which produced a list of collocates (lemmas) in a particular relation, was followed by a set of post-processing steps:
- removal of collocations that were represented by repetitions of the same sentence
- preparation of full collocations by the addition of the headword, and, if needed, the third element in the grammatical relation (such as preposition). The headwords/collocates were also put in the correct case, depending on the grammatical relation.
- addition of IDs from the Slovenian morphological lexicon Sloleks (http://hdl.handle.net/11356/1230) to every element in the collocation.
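The threshold-based selection of collocates described above can be sketched as follows. This is only an illustration, not the procedure actually used for the dictionary: the logDice formula is the commonly cited one (14 + log2 of the Dice coefficient), while the `Candidate` fields, the function names and all threshold values are assumptions made for the example.

```python
import math
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    collocate: str
    f_xy: int  # frequency of headword and collocate together in the relation
    f_x: int   # frequency of the headword in the relation
    f_y: int   # frequency of the collocate in the relation


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice salience: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def filter_collocates(
    candidates: Iterable[Candidate],
    min_freq: int,
    min_log_dice: float,
    max_collocates: int,
) -> List[Candidate]:
    """Keep frequent, salient collocates, best first, up to a fixed maximum."""
    kept = [
        c for c in candidates
        if c.f_xy >= min_freq and log_dice(c.f_xy, c.f_x, c.f_y) >= min_log_dice
    ]
    kept.sort(key=lambda c: log_dice(c.f_xy, c.f_x, c.f_y), reverse=True)
    return kept[:max_collocates]


# Invented counts and thresholds, for illustration only.
candidates = [
    Candidate("alpha", 10, 30, 10),
    Candidate("beta", 1, 1, 1),         # salient but too rare
    Candidate("gamma", 8, 1000, 1000),  # frequent but not salient
]
selected = filter_collocates(candidates, min_freq=5, min_log_dice=8.0, max_collocates=10)
```

The thresholds on grammatical relations mentioned in the description (minimum frequency and minimum salience of a relation) would need an analogous filter one level up and are not covered by this sketch.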
http://hdl.handle.net/11356/1250
http://hdl.handle.net/11356/1933
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
collocations
dictionary
syntactic structures
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1032
2023-06-20T12:36:03Z
hdl_11356_1023
hdl_11356_1024
Lexicon of historical Slovene imp25k 1.1
2015-05-25T19:41:04Z
http://hdl.handle.net/11356/1032
Erjavec, Tomaž
2015-05-25T19:41:04Z
The imp25k lexicon of historical Slovene was created automatically from the goo300k and foo3M annotated corpora and contains attested and manually verified word forms and their annotations with examples of use. A lexicon entry contains the modern lemma with its part-of-speech and, for archaic words, its gloss (closest modern equivalent(s) or short explanation of their meaning). The lemma is followed by its modern word forms from the corpus (i.e. the complete paradigm of the lemma is not given), and each of these has all its attested historical word forms with examples of usage.
The lexicon is available in source TEI P5 XML and in the much smaller and simpler derived tabular format, which does not contain usage examples. In the latter, multi-word units are joined with the underscore. The 1st column is the word form, the 2nd its modern equivalent, the 3rd its modern lemma, the 4th its PoS tag from the IMP morphosyntactic specification, and the 5th (where present) the gloss, e.g.:
ako_ravno<TAB>akoravno<TAB>akoravno<TAB>C<TAB>čeprav<LF>
or
ak-li<TAB>ako_li<TAB>ako_li<TAB>C_Q<TAB><LF>
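A minimal sketch of reading this tabular format is given below, assuming that `<TAB>` and `<LF>` in the examples stand for a literal tab and line feed. The `LexiconRow` type and `parse_row` function are names invented for this illustration and are not part of the distributed resource.

```python
from typing import NamedTuple, Optional


class LexiconRow(NamedTuple):
    word_form: str        # 1st column: word form, multi-word units joined with "_"
    modern_form: str      # 2nd column: modern equivalent
    lemma: str            # 3rd column: modern lemma
    pos: str              # 4th column: PoS tag (IMP morphosyntactic specification)
    gloss: Optional[str]  # 5th column: gloss, None when empty or absent


def parse_row(line: str) -> LexiconRow:
    """Split one tab-separated line into its five columns."""
    fields = line.rstrip("\n").split("\t")
    # Pad to five columns in case the empty trailing gloss field is missing.
    fields += [""] * (5 - len(fields))
    word_form, modern_form, lemma, pos, gloss = fields[:5]
    return LexiconRow(word_form, modern_form, lemma, pos, gloss or None)


# The two example lines quoted in the description.
rows = [
    parse_row("ako_ravno\takoravno\takoravno\tC\tčeprav\n"),
    parse_row("ak-li\tako_li\tako_li\tC_Q\t\n"),
]
```

Mapping an empty gloss to `None` is a design choice of the sketch; it makes the "where present" condition easy to test in calling code.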
http://hdl.handle.net/11356/1032
Jožef Stefan Institute
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
historical language
modernisation
lemmatisation
TEI
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1034
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Written corpus ccKres 1.0
2015-06-01T08:57:14Z
http://hdl.handle.net/11356/1034
Logar, Nataša
Erjavec, Tomaž
Krek, Simon
Grčar, Miha
Holozan, Peter
2015-06-01T08:57:14Z
Corpus ccKres consists of 9,376 documents, each containing information about the source (e.g. newspapers, magazines), year of publication, text type (fiction, newspaper), the title and author if they are known. The corpus is POS-tagged and lemmatised, and encoded in XML TEI format (Text Encoding Initiative P5). The ccKres corpus contains approximately 9% of the Kres corpus, a balanced corpus of Slovene: http://eng.slovenscina.eu/korpusi/kres.
http://hdl.handle.net/11356/1034
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
TEI
corpus
Text
oai:www.clarin.si:11356/1035
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Written corpus ccGigafida 1.0
2015-06-01T09:01:03Z
http://hdl.handle.net/11356/1035
Logar, Nataša
Erjavec, Tomaž
Krek, Simon
Grčar, Miha
Holozan, Peter
2015-06-01T09:01:03Z
Corpus ccGigafida consists of paragraph samples from 31,722 documents, each containing information about the source (e.g. newspapers, magazines), year of publication, text type (fiction, newspaper), the title and author if they are known. The corpus is annotated with morphosyntactic descriptions (PoS-tagged) and lemmatised. It is encoded in XML TEI format (Text Encoding Initiative P5). The ccGigafida corpus contains approximately 9% of the Gigafida corpus, a reference corpus of Slovene: http://eng.slovenscina.eu/korpusi/gigafida.
The corpus is available in source TEI-like XML and in the simpler and smaller vertical format, used by various concordancers. The XML file has PoS (MSD) tags in Slovenian only, while the vertical file has tags both in Slovenian and English. The corpus is also available as plain text, one file per text.
http://hdl.handle.net/11356/1035
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
TEI
corpus
Text
oai:www.clarin.si:11356/1036
2023-07-05T16:58:41Z
hdl_11356_1023
hdl_11356_1024
Learners' corpus Šolar 1.0
2015-06-01T09:05:28Z
http://hdl.handle.net/11356/1036
Rozman, Tadeja
Stritar Kučuk, Mojca
Kosem, Iztok
Krek, Simon
Krapš Vodopivec, Irena
Arhar Holdt, Špela
Stabej, Marko
2015-06-01T09:05:28Z
Šolar consists of 2,703 texts written by students in Slovene secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (13-15), with a small percentage also from the 6th grade. School essays form the majority of the corpus (64.2%) while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications etc. Part of the corpus is annotated with teachers' corrections using a custom system of labels.
http://hdl.handle.net/11356/1036
http://hdl.handle.net/11356/1214
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
developmental corpus
error annotation
student writing
corpus
Text
oai:www.clarin.si:11356/1037
2023-06-20T12:35:26Z
hdl_11356_1023
hdl_11356_1024
Training corpus jos1M 1.1
2015-06-06T22:24:21Z
http://hdl.handle.net/11356/1037
Erjavec, Tomaž
Krek, Simon
2015-06-06T22:24:21Z
The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated for morphosyntactic descriptions (fine grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated.
The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers. Note that the vertical format does not contain all of the information from the source TEI.
http://hdl.handle.net/11356/1037
http://hdl.handle.net/11356/1213
Jožef Stefan Institute
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1044
2020-11-08T10:09:51Z
hdl_11356_1023
hdl_11356_1024
MULTEXT-East "1984" document corpus 4.0
2015-06-15T08:56:08Z
http://hdl.handle.net/11356/1044
Erjavec, Tomaž
Bruda, Ştefan
Dimitrova, Ludmila
Ide, Nancy
Kaalep, Heiki-Jaan
Krstev, Cvetana
Orav, Heili
Oravecz, Csaba
Paldre, Leho
Petkevič, Vladimír
Priest-Dorman, Greg
Simov, Kiril
Sinapova, Lydia
Sokolovsky, Paul
Sryvkin, Sergey
Tufiş, Dan
Utka, Andrius
Villandi, Viire
Vitas, Duško
Vuković, Olga
2015-06-15T08:56:08Z
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.
This version of the corpus contains structurally annotated texts only, which contain elements such as the paragraph, the footnote, and highlighted text. In terms of linguistic annotations, the texts contain names and sentences.
The linguistically annotated texts are a separate submission (http://hdl.handle.net/11356/1043), which covers a somewhat different set of languages.
http://hdl.handle.net/11356/1044
Jožef Stefan Institute
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
parallel corpus
multilingual
TEI
corpus
Text
oai:www.clarin.si:11356/1041
2023-06-20T12:37:55Z
hdl_11356_1023
hdl_11356_1024
MULTEXT-East free lexicons 4.0
2015-06-15T08:46:04Z
http://hdl.handle.net/11356/1041
Erjavec, Tomaž
Bruda, Ştefan
Derzhanski, Ivan
Dimitrova, Ludmila
Garabík, Radovan
Holozan, Peter
Ide, Nancy
Kaalep, Heiki-Jaan
Kotsyba, Natalia
Oravecz, Csaba
Petkevič, Vladimír
Priest-Dorman, Greg
Shevchenko, Igor
Simov, Kiril
Sinapova, Lydia
Steenwijk, Han
Tihanyi, Laszlo
Tufiş, Dan
Véronis, Jean
2015-06-15T08:46:04Z
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications.
This submission contains the freely available MULTEXT-East lexicons, while a separate submission (http://hdl.handle.net/11356/1042) gives those that are available only for non-commercial use.
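The three-field layout described above can be read with a few lines of code. The following is a sketch under stated assumptions: the function names are invented for this example, and the sample entries use placeholder tags rather than real MSDs from the MULTEXT-East specifications.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def read_entries(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (word_form, lemma, msd) triples from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word_form, lemma, msd = line.split("\t")
        yield word_form, lemma, msd


def index_by_form(lines: Iterable[str]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each word-form to the set of (lemma, MSD) analyses listed for it."""
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for word_form, lemma, msd in read_entries(lines):
        index[word_form].add((lemma, msd))
    return index


# Hypothetical entries with placeholder tags, for illustration only.
sample = [
    "walks\twalk\tVERB-TAG\n",
    "walks\twalk\tNOUN-TAG\n",
    "walked\twalk\tVERB-TAG\n",
]
lexicon = index_by_form(sample)
```

Indexing by word-form keeps all analyses of an ambiguous form together, which is the usual starting point when such a lexicon is used to constrain a tagger.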
http://hdl.handle.net/11356/1041
Jožef Stefan Institute
http://hdl.handle.net/11372/LRT-675
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
lemmatisation
inflection
part-of-speech tagging
multilingual
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1042
2023-01-28T14:15:27Z
hdl_11356_1023
hdl_11356_1024
MULTEXT-East non-commercial lexicons 4.0
2015-06-15T08:50:05Z
http://hdl.handle.net/11356/1042
Erjavec, Tomaž
Derzhanski, Ivan
Divjak, Dagmar
Feldman, Anna
Kopotev, Mikhail
Kotsyba, Natalia
Krstev, Cvetana
Petrovski, Aleksandar
QasemiZadeh, Behrang
Radziszewski, Adam
Sharoff, Serge
Sokolovsky, Paul
Vitas, Duško
Zdravkova, Katerina
2015-06-15T08:50:05Z
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications.
This submission contains the non-commercial MULTEXT-East lexicons, while a separate submission (http://hdl.handle.net/11356/1041) gives those that are freely available.
http://hdl.handle.net/11356/1042
Jožef Stefan Institute
http://hdl.handle.net/11372/LRT-675
MULTEXT-East licence
https://nl.ijs.si/ME/mte-licence.txt
lemmatisation
inflection
part-of-speech tagging
multilingual
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1039
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Morphological lexicon Sloleks 1.2
2015-06-14T07:49:46Z
http://hdl.handle.net/11356/1039
Dobrovoljc, Kaja
Krek, Simon
Holozan, Peter
Erjavec, Tomaž
Romih, Miro
2015-06-14T07:49:46Z
Sloleks is the reference morphological lexicon for the Slovenian language, developed for use in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains approx. 100,000 of the most frequent Slovenian lemmas, their inflected or derivative word forms and the corresponding grammatical description. Lemmatization rules, part-of-speech categorization and the set of feature-value pairs follow the JOS morphosyntactic specifications. In addition to grammatical information, each word form is also given information on its absolute corpus frequency and its compliance with the reference language standard.
Note that this entry updates Sloleks 1.0 by fixing various encoding and content errors.
The resource is further described in:
Kaja Dobrovoljc, Simon Krek and Tomaž Erjavec, 2017: The Sloleks Morphological Lexicon and its Future Development. In (Vojko Gorjanc, Polona Gantar, Iztok Kosem and Simon Krek, eds.): Dictionary of Modern Slovene: Problems and Solutions. Ljubljana University Press, Faculty of Arts. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/download/2/1/47-1
http://hdl.handle.net/11356/1039
http://hdl.handle.net/11356/1230
Centre for Language Resources and Technologies, University of Ljubljana
http://hdl.handle.net/11356/1033
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
morphology
inflection
word forms
derivation
LMF
lemmatisation
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1040
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Spoken corpus Gos 1.0
2015-06-14T09:17:17Z
http://hdl.handle.net/11356/1040
Zwitter Vitez, Ana
Zemljarič Miklavčič, Jana
Krek, Simon
Stabej, Marko
Erjavec, Tomaž
2015-06-14T09:17:17Z
GOS is a corpus of spoken Slovene that includes the transcripts of approximately 120 hours of speech recorded in various situations: radio and TV shows, school lessons and lectures, private conversations between friends or within the family, work meetings, consultations, conversations in buying and selling situations, etc. All speech is transcribed in two versions – with pronunciation-based spelling and with standardized spelling – and it comprises over one million words. The corpus can be searched by means of the web concordancer where it is also possible to listen to the corresponding recordings: http://www.korpus-gos.net.
http://hdl.handle.net/11356/1040
http://hdl.handle.net/11356/1438
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
speech transcription
spoken corpus
TEI
corpus
Text
oai:www.clarin.si:11356/1047
2024-03-13T18:37:08Z
hdl_11356_1023
hdl_11356_1024
Japanese web corpus with difficulty levels jpWaC-L 1.0
2015-08-05T12:53:31Z
http://hdl.handle.net/11356/1047
Erjavec, Tomaž
Hmeljak Sangawa, Kristina
Kawamura, Yoshiko
2015-08-05T12:53:31Z
The corpus contains over 300 million words, with annotations of words and sentences describing their difficulty levels. Words are assigned levels of difficulty according to the Japanese Language Proficiency Test Content Specifications (2004). The difficulty level of the sentences is computed using various heuristics, based on the (difficulty level of) words, sentence length, etc. We distinguish 5 difficulty levels, from L0 (very difficult) to L4 (very easy).
The corpus was collected from the Web using WaCkY tools, part-of-speech tagged and lemmatised with Chasen. The Japanese Chasen tags have also been converted to English language based tags.
The corpora are made available in vertical format. Structural attributes are <text> and <s> (sentence). Each text gives its @url and @domain. Sentences have the @level attribute, which describes their difficulty level. The positional attributes are:
1. token, as it appears in the text
2. lemma of the word
3. Chasen tag, translated to English
4. original Chasen tag in Japanese
5. difficulty level of the word.
The complete corpus is also split into sub-corpora of sentences with the same difficulty level.
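The vertical layout described above (structural tags on their own lines, one token per line with tab-separated positional attributes) can be read roughly as follows. This is a sketch, not a tool shipped with the corpus: the `sentences` function is a name chosen here, and the sample data uses placeholder tokens, tags and level values, since the exact value format of the word-level attribute is an assumption.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Matches name="value" pairs inside a structural tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[Dict[str, str], str, List[Tuple[str, ...]]]]:
    """Yield (text attributes, sentence level, token rows) for each sentence."""
    text_attrs: Dict[str, str] = {}
    level = ""
    tokens: List[Tuple[str, ...]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if "\t" in line:
            # Token line: token, lemma, English tag, Japanese tag, word level.
            fields = line.split("\t")
            if len(fields) == 5:
                tokens.append(tuple(fields))
        elif line.startswith("<text"):
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("<s"):
            level = dict(ATTR_RE.findall(line)).get("level", "")
            tokens = []
        elif line == "</s>":
            yield text_attrs, level, tokens


# Placeholder data, for illustration only.
sample = [
    '<text url="http://example.org/page" domain="example.org">\n',
    '<s level="L3">\n',
    "tok1\tlemma1\tTAG-EN\tTAG-JA\tL4\n",
    "tok2\tlemma2\tTAG-EN\tTAG-JA\tL2\n",
    "</s>\n",
    "</text>\n",
]
parsed = list(sentences(sample))
```

Treating any line that contains a tab as a token line avoids misreading a token that happens to begin with `<` as a structural tag.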
http://hdl.handle.net/11356/1047
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
difficulty level
teaching corpus
TEI
corpus
Text
oai:www.clarin.si:11356/1046
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Gos corpus n-grams 1.0
2015-08-01T13:55:47Z
http://hdl.handle.net/11356/1046
Dobrovoljc, Kaja
2015-08-01T13:55:47Z
This is a collection of n-grams extracted from the Gos corpus of spoken Slovene (http://hdl.handle.net/11356/1040). In addition to the separate lists of n-grams for tokens and their attributes (normalized form, morphosyntactic tag, lemma), an adjusted frequency list with statistical substring reduction has also been added (as described in O'Donnell 2011). Only n-grams within sentences have been counted.
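Counting n-grams only within sentences means that no n-gram spans a sentence boundary. A minimal sketch of that restriction is shown below; the function name and the toy input are invented for illustration, and the statistical substring reduction used for the adjusted frequency list is not implemented here.

```python
from collections import Counter
from typing import Iterable, List


def count_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Count n-grams separately in each sentence, never across a boundary."""
    counts: Counter = Counter()
    for sent in sentences:
        # Sentences shorter than n contribute nothing.
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts


# Toy input: two tokenised sentences.
sents = [["a", "b", "c"], ["c", "a", "b"]]
bigrams = count_ngrams(sents, 2)
```

The same function works for any attribute of the tokens (for example lemmas or tags) if the sentences are given as lists of that attribute instead of the surface tokens.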
http://hdl.handle.net/11356/1046
http://hdl.handle.net/11356/1195
Trojina, Institute for Applied Slovene Studies
Faculty of Arts, University of Ljubljana
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
n-grams
wordlist
multiword expressions
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1045
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
KRES corpus n-grams 1.0
2015-07-23T07:41:49Z
http://hdl.handle.net/11356/1045
Dobrovoljc, Kaja
2015-07-23T07:41:49Z
This is a collection of n-grams extracted from the KRES corpus of written Slovene. In addition to the separate lists of n-grams for tokens and their attributes (morphosyntactic tag, lemma), an adjusted frequency list with statistical substring reduction has also been added (as described in O'Donnell 2011). Only n-grams within sentences have been counted.
http://hdl.handle.net/11356/1045
http://hdl.handle.net/11356/1193
Trojina, Institute for Applied Slovene Studies
Faculty of Arts, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
n-grams
wordlist
multiword expressions
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1053
2023-03-27T17:01:17Z
hdl_11356_1023
hdl_11356_1024
IMP corpus n-grams 1.0
2016-02-16T09:39:41Z
http://hdl.handle.net/11356/1053
Dobrovoljc, Kaja
2016-02-16T09:39:41Z
This is a collection of n-grams extracted from the IMP corpus of historical Slovene (http://hdl.handle.net/11356/1031). In addition to the separate lists of n-grams for tokens and their attributes (modernised form, morphosyntactic tag, lemma), an adjusted frequency list with statistical substring reduction has also been added (as described in O'Donnell 2011). Only n-grams within sentences have been counted.
http://hdl.handle.net/11356/1053
http://hdl.handle.net/11356/1194
Trojina, Institute for Applied Slovene Studies
Faculty of Arts, University of Ljubljana
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
n-grams
wordlist
multiword expressions
historical language
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1048
2023-03-27T17:01:17Z
hdl_11356_1023
hdl_11356_1024
Emoji Sentiment Ranking 1.0
2016-04-14T23:00:09Z
http://hdl.handle.net/11356/1048
Kralj Novak, Petra
Smailović, Jasmina
Sluban, Borut
Mozetič, Igor
2015-09-15T17:38:49Z
A lexicon of 751 emoji characters with automatically assigned sentiment. The sentiment is computed from 70,000 tweets, labeled by 83 human annotators in 13 European languages.
The process and analysis of emoji sentiment ranking is described in the paper: Kralj Novak P, Smailović J, Sluban B, Mozetič I (2015) Sentiment of Emojis. PLoS ONE 10(12): e0144296. doi:10.1371/journal.pone.0144296
http://hdl.handle.net/11356/1048
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
sentiment classification
emojis
Unicode
multilingual
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1051
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
SNABI database for continuous speech recognition 1.2
2016-02-10T08:25:24Z
http://hdl.handle.net/11356/1051
Kačič, Zdravko
Horvat, Bogomir
Zögling Markuš, Aleksandra
Veronik, Robert
Rojc, Matej
Žgank, Andrej
Sepesy Maučec, Mirjam
Rotovnik, Tomaž
2016-02-10T08:25:24Z
The SNABI speech database can be used to train continuous speech recognition for the Slovene language. The database comprises 1,530 sentences, 150 words and the alphabet. 132 speakers were recorded, each reading 200 sentences or more. This resulted in more than 15,000 speech recordings in the database. The recordings were made in a studio (SNABI SI_SSQ) and through a telephone line (SNABI SI_SFN).
http://hdl.handle.net/11356/1051
Faculty of Electrical Engineering and Computer Science, University of Maribor
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
speech recognition
speech database
speech recordings
spoken corpus
corpus
Text
oai:www.clarin.si:11356/1049
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Tourism English-Croatian Parallel Corpus 2.0
2016-01-29T14:27:32Z
http://hdl.handle.net/11356/1049
Toral, Antonio
Esplà-Gomis, Miquel
Klubička, Filip
Ljubešić, Nikola
Papavassiliou, Vassilis
Prokopidis, Prokopis
Rubino, Raphael
Way, Andy
2016-01-29T14:27:32Z
A sentence-aligned parallel corpus built by automatically crawling 25 websites from the tourism domain.
http://hdl.handle.net/11356/1049
Abu-MaTran project
CLARIN.SI User Licence for Internet Corpora
https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
parallel corpus
tourism
multilingual
corpus
Text
oai:www.clarin.si:11356/1050
2023-03-27T17:01:16Z
hdl_11356_1023
hdl_11356_1024
Japanese-Slovene learner's dictionary jaSlo 3.1
2016-02-01T11:58:57Z
http://hdl.handle.net/11356/1050
Hmeljak, Kristina
Erjavec, Tomaž
Srdanović, Irena
2016-02-01T11:58:57Z
The jaSlo dictionary is primarily intended for Slovene students learning Japanese. For each entry, it contains the Japanese headword (kanji, hiragana or katakana, and romaji), its part-of-speech and difficulty level, Slovene language gloss, Slovene translation equivalents, and translated examples of use.
http://hdl.handle.net/11356/1050
Faculty of Arts, University of Ljubljana
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
dictionary
difficulty level
TEI
multilingual
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10522023-03-27T17:01:17Zhdl_11356_1023hdl_11356_1024
Training corpus ssj500k 1.4
2016-02-13T13:44:11Z
http://hdl.handle.net/11356/1052
Krek, Simon
Dobrovoljc, Kaja
Erjavec, Tomaž
Može, Sara
Ledinek, Nina
Holz, Nanika
2016-02-13T13:44:11Z
The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities, and, partially, syntactic dependencies. The ssj500k corpus uses the MULTEXT-East / JOS morphosyntactic tagset and the JOS dependency schema and is based on the jos100k and jos1M corpora. Note that this entry updates ssj500k 1.3 by fixing many annotation errors.
http://hdl.handle.net/11356/1052
http://hdl.handle.net/11356/1165
Centre for Language Resources and Technologies, University of Ljubljana
http://hdl.handle.net/11356/1029
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
tagging
dependency treebank
parsing
named entities
tokenisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/10542023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Twitter sentiment for 15 European languages
2016-04-25T21:45:18Z
http://hdl.handle.net/11356/1054
Mozetič, Igor
Grčar, Miha
Smailović, Jasmina
2016-02-23T10:08:53Z
The dataset contains over 1.6 million tweets (tweet IDs), labeled with sentiment by human annotators.
There are 15 Twitter corpora for the corresponding 15 European languages. The data can be used to train and evaluate Twitter sentiment classifiers, to compute annotator agreement, or to study the differences between language usage on Twitter.
The data analysis is described in the following papers:
I. Mozetič, M. Grčar, J. Smailović. Multilingual Twitter sentiment classification: The role of human annotators, PLoS ONE 11(5): e0155036, doi: 10.1371/journal.pone.0155036, 2016.
(http://dx.doi.org/10.1371/journal.pone.0155036)
I. Mozetič, L. Torgo, V. Cerqueira, J. Smailović. How to evaluate sentiment classifiers for Twitter time-ordered data?, PLoS ONE 13(3): e0194317, doi: 10.1371/journal.pone.0194317, 2018.
(https://dx.doi.org/10.1371/journal.pone.0194317)
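As a rough illustration of the annotator-agreement use case mentioned above, the following sketch computes Cohen's kappa for tweets that were labelled twice. Cohen's kappa is chosen here only as a familiar agreement measure; it is not claimed to be the measure used in the cited papers. The label names and the in-memory list of label pairs are assumptions made for illustration, since the file layout of the dataset is not described in this record.

```python
from collections import Counter


def cohen_kappa(pairs):
    """Cohen's kappa for a list of (label_a, label_b) annotation pairs."""
    n = len(pairs)
    if n == 0:
        raise ValueError("no annotation pairs given")
    # Share of items on which both annotations agree.
    observed = sum(1 for a, b in pairs if a == b) / n
    # Agreement expected by chance from the two label distributions.
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    expected = sum(first[label] * second[label] for label in first) / (n * n)
    if expected == 1.0:
        # Both annotations use a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical doubly annotated tweets (labels invented for illustration).
pairs = [
    ("positive", "positive"),
    ("negative", "negative"),
    ("neutral", "positive"),
    ("neutral", "neutral"),
]
kappa = cohen_kappa(pairs)
```

The same function can be applied to pairs from two different annotators (inter-annotator agreement) or to repeated labels from one annotator (self-agreement).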
http://hdl.handle.net/11356/1054
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
sentiment classification
Twitter
inter-annotator agreement
annotator self-agreement
multilingual
corpus
Text
oai:www.clarin.si:11356/10562023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Inflectional lexicon hrLex 1.0
2016-03-05T19:59:52Z
http://hdl.handle.net/11356/1056
Ljubešić, Nikola
Klubička, Filip
2016-03-05T19:59:52Z
hrLex is a large inflectional lexicon of the Croatian language where each entry consists of a (wordform, lemma, MSD) triple. The MSD tagset follows the revised MULTEXT-East V4 tagset for Croatian and Serbian, available at
https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping.
http://hdl.handle.net/11356/1056
http://hdl.handle.net/11356/1067
Faculty of Humanities and Social Sciences, University of Zagreb
GNU General Public Licence, version 3
https://opensource.org/licenses/GPL-3.0
lexicon
morphology
inflection
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10572023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Inflectional lexicon srLex 1.0
2016-03-05T20:00:37Z
http://hdl.handle.net/11356/1057
Ljubešić, Nikola
Klubička, Filip
2016-03-05T20:00:37Z
srLex is a large inflectional lexicon of the Serbian language where each entry consists of a (wordform, lemma, MSD) triple. The MSD tagset follows the revised MULTEXT-East V4 tagset for Croatian and Serbian, available at
https://github.com/ffnlp/sethr/blob/master/mte4r-upos.mapping.
http://hdl.handle.net/11356/1057
http://hdl.handle.net/11356/1066
Faculty of Humanities and Social Sciences, University of Zagreb
GNU General Public Licence, version 3
https://opensource.org/licenses/GPL-3.0
lexicon
morphology
inflection
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10552023-06-20T12:40:04Zhdl_11356_1023hdl_11356_1024
Corpus of comma placement Vejica 1.0
2016-03-03T17:32:46Z
http://hdl.handle.net/11356/1055
Holozan, Peter
2016-03-03T17:32:46Z
A collection of sentences demonstrating and correcting comma usage.
The sentences come from four sources:
- KUST: a Slovene learner corpus, https://nl.ijs.si/isjt06/proc/26_Stritar.pdf
- Šolar: a corpus of student writing, http://www.slovenscina.eu/korpusi/solar
- Lektor: a corpus of proof-reading corrections, http://www.slovenscina.eu/korpusi/lektor
- Wikipedija: https://sl.wikipedia.org/wiki/Glavna_stran
For Lektor, the comma corrections of proof-readers were used. For other texts, the comma errors were manually marked by Peter Holozan.
http://hdl.handle.net/11356/1055
http://hdl.handle.net/11356/1185
Amebis, d. o. o., Kamnik
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
comma placement
error annotation
manual annotation
corpus
Text
oai:www.clarin.si:11356/10582023-03-27T17:01:17Zhdl_11356_1023hdl_11356_1024
Croatian-English parallel corpus hrenWaC 2.0
2016-03-09T16:47:40Z
http://hdl.handle.net/11356/1058
Ljubešić, Nikola
Esplà-Gomis, Miquel
Ortiz Rojas, Sergio
Klubička, Filip
Toral, Antonio
2016-03-09T16:47:40Z
The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext on the segment level is around 80% and on the word level around 84%.
http://hdl.handle.net/11356/1058
Jožef Stefan Institute
CLARIN.SI User Licence for Internet Corpora
https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
parallel corpus
web corpus
multilingual
corpus
Text
oai:www.clarin.si:11356/10592023-03-27T17:01:17Zhdl_11356_1023hdl_11356_1024
Serbian-English parallel corpus srenWaC 1.0
2016-03-09T16:51:44Z
http://hdl.handle.net/11356/1059
Ljubešić, Nikola
Esplà-Gomis, Miquel
Ortiz Rojas, Sergio
Klubička, Filip
Toral, Antonio
2016-03-09T16:51:44Z
The srenWaC corpus consists of sentence-aligned parallel Serbian-English texts crawled from the .rs top-level domain for Serbia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext, given the evaluation results on other languages, can be estimated at 74% on the sentence level and 76% on the word level.
http://hdl.handle.net/11356/1059
Jožef Stefan Institute
CLARIN.SI User Licence for Internet Corpora
https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
parallel corpus
web corpus
multilingual
corpus
Text
oai:www.clarin.si:11356/10602023-03-27T17:01:17Zhdl_11356_1023hdl_11356_1024
Finnish-English parallel corpus fienWaC 1.0
2016-03-09T17:05:19Z
http://hdl.handle.net/11356/1060
Ljubešić, Nikola
Esplà-Gomis, Miquel
Ortiz Rojas, Sergio
Klubička, Filip
Toral, Antonio
2016-03-09T17:05:19Z
The fienWaC corpus version 1.0 consists of parallel Finnish-English texts crawled from the .fi top-level domain for Finland. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext, given the evaluation results on other languages, can be estimated at 74% on the segment level and 76% on the word level.
http://hdl.handle.net/11356/1060
Jožef Stefan Institute
CLARIN.SI User Licence for Internet Corpora
https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
parallel corpus
web corpus
multilingual
corpus
Text
oai:www.clarin.si:11356/10612023-03-27T17:01:16Zhdl_11356_1023hdl_11356_1024
Slovene-English parallel corpus slenWaC 1.0
2016-03-10T15:21:18Z
http://hdl.handle.net/11356/1061
Ljubešić, Nikola
Esplà-Gomis, Miquel
Ortiz Rojas, Sergio
Klubička, Filip
Toral, Antonio
2016-03-10T15:21:18Z
The slenWaC corpus version 1.0 consists of parallel Slovene-English texts crawled from the .si top-level domain for Slovenia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext on the segment level is around 67% and on the word level around 68%.
http://hdl.handle.net/11356/1061
Jožef Stefan Institute
CLARIN.SI User Licence for Internet Corpora
https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf
parallel corpus
web corpus
multilingual
corpus
Text
oai:www.clarin.si:11356/10642023-07-05T16:58:46Zhdl_11356_1023hdl_11356_1024
Croatian web corpus hrWaC 2.1
2016-05-12T16:25:34Z
http://hdl.handle.net/11356/1064
Ljubešić, Nikola
Klubička, Filip
2016-05-12T16:25:34Z
The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraphs. Each paragraph contains metadata on the URL, domain and language identification (Croatian vs. Serbian).
Version 2.0 of this corpus is described in http://www.aclweb.org/anthology/W14-0405. Version 2.1 contains newer and better linguistic annotations.
http://hdl.handle.net/11356/1064
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
web corpus
corpus
Text
oai:www.clarin.si:11356/10622023-07-05T16:58:45Zhdl_11356_1023hdl_11356_1024
Bosnian web corpus bsWaC 1.1
2016-05-12T15:14:59Z
http://hdl.handle.net/11356/1062
Ljubešić, Nikola
Klubička, Filip
2016-05-12T15:14:59Z
The Bosnian web corpus bsWaC was built by crawling the .ba top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraphs. Each paragraph contains metadata on the URL, domain and language identification (Bosnian vs. Croatian vs. Serbian).
Version 1.0 of this corpus is described in http://www.aclweb.org/anthology/W14-0405. Version 1.1 contains newer and better linguistic annotations.
http://hdl.handle.net/11356/1062
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
web corpus
corpus
Text
oai:www.clarin.si:11356/10632023-07-05T16:58:45Zhdl_11356_1023hdl_11356_1024
Serbian web corpus srWaC 1.1
2016-05-12T15:32:55Z
http://hdl.handle.net/11356/1063
Ljubešić, Nikola
Klubička, Filip
2016-05-12T15:32:55Z
The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. The corpus is shuffled by paragraphs. Each paragraph contains metadata on the URL, domain and language identification (Serbian vs. Croatian).
Version 1.0 of this corpus is described in http://www.aclweb.org/anthology/W14-0405. Version 1.1 contains newer and better linguistic annotations.
http://hdl.handle.net/11356/1063
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
web corpus
lemmatisation
corpus
Text
oai:www.clarin.si:11356/10652020-09-09T17:53:05Zhdl_11356_1023hdl_11356_1024
Post-edited and error annotated machine translation corpus PE²rr 1.0
2016-05-29T15:22:49Z
http://hdl.handle.net/11356/1065
Popović, Maja
Arčan, Mihael
2016-05-29T15:22:49Z
The PE²rr corpus contains source language texts from different domains along with their automatically generated translations into several morphologically rich languages, their post-edited versions, and error annotations of the performed post-edit operations. The main advantage of the corpus is the fusion of post-editing and error classification tasks, which have usually been seen as two independent tasks, although naturally they are not.
http://hdl.handle.net/11356/1065
Insight Centre for Data Analytics, National University of Ireland, Galway
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
parallel corpus
machine translation
post-editing
error annotation
manual annotation
multilingual
corpus
Text
oai:www.clarin.si:11356/10972023-11-21T11:56:57Zhdl_11356_1023hdl_11356_1024
Slovene sentiment lexicon KSS 1.1
2017-04-14T11:45:17Z
http://hdl.handle.net/11356/1097
Kadunc, Klemen
Robnik-Šikonja, Marko
2017-04-14T11:45:17Z
The Slovene opinion lexicon KSS is based on the manually translated opinion lexicon of Hu & Liu (2004). The lexicon is updated with some positive and negative words typical of the Slovene language. There are three versions of the lexicon.
1. Lexicon containing all word forms extended with Sloleks, a lexicon of Slovene word forms. It contains 90,620 entries, 62,941 negative word forms and 27,679 positive word forms.
2. Lexicon containing only lemmas, containing 5,125 negative words and 1,911 positive words.
3. The original version used in (Kadunc & Robnik-Šikonja, 2016), containing 6,687 negative entries and 2,645 positive entries.
Each version of the lexicon contains two files, one for negative and one for positive words in a text format, one word per line. The lexicon also contains some multi-word units where the individual words are joined with an underscore, e.g. "bolezenska_znamenja".
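The file layout described above (one entry per line, multi-word units joined with an underscore) can be read with a few lines of code. The sketch below is only an illustration: the sample entries other than "bolezenska_znamenja" are invented, the assignment of entries to the positive or negative list is an assumption, and in-memory text stands in for the actual lexicon files, whose names are not given here.

```python
import io


def load_lexicon(lines):
    """Read one entry per line; underscores separate the words of a unit."""
    entries = set()
    for line in lines:
        word = line.strip()
        if word:
            entries.add(tuple(word.lower().split("_")))
    return entries


def count_matches(tokens, lexicon):
    """Count lexicon entries (single or multi-word) occurring in tokens."""
    tokens = [t.lower() for t in tokens]
    max_len = max((len(entry) for entry in lexicon), default=0)
    hits = 0
    i = 0
    while i < len(tokens):
        # Prefer the longest entry that starts at position i.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + size]) in lexicon:
                hits += 1
                i += size
                break
        else:
            i += 1
    return hits


# Hypothetical file contents; real files would be opened with open(...).
negative = load_lexicon(io.StringIO("slab\nbolezenska_znamenja\n"))
positive = load_lexicon(io.StringIO("dober\n"))

tokens = ["bolezenska", "znamenja", "niso", "dober", "znak"]
negative_hits = count_matches(tokens, negative)
positive_hits = count_matches(tokens, positive)
```

Storing each entry as a tuple of words lets single words and multi-word units share one lookup set.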
The KSS lexicon was developed as part of a BSc thesis (Kadunc, 2016) and empirically evaluated on a corpus of web commentaries on various topics (business, politics, sport and others) from four Slovene web portals (RtvSlo, 24ur, Finance, Reporter). That corpus is available from http://hdl.handle.net/11356/1115
References:
1. Minqing Hu and Bing Liu (2004). Mining opinion features in customer reviews. In Proceedings of AAAI Conference on Artificial Intelligence, vol. 4, pp. 755–760 http://www.aaai.org/Papers/AAAI/2004/AAAI04-119.pdf
2. Klemen Kadunc (2016). Določanje sentimenta slovenskim spletnim komentarjem s pomočjo strojnega učenja. Diplomsko delo. Univerza v Ljubljani, Fakulteta za računalništvo in informatiko (in Slovene). http://eprints.fri.uni-lj.si/3317/
3. Klemen Kadunc, Marko Robnik-Šikonja (2016). Analiza mnenj s pomočjo strojnega učenja in slovenskega leksikona sentimenta. Conference on Language Technologies & Digital Humanities, Ljubljana (in Slovene), http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Kadunc-et-al_Analiza-mnenj-s-pomocjo-strojnega-ucenja.pdf
http://hdl.handle.net/11356/1097
Faculty of Computer and Information Science, University of Ljubljana
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
opinion lexicon
sentiment lexicon
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10662023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Inflectional lexicon srLex 1.1
2016-06-23T14:03:58Z
http://hdl.handle.net/11356/1066
Ljubešić, Nikola
2016-06-23T14:03:58Z
srLex is a large inflectional lexicon of the Serbian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the srWaC v1.2 corpus. The MSD tagset follows the MULTEXT-East V5 tagset for Bosnian available at http://nl.ijs.si/ME/V5/msd/html/msd-bs.html.
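A minimal sketch of reading such a 5-tuple entry and of how a per-million frequency relates to a raw count. It assumes tab-separated fields in the order listed above, which this record does not confirm; the sample entry, its counts and the corpus size are invented for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    wordform: str
    lemma: str
    msd: str
    frequency: int
    per_million: float


def parse_entry(line: str) -> Entry:
    """Split one line into the five fields (tab-separated by assumption)."""
    wordform, lemma, msd, freq, pm = line.rstrip("\n").split("\t")
    return Entry(wordform, lemma, msd, int(freq), float(pm))


def per_million(frequency: int, corpus_tokens: int) -> float:
    """Normalise a raw count to occurrences per million corpus tokens."""
    return frequency * 1_000_000 / corpus_tokens


# Invented sample line and corpus size, not taken from the lexicon.
sample = "kuće\tkuća\tNcfsg\t1200\t2.4\n"
entry = parse_entry(sample)
normalised = per_million(entry.frequency, 500_000_000)
```

Per-million values make counts comparable across lexicons built from corpora of different sizes.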
http://hdl.handle.net/11356/1066
http://hdl.handle.net/11356/1073
Faculty of Humanities and Social Sciences, University of Zagreb
http://hdl.handle.net/11356/1057
GNU General Public Licence, version 3
https://opensource.org/licenses/GPL-3.0
lexicon
morphology
inflection
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10672023-06-20T12:40:37Zhdl_11356_1023hdl_11356_1024
Inflectional lexicon hrLex 1.1
2016-06-24T08:16:55Z
http://hdl.handle.net/11356/1067
Ljubešić, Nikola
2016-06-24T08:16:55Z
hrLex is a large inflectional lexicon of the Croatian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the hrWaC v2.2 corpus. The MSD tagset follows the MULTEXT-East V5 tagset for Croatian available at https://nl.ijs.si/ME/V5/msd/html/msd-hr.html.
http://hdl.handle.net/11356/1067
http://hdl.handle.net/11356/1072
Faculty of Humanities and Social Sciences, University of Zagreb
http://hdl.handle.net/11356/1056
GNU General Public Licence, version 3
https://opensource.org/licenses/GPL-3.0
lexicon
morphology
inflection
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10692023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Spoken corpus Gos VideoLectures 1.0 (transcription)
2016-08-02T09:42:47Z
http://hdl.handle.net/11356/1069
Verdonik, Darinka
Potočnik, Tomaž
Sepesy Maučec, Mirjam
Erjavec, Tomaž
2016-08-02T09:42:47Z
Gos Videolectures is an add-on to the Gos reference speech corpus of Slovene (http://hdl.handle.net/11356/1040) and covers public academic speech. The Gos Videolectures recordings are a selection of public lectures available through the web portal Videolectures.net, provided by the Jožef Stefan Institute; the first release covers 4.5 hours of speech.
This resource contains only the transcriptions of the corpus - the audio recordings are available at CLARIN.SI handle http://hdl.handle.net/11356/1070.
All transcriptions for Gos Videolectures were done manually and carefully checked. The main guidelines for transcription were those of the Gos corpus (http://www.korpus-gos.net/Support/About). The transcription tool Transcriber 1.5.1 (http://trans.sourceforge.net/en/presentation.php) was used for making transcriptions. It can also be used for reading or exporting transcriptions (.trs files) to different formats.
The transcriptions comprise the TRS files with tabular metadata, their conversion to TEI and to the CWB vertical file format. Each recording has two TRS files, one with the phonetic and the other with the normalised transcription. The TEI and CWB encodings join these two transcriptions at the token level, with the normalised words being also automatically PoS tagged and lemmatised.
The corpus can be used for training continuous speech recognition for the Slovene language, for phonetic research or for any other research on Slovene academic speech.
http://hdl.handle.net/11356/1069
http://hdl.handle.net/11356/1158
Faculty of Electrical Engineering and Computer Science, University of Maribor
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
speech database
spoken corpus
academic speech
speech transcription
speech recognition
TEI
corpus
Text
oai:www.clarin.si:11356/10702023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Spoken corpus Gos VideoLectures 1.0 (audio)
2016-08-02T10:21:31Z
http://hdl.handle.net/11356/1070
VideoLectures.NET
2016-08-02T10:21:31Z
Gos VideoLectures is an add-on to the Gos reference speech corpus of Slovene (http://hdl.handle.net/11356/1040) and covers public academic speech. The Gos Videolectures recordings are a selection of public lectures available through the web portal Videolectures.net, provided by the Jožef Stefan Institute; the first release covers 4.5 hours of speech.
This resource contains only the audio recordings of the corpus - the transcriptions are available at CLARIN.SI handle http://hdl.handle.net/11356/1069.
http://hdl.handle.net/11356/1070
http://hdl.handle.net/11356/1159
VideoLectures.NET
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
speech database
spoken corpus
academic speech
speech recognition
speech recordings
corpus
Text
oai:www.clarin.si:11356/10682023-06-20T12:41:07Zhdl_11356_1023hdl_11356_1024
Dataset of normalised Slovene text KonvNormSl 1.0
2016-09-01T23:00:07Z
http://hdl.handle.net/11356/1068
Ljubešić, Nikola
Zupan, Katja
Fišer, Darja
Erjavec, Tomaž
2016-07-27T12:15:05Z
Data used in the experiments described in:
Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf
(https://www.linguistics.rub.de/konvens16/)
Data are split into the "token" folder (experiment on normalising individual tokens) and "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample, the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines.
There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850 - 1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)
The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (https://nl.ijs.si/janes/english/).
The text in the files has been split by inserting spaces between characters, with underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark '¿' character.
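The character-level encoding described above can be reversed with a short helper. The sketch below is an illustration under the stated conventions (characters separated by spaces, underscore standing for a space); the example strings are invented and are not taken from the dataset.

```python
def decode_line(line: str) -> str:
    """Turn 'j s t _ g r e m' back into 'jst grem'."""
    chars = line.split()
    return "".join(" " if c == "_" else c for c in chars)


def encode_text(text: str) -> str:
    """Inverse of decode_line for text without literal underscores."""
    return " ".join("_" if c == " " else c for c in text)


# Invented line-aligned pair in the style of *.orig.txt / *.norm.txt.
orig_lines = ["j s t _ g r e m", "¿ _ d a n e s"]
norm_lines = ["j a z _ g r e m", "¿ _ d a n e s"]

# Files are aligned by lines, so zip pairs each original with its normalisation.
pairs = [(decode_line(o), decode_line(n)) for o, n in zip(orig_lines, norm_lines)]
```

The '¿' placeholder passes through unchanged, so tokens excluded from normalisation stay marked after decoding.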
http://hdl.handle.net/11356/1068
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
word normalisation
historical language
computer-mediated communication
experimental data
manual annotation
corpus
Text
oai:www.clarin.si:11356/10712023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Dataset of European Parliament roll-call votes and Twitter activities MEP 1.0
2016-08-05T13:01:15Z
http://hdl.handle.net/11356/1071
Cherepnalkoski, Darko
Karpf, Andreas
Mozetič, Igor
Grčar, Miha
2016-08-05T13:01:15Z
The resource consists of two datasets related to Members of the 8th European Parliament (MEPs). The first one is a dataset of 2,535 roll-call votes of MEPs until 2016-03-01. The second one is a dataset of 26,133 retweets between MEPs in the period between 2014-10-01 and 2016-03-01. The data can be used to examine the patterns of covoting and retweeting of MEPs and analyze the extent to which they are similar.
The resource is presented and used in the paper:
Darko Cherepnalkoski, Andreas Karpf, Igor Mozetič, Miha Grčar "Cohesion and coalition formation in the European Parliament: Roll-call votes and Twitter activities". PLoS ONE 11(11): e0166586, 2016. http://dx.doi.org/10.1371/journal.pone.0166586
The dataset contains 5 files, of which 3 contain metadata and 2 data.
The metadata comprises information about the Members of the 8th European Parliament (MEPs) until 2016-03-01, about roll-call votes (RCVs) and about the possible actions during an RCV. The first data file contains a matrix with the votes of all MEPs during all RCVs, while the second contains the retweets between the MEPs.
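To illustrate how a vote matrix of this kind could be used to examine co-voting, the sketch below computes the share of roll-call votes on which two MEPs cast the same vote. This simple agreement ratio is an assumption chosen for brevity and is not claimed to be the measure used in the cited paper. The vote codes, MEP identifiers and in-memory matrix are likewise invented; the actual file format is not described in this record.

```python
from itertools import combinations


def covoting_agreement(votes_a, votes_b):
    """Share of jointly attended RCVs on which two MEPs voted the same way."""
    shared = [
        (a, b)
        for a, b in zip(votes_a, votes_b)
        if a is not None and b is not None
    ]
    if not shared:
        return None  # The two MEPs never voted in the same RCV.
    return sum(1 for a, b in shared if a == b) / len(shared)


# Hypothetical matrix: one row per MEP, one column per RCV, None = absent.
votes = {
    "mep_1": ["for", "against", "for", None],
    "mep_2": ["for", "for", "for", "abstain"],
    "mep_3": [None, "against", "against", "abstain"],
}

agreement = {
    (a, b): covoting_agreement(votes[a], votes[b])
    for a, b in combinations(sorted(votes), 2)
}
```

Skipping votes where either MEP was absent keeps absences from being counted as disagreement.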
http://hdl.handle.net/11356/1071
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
European parliament
roll-call votes
Twitter
multilingual
corpus
Text
oai:www.clarin.si:11356/10742023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Finnish web corpus fiWaC 1.0
2016-09-26T13:04:25Z
http://hdl.handle.net/11356/1074
Ljubešić, Nikola
Pirinen, Tommi
Toral, Antonio
2016-09-26T13:04:25Z
The Finnish web corpus fiWaC was built by crawling the .fi top-level domain in 2015 for both Finnish and English documents. The corpus was naively tokenised (via spaces), near-deduplicated on paragraph level and paragraph-shuffled. Each paragraph contains metadata on the URL and language identification. The Finnish (~1.7B tokens) and English (~2B tokens) parts of the corpus are organised in separate files.
http://hdl.handle.net/11356/1074
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
web corpus
corpus
Text
oai:www.clarin.si:11356/10722023-06-20T12:41:39Zhdl_11356_1023hdl_11356_1024
Inflectional lexicon hrLex 1.2
2016-09-19T13:02:52Z
http://hdl.handle.net/11356/1072
Ljubešić, Nikola
Klubička, Filip
Boras, Damir
2016-09-19T13:02:52Z
hrLex is a large inflectional lexicon of the Croatian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the hrWaC v2.2 corpus. The MSD tagset follows the MULTEXT-East V5 tagset for Croatian available at https://nl.ijs.si/ME/V5/msd/html/msd-hr.html.
http://hdl.handle.net/11356/1072
http://hdl.handle.net/11356/1232
Faculty of Humanities and Social Sciences, University of Zagreb
http://hdl.handle.net/11356/1067
GNU General Public Licence, version 3
https://opensource.org/licenses/GPL-3.0
morphology
inflection
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10732023-06-20T12:42:16Zhdl_11356_1023hdl_11356_1024
Inflectional lexicon srLex 1.2
2016-09-19T13:11:03Z
http://hdl.handle.net/11356/1073
Ljubešić, Nikola
Klubička, Filip
Boras, Damir
2016-09-19T13:11:03Z
srLex is a large inflectional lexicon of the Serbian language where each entry consists of a (wordform, lemma, MSD, frequency, per-million frequency) 5-tuple. The (wordform, lemma, MSD) triple frequencies are calculated on the srWaC v1.2 corpus. The MSD tagset follows the MULTEXT-East V5 tagset for Bosnian available at https://nl.ijs.si/ME/V5/msd/html/msd-bs.html.
http://hdl.handle.net/11356/1073
http://hdl.handle.net/11356/1233
Faculty of Humanities and Social Sciences, University of Zagreb
http://hdl.handle.net/11356/1066
GNU General Public Licence, version 3
https://opensource.org/licenses/GPL-3.0
morphology
inflection
lexicalConceptualResource
Text
oai:www.clarin.si:11356/10752023-11-02T15:26:59Zhdl_11356_1023hdl_11356_1024
Slovenian parliamentary corpus (1990-1992) SlovParl 1.0
2016-10-24T10:28:29Z
http://hdl.handle.net/11356/1075
Pančur, Andrej
Šorn, Mojca
Erjavec, Tomaž
2016-10-24T10:28:29Z
The SlovParl corpus contains minutes of the Chamber of Associated Labour of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after Slovenia became an independent country in 1991. The corpus comprises 54 sessions, 13,894 speeches and almost 2.7 million words. The corpus contains extensive meta-data about the speakers, a typology of sessions etc. and structural and editorial annotations.
This item comprises three datasets:
- the corpus in TEI P5 (module Transcriptions of speech);
- the corpus in TEI P5 with added automatic linguistic annotation: tokenisation, MSD tagging and lemmatisation;
- the corpus in vertical format used by various concordancers, e.g. CWB and Sketch Engine; this format is simpler and smaller but does not contain all the information from the source TEI.
The SlovParl data originally come from https://github.com/SIstory/SlovParl, but have been converted to use TEI elements for speech. This version of the corpus corresponds to commit https://github.com/DARIAH-SI/CLARIN.SI/tree/5984661e7b19e054b3fb650f4d2d5d409b3d7e3d
The resource is presented in the paper:
Pančur, Andrej. "Označevanje zbirke zapisnikov sej slovenskega parlamenta s smernicami TEI." In the Proceedings of the Conference on Language Technologies & Digital Humanities (Tomaž Erjavec and Darja Fišer, eds.) 142-148. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani, 2016. http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Pancur_Oznacevanje-zbirke-zapisnikov-sej-slovenskega-parlamenta.pdf
http://hdl.handle.net/11356/1075
http://hdl.handle.net/11356/1167
Institute of Contemporary History
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
Slovenian Parliament
parliamentary debates
TEI
corpus
Text
oai:www.clarin.si:11356/10782023-06-20T12:42:53Zhdl_11356_1023hdl_11356_1024
xLiMe Twitter Corpus XTC 1.0.1
2016-11-28T13:47:36Z
http://hdl.handle.net/11356/1078
Rei, Luis
Krek, Simon
Mladenić, Dunja
2016-11-28T13:47:36Z
The xLiMe Twitter Corpus contains tweets in German, Italian and Spanish manually annotated with part-of-speech, named entities, and message-level sentiment polarity. In total, the corpus contains almost 20K annotated messages and 350K tokens.
The corpus is described in
Luis Rei, Dunja Mladenić, Simon Krek. A Multilingual Social Media Linguistic Corpus. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities. 27–28 September 2016, Ljubljana, Slovenia. https://nl.ijs.si/janes/cmc-corpora2016/proceedings/
http://hdl.handle.net/11356/1078
Jožef Stefan Institute
The MIT License (MIT)
https://opensource.org/licenses/mit-license.php
social media
computer-mediated communication
Twitter
part-of-speech tagging
named entities
sentiment classification
multilingual
manual annotation
corpus
Text
oai:www.clarin.si:11356/10792023-03-27T17:01:17Zhdl_11356_1023hdl_11356_1024
CMC training corpus Janes-Tag 1.0
2016-12-22T10:10:03Z
http://hdl.handle.net/11356/1079
Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
2016-12-18T16:00:18Z
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
The corpus is further described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1080.
http://hdl.handle.net/11356/1079
http://hdl.handle.net/11356/1081
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/10802023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
CMC training corpus Janes-Norm 1.0
2016-12-22T10:10:03Z
http://hdl.handle.net/11356/1080
Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
2016-12-18T16:10:03Z
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. The corpus is also automatically annotated with morphosyntactic descriptions and lemmas. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
The corpus is further described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
Note that a related corpus, Janes-Tag, is also available, cf. http://hdl.handle.net/11356/1079.
http://hdl.handle.net/11356/1080
http://hdl.handle.net/11356/1084
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
word normalisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1086 2023-07-05T16:58:49Z hdl_11356_1023 hdl_11356_1024
CMC training corpus Janes-Syn 1.0
2017-01-03T11:38:46Z
http://hdl.handle.net/11356/1086
Arhar Holdt, Špela
Erjavec, Tomaž
Fišer, Darja
2017-01-03T11:38:46Z
Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene computer-mediated communication and for detailed linguistic explorations which require highly accurate and reliable annotations. Words in the dataset are normalised, lemmatised, PoS-tagged and syntactically annotated with the JOS dependency model (http://eng.slovenscina.eu/tehnologije/razclenjevalnik). The annotations on all levels were manually corrected.
The corpus creation and structure are described in:
ARHAR HOLDT, Špela, FIŠER, Darja, ERJAVEC, Tomaž, KREK, Simon. Syntactic annotation of Slovene CMC : first steps. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 27-28 September 2016, Ljubljana, Slovenia, 2016, pp. 3-6. https://nl.ijs.si/janes/cmc-corpora2016/proceedings/
Janes-Syn was created from two larger corpora that are also available in the repository: Janes-Norm (http://hdl.handle.net/11356/1084) and Janes-Tag (http://hdl.handle.net/11356/1123).
http://hdl.handle.net/11356/1086
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
dependency treebank
syntactic annotation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1087 2023-07-05T16:58:50Z hdl_11356_1023 hdl_11356_1024
CMC shortening corpus Janes-Kratko 1.0
2017-01-20T14:05:33Z
http://hdl.handle.net/11356/1087
Goli, Teja
Osrajnik, Eneja
Fišer, Darja
Erjavec, Tomaž
2017-01-20T14:05:33Z
Janes-Kratko is a corpus of Slovene tweets manually annotated with shortening phenomena according to the supplied typology covering different types of spelling, lexical and syntactic shortenings. The corpus was sampled from the Janes-Norm corpus (http://hdl.handle.net/11356/1084), which was manually annotated for tokenisation, sentence segmentation and word normalisation of non-standard Slovene and automatically annotated with morphosyntactic descriptions and lemmas.
The corpus is further described in:
GOLI, Teja, OSRAJNIK, Eneja, FIŠER, Darja. Analiza krajšanja slovenskih sporočil na družbenem omrežju Twitter. Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, Slovenia. 2016, pp. 77-82. http://www.sdjt.si/wp/dogodki/konference/jtdh-2016/zbornik/
http://hdl.handle.net/11356/1087
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
Twitter
shortening phenomena
TEI
manual annotation
corpus
Text
oai:www.clarin.si:11356/1081 2023-03-27T17:01:18Z hdl_11356_1023 hdl_11356_1024
CMC training corpus Janes-Tag 1.1
2016-12-28T11:40:50Z
http://hdl.handle.net/11356/1081
Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
2016-12-28T11:40:50Z
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
The corpus is further described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1083.
http://hdl.handle.net/11356/1081
http://hdl.handle.net/11356/1085
Jožef Stefan Institute
http://hdl.handle.net/11356/1079
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1083 2023-03-27T17:01:18Z hdl_11356_1023 hdl_11356_1024
CMC training corpus Janes-Norm 1.1
2016-12-28T11:41:07Z
http://hdl.handle.net/11356/1083
Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
2016-12-28T11:41:07Z
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. The corpus is also automatically annotated with morphosyntactic descriptions and lemmas. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
The corpus is further described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
Note that a related corpus, Janes-Tag, is also available, cf. http://hdl.handle.net/11356/1081.
http://hdl.handle.net/11356/1083
http://hdl.handle.net/11356/1084
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
word normalisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1084 2023-07-05T16:58:48Z hdl_11356_1023 hdl_11356_1024
CMC training corpus Janes-Norm 1.2
2016-12-30T13:53:05Z
http://hdl.handle.net/11356/1084
Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
2016-12-30T13:53:05Z
Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
A slightly older version of this corpus is described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
Note that the corpus is also annotated with morphosyntactic descriptions and lemmas. These annotations are manual where the texts correspond to the Janes-Tag corpus (http://hdl.handle.net/11356/1085) and automatic for the other texts.
http://hdl.handle.net/11356/1084
http://hdl.handle.net/11356/1733
Jožef Stefan Institute
http://hdl.handle.net/11356/1083
http://hdl.handle.net/11356/1080
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
word normalisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1085 2023-07-10T12:58:45Z hdl_11356_1023 hdl_11356_1024
CMC training corpus Janes-Tag 1.2
2016-12-30T14:02:38Z
http://hdl.handle.net/11356/1085
Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
Ljubešić, Nikola
2016-12-30T14:02:38Z
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.
A slightly older version of this corpus is described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1084.
http://hdl.handle.net/11356/1085
http://hdl.handle.net/11356/1123
Jožef Stefan Institute
http://hdl.handle.net/11356/1079
http://hdl.handle.net/11356/1081
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1088 2023-07-05T16:58:50Z hdl_11356_1023 hdl_11356_1024
Tweet comma corpus Janes-Vejica 1.0
2017-02-16T12:28:26Z
http://hdl.handle.net/11356/1088
Popič, Damjan
Zupan, Katja
Logar, Polona
Kavčič, Teja
Erjavec, Tomaž
Fišer, Darja
2017-02-16T12:28:26Z
Janes-Vejica is a corpus of Slovene tweets where commas are annotated with the reason for their (in)correct use, according to the supplied typology. The corpus was sampled from the Janes-Norm corpus (http://hdl.handle.net/11356/1084), which was manually annotated for tokenisation, sentence segmentation, and word normalisation, and automatically for morphosyntactic descriptions and lemmas.
The corpus is further described in:
POPIČ, Damjan, FIŠER, Darja, ZUPAN, Katja, LOGAR, Polona. Raba vejice v uporabniških spletnih vsebinah. Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, Slovenia. 2016, pp. 149-153. http://www.sdjt.si/wp/dogodki/konference/jtdh-2016/zbornik/
http://hdl.handle.net/11356/1088
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
Twitter
comma placement
TEI
manual annotation
corpus
Text
oai:www.clarin.si:11356/1090 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
ZRCola 2
2017-04-26T11:35:47Z
http://hdl.handle.net/11356/1090
Ježovnik, Janoš
Weiss, Peter
Amebis, d.o.o.
2017-04-26T11:35:47Z
ZRCola is an input system designed mainly, although not exclusively, for linguistic use. It allows the user to combine basic letters with any diacritic marks and insert the resulting complex characters into the texts with ease.
The system consists of an input program and a font, which can also be installed separately. The font is based on the Unicode standard and includes a vastly enlarged set of Latin, Cyrillic and other characters for Slavic writing systems in the Private Use Area.
http://hdl.handle.net/11356/1090
ZRC SAZU
GNU General Public Licence, version 3
https://opensource.org/licenses/GPL-3.0
character input
Unicode
input system
toolService
Software
oai:www.clarin.si:11356/1091 2023-10-29T11:04:46Z hdl_11356_1023 hdl_11356_1024
Dictionary of New Slovenian Words
2017-05-15T09:49:35Z
http://hdl.handle.net/11356/1091
Bizjak Končar, Aleksandra
Gložančev, Alenka
Kern, Boris
Kostanjevec, Polona
Krvina, Domen
Ledinek, Nina
Michelizza, Mija
Perdih, Andrej
Petric, Špela
Snoj, Marko
Šircelj Žnidaršič, Ivanka
Žele, Andreja
Mirtič, Tanja
Gliha Komac, Nataša
Klemenčič, Simona
2017-05-15T09:49:35Z
Slovar novejšega besedja slovenskega jezika (Dictionary of New Slovenian Words) represents a basic new lexical supplement to the Slovar slovenskega knjižnega jezika (Dictionary of the Slovenian Standard Language). It contains 6399 new words and phrases that appeared in Slovenian or gained ground after 1991 as well as new meanings of previously standardised lexis. Two important new features of the dictionary are a corpus-driven analysis of new words that are in actual language use and etymological explanations of the included words.
This dictionary was published as a printed book:
Bizjak Končar, Aleksandra, Snoj, Marko, Gložančev, Alenka, Kern, Boris, Kostanjevec, Polona, Krvina, Domen, Ledinek, Nina, Michelizza, Mija, Perdih, Andrej, Petric, Špela, Šircelj-Žnidaršič, Ivanka, Žele, Andreja, Mirtič, Tanja, Gliha Komac, Nataša, Klemenčič, Simona. Slovar novejšega besedja slovenskega jezika. Ljubljana : Založba ZRC, ZRC SAZU, 2012. ISBN 978-961-254-413-3.
http://hdl.handle.net/11356/1091
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
dictionary
neologism
lexicography
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1092 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Dictionary of the Slovenian Language in the Works of Janez Svetokriški
2017-05-15T09:47:34Z
http://hdl.handle.net/11356/1092
Snoj, Marko
2017-05-15T09:47:34Z
The Dictionary of the Slovenian Language in the Works of Janez Svetokriški (Slovar jezika Janeza Svetokriškega) presents and explains the lexis, including proper nouns, from 233 sermons published by Janez Svetokriški in five volumes under the common title Sacrum promptuarium between 1691 and 1707. The dictionary contains 8,540 dictionary entries, which display and treat the entire Slovenian lexis, including proper nouns, used in the above-mentioned work. Each dictionary entry consists of 1. the headword, 2. the presentation of morphological characteristics, 3. the description of meaning and 4. examples of use. Entries containing loanwords additionally include etymologies. Some entries may also provide other philological or linguistic comments. Each entry describing a proper noun ends with the most basic encyclopaedic information. The Dictionary of the Slovenian Language in the Works of Janez Svetokriški is the first dictionary to treat the lexis of a Slovenian author from a period before the introduction of Gaj's Latin alphabet. The dictionary is distinguished by a modern, but not too complex display of material, by comprehensive citations of all attested variants and by the inclusion of encyclopaedic information about proper nouns, all of which in many respects facilitates the reading of the original baroque text or makes it possible in the first place.
This dictionary was published as a printed book:
Snoj, Marko. Slovar jezika Janeza Svetokriškega. Ljubljana : Založba ZRC, 2006. ISBN 961-6568-45-0.
http://hdl.handle.net/11356/1092
Slovenian Academy of Sciences and Arts
dr. Bruno Breschi Foundation
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
dictionary
historical language
lexicography
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1094 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Reverse dictionary of Slovenian language
2017-04-26T11:39:25Z
http://hdl.handle.net/11356/1094
Hajnšek-Holz, Milena
Jakopin, Primož
2017-04-26T11:39:25Z
Reverse dictionary of Slovenian language contains 115,355 headwords and is based on the Dictionary of the Slovenian Standard Language (DSSL). Headwords are sorted a tergo (by last-to-first letter order) and include headwords from DSSL and their variants, full subheadwords and their variants, short subheadwords and subheadwords listed as special verb forms. In addition, information on oblique forms, pronunciation, part of speech and dynamic accent is included.
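As a rough illustration of a tergo ordering, the sketch below sorts a few words by their reversed spelling, so that final letters are compared first. The sample words are arbitrary and are not taken from the dictionary. The comparison uses plain Unicode code points, which is an assumption: the dictionary presumably follows Slovenian alphabetical order (for example č after c), which this sketch does not reproduce.

```python
def a_tergo_key(word: str) -> str:
    """Return the word spelled backwards, so sorting compares final letters first."""
    return word[::-1]


def sort_a_tergo(words):
    """Sort words by last letter, then second-to-last letter, and so on."""
    return sorted(words, key=a_tergo_key)


# Hypothetical sample headwords, chosen only for illustration.
sample = ["miza", "roka", "noga", "voda"]
print(sort_a_tergo(sample))
```

All four sample words end in "a", so the order is decided by the second-to-last letter: "voda" (d) comes first and "miza" (z) last.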
This dictionary was published as a printed book:
Hajnšek-Holz, Milena, Jakopin, Primož. Odzadnji slovar slovenskega jezika po Slovarju slovenskega knjižnega jezika. Ljubljana : Znanstvenoraziskovalni center Slovenske akademije znanosti in umetnosti : Slovenska akademija znanosti in umetnosti, 1996. ISBN 961-6182-19-6.
http://hdl.handle.net/11356/1094
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
dictionary
lexicography
reverse dictionary
dictionary a tergo
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1095 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Croatian Twitter training corpus ReLDI-NormTag-hr 1.0
2017-04-04T07:59:06Z
http://hdl.handle.net/11356/1095
Ljubešić, Nikola
Farkaš, Daša
Klubička, Filip
Erjavec, Tomaž
Miličević, Maja
Filko, Matea
Kranjčić, Denis
Dujmić, Barbara
2017-04-04T07:59:06Z
ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).
The corpus construction is (partially) described in:
MILIČEVIĆ, Maja, LJUBEŠIĆ, Nikola. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0: empirical, applied and interdisciplinary research, 4/2, 2016. ISSN 2335-2736. http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
http://hdl.handle.net/11356/1095
http://hdl.handle.net/11356/1121
Jožef Stefan Institute
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1096 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Serbian Twitter training corpus ReLDI-NormTag-sr 1.0
2017-04-04T09:10:17Z
http://hdl.handle.net/11356/1096
Ljubešić, Nikola
Farkaš, Daša
Klubička, Filip
Erjavec, Tomaž
Miličević, Maja
Vuković, Teodora
2017-04-04T09:10:17Z
ReLDI-NormTag-sr 1.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).
The corpus construction is (partially) described in:
MILIČEVIĆ, Maja, LJUBEŠIĆ, Nikola. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0: empirical, applied and interdisciplinary research, 4/2, 2016. ISSN 2335-2736. http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
http://hdl.handle.net/11356/1096
http://hdl.handle.net/11356/1120
Jožef Stefan Institute
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1105 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
R crawlers for five Slovenian web media 1.0
2017-04-23T17:46:05Z
http://hdl.handle.net/11356/1105
Bučar, Jože
2017-04-23T17:46:05Z
Five web-crawlers written in the R language for retrieving Slovenian texts from the news portals 24ur, Dnevnik, Finance, Rtvslo, and Žurnal24. These portals contain political, business, economic and financial content.
http://hdl.handle.net/11356/1105
Faculty of Information Studies Novo mesto
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
web crawling
R
toolService
Software
oai:www.clarin.si:11356/1109 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Automatically sentiment annotated Slovenian news corpus AutoSentiNews 1.0
2017-05-10T07:36:04Z
http://hdl.handle.net/11356/1109
Bučar, Jože
2017-05-10T07:36:04Z
The corpus contains 256,567 documents from the Slovenian news portals 24ur, Dnevnik, Finance, Rtvslo, and Žurnal24. These portals contain political, business, economic and financial content. The submission contains 7 files: 5 of them, which are named after the news portal, contain raw news in txt format retrieved with R crawlers for five Slovenian web media 1.0 (http://hdl.handle.net/11356/1105). The file AutoSentiNews consists of 5 text files that contain 256,567 news articles annotated as positive, negative or neutral at the document level. 10,427 of them were manually annotated (cf. Manually sentiment annotated Slovenian news corpus SentiNews 1.0, http://hdl.handle.net/11356/1110) and the remaining 246,140 news articles were annotated automatically. The file SloStopWords contains 1,784 stop words for Slovene.
http://hdl.handle.net/11356/1109
Faculty of Information Studies Novo mesto
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
news corpus
sentiment classification
opinion mining
corpus
Text
oai:www.clarin.si:11356/1110 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Manually sentiment annotated Slovenian news corpus SentiNews 1.0
2017-04-29T07:32:06Z
http://hdl.handle.net/11356/1110
Bučar, Jože
2017-04-29T07:32:06Z
Between 2 and 6 annotators independently sentiment annotated a stratified random sample of 10,427 documents from the Slovenian news portals 24ur, Dnevnik, Finance, Rtvslo, and Žurnal24. These portals contain political, business, economic and financial content. The texts were annotated using the five-level Likert scale (1 – very negative, 2 – negative, 3 – neutral, 4 – positive, and 5 – very positive) on three levels of granularity, i.e. on the document, paragraph, and sentence level.
http://hdl.handle.net/11356/1110
Faculty of Information Studies Novo mesto
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
news corpus
sentiment classification
opinion mining
manual annotation
corpus
Text
oai:www.clarin.si:11356/1112 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Slovene sentiment lexicon JOB 1.0
2017-05-09T11:03:02Z
http://hdl.handle.net/11356/1112
Bučar, Jože
2017-05-09T11:03:02Z
The JOB lexicon for sentiment analysis of Slovenian texts contains a list of 25,524 headwords from the List of Slovenian headwords 1.1 (http://hdl.handle.net/11356/1038) extended with sentiment ratings based on the AFINN model with an integer between -5 (very negative) and +5 (very positive). The ratings are derived from the lemmatized version of the Manually sentiment annotated Slovenian (sentence-based) news corpus SentiNews 1.0 (http://hdl.handle.net/11356/1110).
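One common way to use a lexicon with AFINN-style integer ratings is to sum the ratings of the lemmas found in a text. The sketch below shows that idea only. The three lexicon entries and their ratings are invented for illustration and are not taken from JOB, and the function name is hypothetical; the file format of the actual lexicon is not described here.

```python
def score_lemmas(lemmas, lexicon):
    """Sum the integer ratings of all lemmas found in the lexicon.

    Lemmas missing from the lexicon contribute 0.
    """
    return sum(lexicon.get(lemma, 0) for lemma in lemmas)


# Invented toy entries on the -5..+5 scale; NOT real JOB ratings.
toy_lexicon = {"dober": 3, "slab": -3, "odličen": 5}

# A lemmatised toy sentence ("film", "biti", "dober").
print(score_lemmas(["film", "biti", "dober"], toy_lexicon))
```

A positive total suggests positive sentiment and a negative total negative sentiment; where to place the neutral cut-off is a design choice left open in this sketch.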
http://hdl.handle.net/11356/1112
Faculty of Information Studies Novo mesto
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
sentiment lexicon
opinion lexicon
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1114 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Slovenian-German Dictionary of Maks Pleteršnik (1894-1895)
2017-07-01T10:47:43Z
http://hdl.handle.net/11356/1114
Pleteršnik, Maks
Furlan, Metka
Dobrovoljc, Helena
Jazbec, Helena
2017-07-01T10:47:43Z
The Slovenian-German Dictionary of Maks Pleteršnik was first published in 1894-1895. It contains 103,185 dictionary entries. Besides the standard and dialect lexis of 19th-century Slovenian, it also includes an important part of the lexis from the 16th century onwards. The dictionary is based on lexical material collected by Oroslav Caf, Fran Miklošič, Fran Levstik, Fran Erjavec etc., and is enriched with the lexis of literature, newspapers, specialized literature and dictionaries. The dictionary was re-published by ZRC SAZU in 2006 and this publication is the source of the XML encoded version of this repository entry.
This dictionary was published as a printed book:
• Original edition: Pleteršnik, Maks. Slovensko-nemški slovar. V Ljubljani : Knezoškofijstvo, 1894.
• Re-published edition: Pleteršnik, Maks. Slovensko-nemški slovar. Transliterirana izd. Ljubljana : Založba ZRC, ZRC SAZU, 2006. ISBN 961-6568-31-0.
http://hdl.handle.net/11356/1114
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
dictionary
lexicography
historical language
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1115 2023-11-21T11:58:12Z hdl_11356_1023 hdl_11356_1024
Opinion corpus of Slovene web commentaries KKS 1.001
2017-05-28T08:41:45Z
http://hdl.handle.net/11356/1115
Kadunc, Klemen
Robnik-Šikonja, Marko
2017-05-28T08:41:45Z
The corpus of web commentaries with sentiment categorizations was developed as part of a BSc thesis (Kadunc, 2016) and served for the evaluation of the Slovene Sentiment Lexicon KSS (http://hdl.handle.net/11356/1097). It contains web commentaries about different topics (business, politics, sport, and other) from 4 Slovene web portals (RtvSlo, 24ur, Finance, Reporter). The corpus is in XML format and available in two forms:
- original corpus, containing 4,777 commentaries: 898 positive, 3,291 negative and 588 neutral.
- balanced corpus, a subset of the original corpus, containing 1,740 commentaries, 580 of each type of sentiment (positive, negative and neutral).
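The record does not say how the balanced subset was drawn. One plausible way, shown here purely as an assumption, is to sample the same number of commentaries at random from each sentiment class. The sketch uses placeholder items that only mirror the class sizes given above; the function name and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict


def balanced_subset(items, per_class, seed=0):
    """Randomly keep `per_class` items for each label.

    `items` is a list of (text, label) pairs. Raises ValueError if a
    class has fewer than `per_class` items.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for text, label in items:
        by_label[label].append((text, label))

    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} items")
        subset.extend(rng.sample(group, per_class))
    return subset


# Placeholder commentaries that only reproduce the reported class sizes.
counts = {"positive": 898, "negative": 3291, "neutral": 588}
corpus = [(f"{label}-{i}", label) for label, n in counts.items() for i in range(n)]

subset = balanced_subset(corpus, per_class=580)
print(len(corpus), len(subset))
```

With 580 items per class the subset has 3 × 580 = 1,740 items, matching the reported size of the balanced corpus, and the three class sizes add up to the 4,777 commentaries of the original corpus.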
References:
Klemen Kadunc (2016). Določanje sentimenta slovenskim spletnim komentarjem s pomočjo strojnega učenja. Diplomsko delo. Univerza v Ljubljani, Fakulteta za računalništvo in informatiko (in Slovene). http://eprints.fri.uni-lj.si/3317/
Klemen Kadunc, Marko Robnik-Šikonja (2016). Analiza mnenj s pomočjo strojnega učenja in slovenskega leksikona sentimenta. Conference on Language Technologies & Digital Humanities, Ljubljana (in Slovene). http://www.sdjt.si/wp/dogodki/konference/jtdh-2016/zbornik/
http://hdl.handle.net/11356/1115
Faculty of Computer and Information Science, University of Ljubljana
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
web commentaries
opinion corpus
sentiment analysis
corpus
Text
oai:www.clarin.si:11356/1119 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
ccGigafida ARPA language model 1.0
2017-05-05T16:33:24Z
http://hdl.handle.net/11356/1119
Kadivec, Jože
Robnik-Šikonja, Marko
Vintar, Špela
2017-05-05T16:33:24Z
The ccGigafida ARPA language model was created from the ccGigafida written corpus of Slovenian (https://www.clarin.si/repository/xmlui/handle/11356/1035) using the KenLM toolkit in the Moses machine translation framework. It is a general language model of contemporary standard Slovenian that can be used as a language model in statistical machine translation systems.
The language model was created as part of the master's thesis:
Kadivec, Jože. 2016. Prilagoditev statističnega strojnega prevajalnika za specifično domeno v slovenskem jeziku (Domain specific adaptation of a statistical machine translation engine in Slovene language). Master's thesis, Faculty of computer and information science, University of Ljubljana. https://repozitorij.uni-lj.si/IzpisGradiva.php?id=84815
http://hdl.handle.net/11356/1119
Faculty of Computer and Information Science, University of Ljubljana
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
Moses language model
probability language model
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1120 2023-07-05T16:58:53Z hdl_11356_1023 hdl_11356_1024
Serbian Twitter training corpus ReLDI-NormTag-sr 1.1
2017-05-15T13:43:51Z
http://hdl.handle.net/11356/1120
Ljubešić, Nikola
Farkaš, Daša
Klubička, Filip
Erjavec, Tomaž
Miličević, Maja
Vuković, Teodora
2017-05-15T13:43:51Z
ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 1.0, 1.1 corrects some minor errors.
The corpus construction is (partially) described in:
MILIČEVIĆ, Maja, LJUBEŠIĆ, Nikola. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0: empirical, applied and interdisciplinary research, 4/2, 2016. ISSN 2335-2736. http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
http://hdl.handle.net/11356/1120
http://hdl.handle.net/11356/1171
Jožef Stefan Institute
http://hdl.handle.net/11356/1096
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1121 2023-07-05T16:58:54Z hdl_11356_1023 hdl_11356_1024
Croatian Twitter training corpus ReLDI-NormTag-hr 1.1
2017-05-15T12:54:35Z
http://hdl.handle.net/11356/1121
Ljubešić, Nikola
Farkaš, Daša
Klubička, Filip
Erjavec, Tomaž
Miličević, Maja
Filko, Matea
Kranjčić, Denis
Dujmić, Barbara
2017-05-15T12:54:35Z
ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 1.0, 1.1 corrects some minor errors.
The corpus construction is (partially) described in:
MILIČEVIĆ, Maja, LJUBEŠIĆ, Nikola. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0: empirical, applied and interdisciplinary research, 4/2, 2016. ISSN 2335-2736. http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
http://hdl.handle.net/11356/1121
http://hdl.handle.net/11356/1170
Jožef Stefan Institute
http://hdl.handle.net/11356/1095
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
corpus
Text
oai:www.clarin.si:11356/1122 2023-03-27T17:01:19Z hdl_11356_1023 hdl_11356_1024
Slovene Grammars and Orthographic Dictionaries
2017-06-17T13:27:28Z
http://hdl.handle.net/11356/1122
Ahačič, Kozma
Dobrovoljc, Helena
Legan Ravnikar, Andreja
Merše, Majda
Furlan, Metka
Narat, Jožica
Marušič, Franc
Žaucer, Rok
Jelovšek, Alenka
Čepar, Metod
Trojar, Mitja
2017-06-17T13:27:28Z
The database contains 25 comprehensive and 25 basic descriptions of 139 Slovene grammars and orthographic dictionaries in book or web format in the period from 1584 to 2015.
http://hdl.handle.net/11356/1122
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
Slovene grammars
Slovene orthography
historical grammars
history of linguistics
syntactic description
lexicalConceptualResource
Text
oai:www.clarin.si:11356/1123 2023-07-05T16:58:54Z hdl_11356_1023 hdl_11356_1024
CMC training corpus Janes-Tag 2.0
2017-05-15T15:30:07Z
http://hdl.handle.net/11356/1123
Erjavec, Tomaž
Fišer, Darja
Čibej, Jaka
Arhar Holdt, Špela
Ljubešić, Nikola
Zupan, Katja
2017-05-15T15:30:07Z
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. As an update to version 1.2, 2.0 corrects some minor errors and includes named entity annotation.
A slightly older version of this corpus is described in:
ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf
Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1084.
http://hdl.handle.net/11356/1123
http://hdl.handle.net/11356/1238
Jožef Stefan Institute
http://hdl.handle.net/11356/1085
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
tokenisation
word normalisation
tagging
lemmatisation
manual annotation
TEI
named entities
corpus
Text
oai:www.clarin.si:11356/11242023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Concordances of Primož Trubar's "Ta evangeli sv. Matevža" (1555)
2018-01-27T12:15:54Z
http://hdl.handle.net/11356/1124
Jakopin, Primož
Ahačič, Kozma
2018-01-27T12:15:54Z
The 23603 concordances represent a transcription of the book "Ta evangeli sv. Matevža" (1555) by Primož Trubar.
http://hdl.handle.net/11356/1124
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
16th century Slovenian language
historical language
concordances
corpus
Text
oai:www.clarin.si:11356/11252023-03-27T17:01:18Zhdl_11356_1023hdl_11356_1024
Speech Database of Spoken Flight Information Enquiries SOFES 1.0
2017-06-21T12:43:30Z
http://hdl.handle.net/11356/1125
Dobrišek, Simon
Žganec Gros, Jerneja
Žibert, Janez
Mihelič, France
Pavešić, Nikola
2017-06-21T12:43:30Z
The SOFES speech database (Spoken Flight Enquiries in Slovene) is a collection of transcribed and segmented audio recordings of spoken flight-information enquiries in Slovene. SOFES is built on the basis of the GOPOLIS speech database, which was acquired and compiled by the members of LUKS at the Faculty of Electrical Engineering, University of Ljubljana in the period 1996–1998. The main purpose of the GOPOLIS speech database was the development of an automatic spoken-dialogue system for users who are enquiring about flight information over the telephone. The content of SOFES is, however, sufficiently diverse to allow for the development of more generalized acoustic models of spoken Slovene, which are the key components of various speech technologies, such as speech recognizers and speech synthesizers, as well as biometric speaker-recognition systems, etc.
http://hdl.handle.net/11356/1125
Faculty of Electrical Engineering, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
speech database
speech recognition
spoken corpus
speech transcription
TEI
corpus
Text
oai:www.clarin.si:11356/11272023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Words of the 16th-Century Slovenian Literary Language
2017-07-01T10:51:30Z
http://hdl.handle.net/11356/1127
Ahačič, Kozma
Legan Ravnikar, Andreja
Merše, Majda
Narat, Jožica
Novak, France
2017-07-01T10:51:30Z
This dictionary provides comprehensive information on the vocabulary used in the Slovenian literary language during the period of the Reformation. It was written based on complete concordance from all editions of Slovenian texts from the period 1550-1603. The word entries are accompanied by grammatical information, such as the part of speech used and other grammatical data. The extent of their use is shown by the attributed sources. The features of the linguistic system of that period are also shown by numerous notices regarding written, phonological and morphological variations.
This dictionary was published as a printed book:
Ahačič, Kozma, Legan Ravnikar, Andreja, Merše, Majda, Narat, Jožica, Novak, France. Besedje slovenskega knjižnega jezika 16. stoletja. Ljubljana : Založba ZRC, ZRC SAZU, 2011. ISBN 978-961-254-252-8.
http://hdl.handle.net/11356/1127
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
dictionary
historical language
lexicography
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11282023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Dictionary of Slovenian Particles
2017-08-04T17:21:28Z
http://hdl.handle.net/11356/1128
Žele, Andreja
2017-08-04T17:21:28Z
The dictionary describes the particles in the Slovenian language. It contains 429 entries with information on variants, dynamic and tonal accent, particle type, the meaning and etymology.
The relation to what is being said and the circumstances of the spoken situation are expressed particularly through particles, which is why they are functionally very lively language components of everyday communication. With their semantic-contextual role they actualise what is worded and at the same time condense the message. The particle is one of those non-parts-of-speech that fulfils the textual role of the connector and is, more particularly, ranked among inter-predicate connectors or the connectors in supra-predicate texts. Since particles play primarily a textual role, they are also particularly meaningful words, which can be reasonably used in a text, especially in one’s first language; they maintain a strong communicative (connective) role, and with this a well-marked role of influence. From the communicative-pragmatic perspective, particles are divided into two main categories, namely the connecting (text) particles resulting from pragmatic circumstances, and mood (interpersonal) particles resulting from communicative relationships. Mood particles focus either on the participants, the circumstances, the verbal process or the quantity, e.g. bogvaruj, končno, dejansko, baje, nikar, while the connecting particles highlight textual coherence and cohesion, e.g. celo, kaj šele, drugače, sicer pa, torej, etc. The most comprehensive and functional semantic-circumstantial evaluation of particles can be found in lexical representation.
This dictionary was published as a printed book:
Žele, Andreja. Slovar slovenskih členkov. Ljubljana : Založba ZRC, ZRC SAZU, 2014. ISBN 978-961-254-718-9.
http://hdl.handle.net/11356/1128
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
dictionary
lexicography
particles
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11292023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Dictionary of Slovenian Phrasemes
2017-07-01T10:43:50Z
http://hdl.handle.net/11356/1129
Keber, Janez
2017-07-01T10:43:50Z
The 3,002 entries of this dictionary cover the description and explanation of 13,125 Slovenian phrasemes. The use of phrasemes is represented by citations from lexical files built for the Dictionary of the Slovenian Standard Language as well as from the Nova beseda corpus and other resources. The entries also contain etymological information and equivalents in other languages.
This dictionary was published as a printed book:
Keber, Janez. Slovar slovenskih frazemov. Ljubljana : Založba ZRC, ZRC SAZU, 2011. ISBN 978-961-254-329-7.
http://hdl.handle.net/11356/1129
ZRC SAZU
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
dictionary
phraseology
lexicography
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11302023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Dictionary of Lesser Used Slovenian Words
2017-08-30T08:47:18Z
http://hdl.handle.net/11356/1130
Bokal, Milka
Hajnšek-Holz, Milena
Humar, Marjeta
Jakopin, Franc
Praznik, Zvonka
Šircelj Žnidaršič, Ivanka
Kostanjevec, Polona
Žele, Andreja
Nartnik, Vlado
Keber, Janez
Košmrlj-Levačič, Borislava
2017-08-30T08:47:18Z
Dictionary of Lesser Used Slovenian Words contains 178457 headwords not included in the Dictionary of the Slovenian Standard Language. Information on inflection, part of speech and source is included in the entries.
This dictionary was published as a printed book:
Bokal, Ljudmila, Hajnšek-Holz, Milena, Humar, Marjeta, Jakopin, Franc, Praznik, Zvonka. Besedišče slovenskega jezika : po kartoteki za slovar sodobnega knjižnega jezika zbrane besede, ki niso bile sprejete v Slovar slovenskega knjižnega jezika. 1: A-N, 2: O-Ž. Ljubljana: ZRC SAZU, 1987. Internal edition.
Šircelj-Žnidaršič, Ivanka, Hajnšek-Holz, Milena, Kostanjevec, Polona, Žele, Andreja, Humar, Marjeta, Nartnik, Vlado, Keber, Janez, Košmrlj-Levačič, Borislava, Jakopin, Primož. Besedišče slovenskega jezika z oblikoslovnimi podatki : A - Ž : po gradivu za slovar sodobnega knjižnega jezika zbrane besede, ki niso bile sprejete v Slovar slovenskega knjižnega jezika. Ljubljana: ZRC SAZU, Založba ZRC SAZU, 1998. ISBN 961-6182-62-5.
http://hdl.handle.net/11356/1130
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
dictionary
lexicography
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11352023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Brexit stance annotated tweets
2017-07-12T09:57:01Z
http://hdl.handle.net/11356/1135
Grčar, Miha
Cherepnalkoski, Darko
Mozetič, Igor
Kralj Novak, Petra
2017-07-12T09:57:01Z
The corpus contains over 4.5 million tweets (tweet IDs) automatically labeled by a machine learning program with stance regarding Brexit: Positive (supporting Brexit), Negative (opposing Brexit), or Neutral (uncommitted).
The Brexit referendum was held on June 23, 2016, to decide whether the UK should leave or remain in the EU. In the weeks before the referendum, starting on May 12, the UK geo-located Brexit-related tweets were continuously collected resulting in a dataset of around 4.5 million (4,508,440) tweets from almost one million (998,054) users. A large sample of the collected tweets (35,000) was manually labeled for the stance of their authors regarding Brexit: Positive (supporting Brexit), Negative (opposing Brexit), or Neutral (uncommitted). The labeled tweets were used to train a classifier which then automatically labeled all the remaining tweets.
The corpus contains tweet IDs and stance labels. The tweets are grouped into files, one hour per file. In each file, one row represents one entry (twitter_id, sentiment_label). Rows are ordered by tweet time.
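A minimal sketch of how such rows might be read and tallied. The comma delimiter, the sample tweet IDs and the exact spelling of the labels in the files are assumptions made for illustration; the record states only that each row holds a tweet ID and a stance label.

```python
import csv
from collections import Counter
from typing import Iterable


def count_stances(rows: Iterable[str], delimiter: str = ",") -> Counter:
    """Count stance labels in rows of the form twitter_id<delimiter>sentiment_label.

    The delimiter is an assumption; adjust it to match the actual files.
    """
    counts: Counter = Counter()
    for record in csv.reader(rows, delimiter=delimiter):
        if len(record) != 2:
            continue  # skip blank or malformed rows
        _tweet_id, label = record
        counts[label.strip()] += 1
    return counts


# Hypothetical sample rows (invented IDs, not taken from the corpus).
sample = [
    "741000000000000001,Positive",
    "741000000000000002,Negative",
    "741000000000000003,Positive",
]
stance_counts = count_stances(sample)
print(stance_counts)
```

Because the files are split by hour, the same function could be applied file by file to follow how the stance distribution changes over time.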
The data collection, annotation, model training and performance estimation are described in detail in:
Miha Grčar, Darko Cherepnalkoski, Igor Mozetič, Petra Kralj Novak:
Stance and influence of Twitter users regarding the Brexit referendum.
Computational Social Networks 4/6. 2017. http://dx.doi.org/10.1186/s40649-017-0042-6
http://hdl.handle.net/11356/1135
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
Twitter
Brexit
stance
computer-mediated communication
corpus
Text
oai:www.clarin.si:11356/11362023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Dictionary of the Slovenian Normative Guide (2001)
2019-09-24T07:17:56Z
http://hdl.handle.net/11356/1136
Dular, Janez
Hajnšek-Holz, Milena
Jakopin, Franc
Moder, Janko
Toporišič, Jože
Ahlin, Martin
Bokal, Ljudmila
Gložančev, Alenka
Keber, Janez
Lazar, Branka
Praznik, Zvonka
Snoj, Jerica
Vojnovič, Nastja
Suhadolnik, Stane
Weiss, Peter
Nartnik, Vlado
2019-09-24T07:17:56Z
The dictionary part of the Slovenian Normative Guide (first published in 2001) is a normative orthographic dictionary of the Slovenian standard language. In 92,617 entries it contains 140,266 lemmas and sublemmas. The entries contain information on spelling, pronunciation, inflection, part of speech, normative information, synonyms and valency, while semantic identifications, labels and usage examples are also provided.
This dictionary was published as a printed book:
Toporišič, Jože; Jakopin, Franc; Moder, Janko; Dular, Janez; Suhadolnik, Stane; Menart, Janez; Pogorelec, Breda; Gantar, Kajetan; Ahlin, Martin; Hajnšek - Holz, Milena; Bokal, Ljudmila; Gložančev, Alenka; Keber, Janez; Lazar, Branka; Praznik, Zvonka; Snoj, Jerica. Slovenski pravopis. Ljubljana: Založba ZRC, ZRC SAZU, 2001. ISBN 961-6358-37-5.
http://hdl.handle.net/11356/1136
ZRC SAZU
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
dictionary
Slovene orthography
lexicography
normativism
codification
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11372023-07-05T16:58:56Zhdl_11356_1023hdl_11356_1024
Wikipedia talk corpus Janes-Wiki 1.0
2017-08-31T07:08:15Z
http://hdl.handle.net/11356/1137
Ljubešić, Nikola
Erjavec, Tomaž
Fišer, Darja
2017-08-31T07:08:15Z
Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured into individual pages and their comments, together with their metadata. The texts in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities.
http://hdl.handle.net/11356/1137
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
Wikipedia
word normalisation
named entities
TEI
corpus
Text
oai:www.clarin.si:11356/11382023-07-05T16:58:56Zhdl_11356_1023hdl_11356_1024
Blog post and comment corpus Janes-Blog 1.0
2017-08-31T07:13:40Z
http://hdl.handle.net/11356/1138
Erjavec, Tomaž
Ljubešić, Nikola
Fišer, Darja
2017-08-31T07:13:40Z
Janes-Blog is an annotated corpus of Slovene blogs from websites rtvslo.si and publishwall.si from the period 2006-10 to 2016-01. The corpus is structured into individual texts containing the post of the blog and comments on the post, together with their metadata. The texts in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities. Due to protection of privacy, usernames are not included in the metadata and 'person' as well as 'person derivative' named entities have been removed from the texts.
http://hdl.handle.net/11356/1138
Jožef Stefan Institute
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
computer-mediated communication
blogs
word normalisation
named entities
TEI
corpus
Text
oai:www.clarin.si:11356/11392023-07-05T16:58:56Zhdl_11356_1023hdl_11356_1024
Forum corpus Janes-Forum 1.0
2017-08-31T07:16:34Z
http://hdl.handle.net/11356/1139
Erjavec, Tomaž
Ljubešić, Nikola
Fišer, Darja
2017-08-31T07:16:34Z
Janes-Forum is an annotated corpus of Slovene forums from websites med.over.net, avtomobilizem.com, and kvarkadabra.net from the period 2001-02 to 2015-01. The corpus is structured into forums, threads and posts, together with their metadata. The texts in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities. Due to protection of privacy and compliance with wishes of platform owners, usernames are not included in the metadata, and 'person', 'person derivative' and 'company name' named entities have been removed from the texts.
http://hdl.handle.net/11356/1139
Jožef Stefan Institute
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
computer-mediated communication
forums
word normalisation
named entities
TEI
corpus
Text
oai:www.clarin.si:11356/11402023-07-05T16:58:57Zhdl_11356_1023hdl_11356_1024
News comment corpus Janes-News 1.0
2017-08-31T07:19:08Z
http://hdl.handle.net/11356/1140
Erjavec, Tomaž
Ljubešić, Nikola
Fišer, Darja
2017-08-31T07:19:08Z
Janes-News is an annotated corpus of comments on online news articles from websites rtvslo.si, mladina.si, and reporter.si from the period 2007-03 to 2015-01. The corpus is structured into individual texts containing the comments on a news article, together with their metadata. The texts in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities. Due to protection of privacy, usernames are not included in the metadata and 'person' as well as 'person derivative' named entities have been removed from the texts.
http://hdl.handle.net/11356/1140
Jožef Stefan Institute
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
computer-mediated communication
news comments
word normalisation
named entities
TEI
corpus
Text
oai:www.clarin.si:11356/11412023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Beseda Corpus Lemmatisation Lexicon
2017-09-25T10:21:07Z
http://hdl.handle.net/11356/1141
Jakopin, Primož
2017-09-25T10:21:07Z
The Beseda Corpus Lemmatisation Lexicon for the Slovenian language was generated at the Fran Ramovš Institute of Slovenian Language, primarily through inflection of open class words from the Dictionary of Standard Slovenian (Slovar slovenskega knjižnega jezika), augmented by wordforms, their part of speech tags and their lemmas used during the PoS tagging and lemmatization of the Beseda corpus. The corpus was initially (2000) composed of 1 million words from the following texts:
Ciril Kosmač Opus - 408,000 words
Tomo Križnar: O iskanju ljubezni / On Search for Love or Around the World by Bicycle - 132,000 words
George Orwell: 1984 / 1984 - 91,000 words
Plato: Država / Republic - 93,000 words
Sveto pismo Nove zaveze / The Bible - New Testament - 150,000 words
Gustave Flaubert: Bouvard in Pécuchet / Bouvard and Pécuchet - 86,000 words
Časopis DELO na internetu (vzorec iz 6.5.1997 - 17.6.1997) / Newspaper DELO on Internet (a sample from 5/6/1997 - 6/17/1997) - 52,000 words
After 2000 the following texts were added:
Marko Uršič: Štirje časi / Four Seasons - 171,000 words
Državni zbor RS 3. sklica - dobesedni zapisi sej: 29. redna seja, zasedanje 01.10.2003 / National Assembly of the Republic of Slovenia - session transcripts: 29th regular session, meeting of 10/1/2003 - 47,000 words
Časopis DELO za 3.1.2004 / Newspaper DELO for 1/3/2004 - 75,000 words
to round the corpus to 1,300,000 words.
The current lexicon was taken from the database of the online "Determination of Lemmas and PoS Tags for a List of Words" service at the Institute, available through the web page http://bos.zrc-sazu.si/dol_lem1.html. Wordform frequencies were compiled from the latest update of the above-mentioned corpus (version 138, 1,300,626 words, August 2017) and are therefore approximate.
The lexicon is UTF-8 coded and has 3,228,128 lines, each containing the following 4 tab-separated data fields:
1. wordform
2. lemma (102,346 different lemmas)
3. PoS tag (explained at http://bos.zrc-sazu.si/bibliografija/o_oznake.html - in Slovenian)
4. approximate corpus frequency; wordform-lemma-PoS entries not in corpus have zero frequency
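The four-field layout above could be read along the following lines. This is a sketch only: the sample lines, the placeholder tag "S" and the assumption that the frequency field is a plain integer are illustrative and not taken from the lexicon itself.

```python
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    wordform: str
    lemma: str
    pos_tag: str
    frequency: int  # approximate corpus frequency; 0 if not attested in the corpus


def parse_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield one entry per tab-separated line with exactly four fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        wordform, lemma, pos_tag, freq = fields
        yield LexiconEntry(wordform, lemma, pos_tag, int(freq))


# Hypothetical sample lines; "S" is a placeholder, not a documented tag.
sample = [
    "hiše\thiša\tS\t12\n",
    "hiša\thiša\tS\t30\n",
    "hišica\thišica\tS\t0\n",
]
entries = list(parse_lexicon(sample))
attested = [e for e in entries if e.frequency > 0]  # entries seen in the corpus
lemmas = {e.lemma for e in entries}                 # distinct lemmas
print(len(entries), len(attested), sorted(lemmas))
```

With the real file, the lines would come from an open file handle with `encoding="utf-8"`; iterating lazily avoids holding all 3,228,128 lines in memory at once.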
http://hdl.handle.net/11356/1141
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
morphology
inflection
word forms
lemmatisation
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11422023-07-05T16:58:57Zhdl_11356_1023hdl_11356_1024
Twitter corpus Janes-Tweet 1.0
2017-09-05T14:23:23Z
http://hdl.handle.net/11356/1142
Ljubešić, Nikola
Erjavec, Tomaž
Fišer, Darja
2017-09-05T14:23:23Z
Janes-Tweet is an annotated corpus of almost 10 million tweets posted from 2013-06 to 2017-06 by approx. 9,000 users that tweet mostly in Slovene. The corpus is structured into individual tweets, together with their metadata. The tweets in the corpus are tokenised, sentence segmented, word normalised, morphosyntactically tagged, lemmatised and annotated with named entities.
Due to Twitter terms-of-service, the corpus is distributed in an encoded version. The included tweetpub program (also available and documented on https://github.com/clarinsi/tweetpub) should be used to decode it, which it does by fetching the original tweets and applying a diff operation on the distributed corpus. Note that the retrieved corpus can have fewer tweets than the distributed version if some have been removed from Twitter by their authors in the meantime.
http://hdl.handle.net/11356/1142
Jožef Stefan Institute
Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
computer-mediated communication
Twitter
word normalisation
named entities
corpus
Text
oai:www.clarin.si:11356/11502023-06-12T10:47:50Zhdl_11356_1023hdl_11356_1024
Developmental corpus of Slovene (without language corrections) Šolar-Clear
2018-11-21T16:52:08Z
http://hdl.handle.net/11356/1150
Rozman, Tadeja
Stritar Kučuk, Mojca
Kosem, Iztok
Krek, Simon
Krapš Vodopivec, Irena
Arhar Holdt, Špela
Stabej, Marko
Laskowski, Cyprian
Klemenc, Bojan
2018-11-21T16:52:08Z
Šolar-Clear is an adapted version of the Šolar 1.0 corpus, cf. http://hdl.handle.net/11356/1036.
The Šolar(-Clear) corpus consists of texts written by students in Slovene primary and secondary schools. School essays form the majority of the corpus (64.2%) while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications etc.
Unlike the original Šolar corpus, Šolar-Clear only includes student texts, while language corrections and other types of feedback from the teachers are not included. The corpus can thus be used for processing tasks where the inclusion of corrections hinders or complicates the procedures (e.g. for comparative data extraction, training of language models, etc.).
http://hdl.handle.net/11356/1150
http://hdl.handle.net/11356/1219
Trojina, Institute for Applied Slovene Studies
http://hdl.handle.net/11356/1036
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
student writing
developmental corpus
corpus
Text
oai:www.clarin.si:11356/11542023-07-05T16:58:58Zhdl_11356_1023hdl_11356_1024
Tweet code-switching corpus Janes-Preklop 1.0
2017-10-18T18:28:09Z
http://hdl.handle.net/11356/1154
Reher, Špela
Erjavec, Tomaž
Fišer, Darja
2017-10-18T18:28:09Z
Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance), according to the supplied typology. Words in the corpus are also automatically tagged with MSDs and lemmas.
http://hdl.handle.net/11356/1154
Jožef Stefan Institute
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
computer-mediated communication
Twitter
code-switching
TEI
manual annotation
corpus
Text
oai:www.clarin.si:11356/11552023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Nova Beseda Frequency Lexicon
2017-09-25T08:53:32Z
http://hdl.handle.net/11356/1155
Jakopin, Primož
2017-09-25T08:53:32Z
Nova beseda Frequency Lexicon was compiled from the Nova beseda text corpus at the Fran Ramovš Institute of Slovenian Language with hyphen characters unified and with leading and trailing non-breaking spaces deleted.
Unlike most other Slovenian corpora, Nova beseda texts were pre-processed before inclusion. Typos and words with superfluous hyphens, originating from false line joinings, were corrected, and parts of texts in foreign, non-Slovenian languages were marked up and excluded from the lexicon.
The corpus contains 318 million tokens, mostly wordforms. It is available for search through the web page http://bos.zrc-sazu.si/a_beseda.html, where wordform search is reached by selecting "word seach" in the right hand side "What to do?" column. On the mentioned web page the corpus structure is also explained.
The lexicon is UTF-8 coded, has 2,251,151 lines, each containing the following 2 data fields, tab separated:
1. token, Slovenian: pojavnica.
The vast majority of tokens are wordforms, also included are numbers and selected multiword units such as URLs, e-mail addresses, place names like New York, car plates, ID numbers.
2. frequency, Slovenian: pogostnost.
The sum of all frequencies is 318,170,212.
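As an illustration of the two-field layout, the sketch below loads token frequencies and derives a relative frequency. The sample lines and counts are invented; only the tab-separated token/frequency structure comes from the description above. Splitting on the last tab is a defensive choice here, since tokens such as "New York" contain spaces.

```python
from typing import Dict, Iterable


def load_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Map each token to its frequency, reading token<TAB>frequency lines."""
    freqs: Dict[str, int] = {}
    for line in lines:
        token, sep, freq = line.rstrip("\r\n").rpartition("\t")
        if not sep:
            continue  # no tab found: blank or malformed line
        freqs[token] = int(freq)
    return freqs


# Hypothetical sample lines with invented counts.
sample = ["in\t5\n", "New York\t2\n", "je\t3\n"]
freqs = load_frequencies(sample)
total = sum(freqs.values())
relative_in = freqs["in"] / total  # share of all tokens accounted for by "in"
print(total, relative_in)
```

On the full lexicon, `total` should come out as the stated sum of 318,170,212, which gives a quick integrity check after download.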
http://hdl.handle.net/11356/1155
ZRC SAZU
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
word forms
lexicon
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11562023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Automatically stress labelled morphological lexicon Sloleks 1.2
2017-09-28T08:35:07Z
http://hdl.handle.net/11356/1156
Krsnik, Luka
Robnik-Šikonja, Marko
Šef, Tomaž
2017-09-28T08:35:07Z
This lexicon is an extended version of Sloleks 1.2 (http://hdl.handle.net/11356/1039). It contains all the original data from Sloleks with added information about the stress of each word form, which is included in two ways: information about stress location only, and information about stress location and type. The stress assignment was performed automatically, with algorithms based on deep neural networks, which correctly predicted accent location in around 90 % and combined accent type and location in about 87.5 % of test data. Therefore, not all accents are correct.
http://hdl.handle.net/11356/1156
http://hdl.handle.net/11356/1186
Faculty of Computer and Information Science, University of Ljubljana
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
word stress
lexicalConceptualResource
Text
oai:www.clarin.si:11356/11582023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Spoken corpus Gos VideoLectures 2.0 (transcription)
2017-10-12T13:35:27Z
http://hdl.handle.net/11356/1158
Verdonik, Darinka
Potočnik, Tomaž
Sepesy Maučec, Mirjam
Erjavec, Tomaž
2017-10-12T13:35:27Z
Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech.
The Gos VideoLectures corpus contains a selection of public lectures available through the web portal Videolectures.net provided by the Jožef Stefan Institute, and covers 9.8 hours of speech.
This resource contains only annotated transcriptions of the corpus – audio recordings are available at http://hdl.handle.net/11356/1159.
All transcriptions for Gos VideoLectures were done manually and carefully checked. The main guidelines for transcription were those of the Gos corpus (http://www.korpus-gos.net/Support/About). The transcription tool Transcriber 1.5.1 (http://trans.sourceforge.net/en/presentation.php) was used for making transcriptions. It can also be used for reading or exporting transcriptions (.trs files) to different formats.
The transcriptions comprise the TRS files with tabular metadata, their conversion to TEI and to the CWB vertical file format. Each recording has two TRS files, one with pronunciation-based and the other with the standardised/normalised transcription. The TEI and CWB encodings join these two transcriptions at the token level, with the normalised words being also automatically PoS tagged and lemmatised.
The corpus can be used for training continuous speech recognition for the Slovene language, for phonetic research or for any other research on Slovene academic speech.
http://hdl.handle.net/11356/1158
http://hdl.handle.net/11356/1190
Faculty of Electrical Engineering and Computer Science, University of Maribor
http://hdl.handle.net/11356/1069
Creative Commons - Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
speech database
spoken corpus
academic speech
speech transcription
speech recognition
TEI
corpus
Text
oai:www.clarin.si:11356/11592023-03-27T17:01:19Zhdl_11356_1023hdl_11356_1024
Spoken corpus Gos VideoLectures 2.0 (audio)
2017-10-13T09:29:26Z
http://hdl.handle.net/11356/1159
VideoLectures.NET
2017-10-13T09:29:26Z
Gos VideoLectures is an add-on to the Gos reference corpus of spoken Slovene (http://hdl.handle.net/11356/1040), and covers public academic speech. The Gos VideoLectures corpus contains a selection of public lectures available through the web portal Videolectures.net provided by the Jožef Stefan Institute, and covers 9.8 hours of speech.
This resource contains only audio recordings of the corpus – annotated transcriptions are available at http://hdl.handle.net/11356/1158.
http://hdl.handle.net/11356/1159
http://hdl.handle.net/11356/1189
VideoLectures.NET
http://hdl.handle.net/11356/1070
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
https://creativecommons.org/licenses/by-nc-nd/4.0/
speech database
spoken corpus
academic speech
speech recognition
speech recordings
corpus
Text
oai:www.clarin.si:11356/11652023-06-20T12:44:21Zhdl_11356_1023hdl_11356_1024
Training corpus ssj500k 2.0
2017-11-23T21:36:56Z
http://hdl.handle.net/11356/1165
Krek, Simon
Dobrovoljc, Kaja
Erjavec, Tomaž
Može, Sara
Ledinek, Nina
Holz, Nanika
Zupan, Katja
Gantar, Polona
Kuzman, Taja
2017-11-23T21:36:56Z
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions.
The annotations of the ssj500k corpus follow (1) the MULTEXT-East V5 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V5/msd/, (2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Janes Annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (4) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.0/
The vocabulary of (1) and (2) is provided in the back element and (3) and (4) in the teiHeader of the TEI encoded corpus.
http://hdl.handle.net/11356/1165
http://hdl.handle.net/11356/1181
Centre for Language Resources and Technologies, University of Ljubljana
http://hdl.handle.net/11356/1052
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
tagging
dependency treebank
parsing
named entities
tokenisation
manual annotation
TEI
verbal multiword expressions
corpus
Text
oai:www.clarin.si:11356/11662024-02-06T11:39:13Zhdl_11356_1023hdl_11356_1024
Thesaurus of Modern Slovene 1.0
2018-03-25T12:06:22Z
http://hdl.handle.net/11356/1166
Krek, Simon
Laskowski, Cyprian
Robnik-Šikonja, Marko
Kosem, Iztok
Arhar Holdt, Špela
Gantar, Polona
Čibej, Jaka
Gorjanc, Vojko
Klemenc, Bojan
Dobrovoljc, Kaja
2018-03-25T12:06:22Z
This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a monolingual dictionary, and a corpus. A network analysis on the bilingual dictionary word co-occurrence graph was used, together with additional information from the distributional thesaurus data available as part of the Sketch Engine tool and extracted from the 1.2 billion word Gigafida corpus and the monolingual dictionary.
http://hdl.handle.net/11356/1166
http://hdl.handle.net/11356/1916
Centre for Language Resources and Technologies, University of Ljubljana
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
thesaurus
lexicalConceptualResource
Text