dc.contributor.author Batanović, Vuk
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Samardžić, Tanja
dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2023-07-22T14:28:21Z
dc.date.available 2023-07-22T14:28:21Z
dc.date.issued 2023-06-13
dc.identifier.uri http://hdl.handle.net/11356/1843
dc.description The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, http://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
dc.language.iso srp
dc.publisher Regional Linguistic Data Initiative Centre ReLDI
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://www.aclweb.org/anthology/W17-1407
dc.relation.replaces http://hdl.handle.net/11356/1200
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/reldi-data/SETimes.SRPlus
dc.subject part-of-speech tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.title Serbian linguistic training corpus SETimes.SR 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Swiss National Science Foundation 160501 ReLDI Other
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 176 texts
size.info 4384 sentences
size.info 97673 tokens
files.count 4
files.size 9861171
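
The size.info figures above (4384 sentences, 97673 tokens) could in principle be compared against the CoNLL-U splits listed among the files of this record. The following is a minimal Python sketch, assuming the files follow the standard CoNLL-U layout (tab-separated columns, blank line between sentences, `#` comment lines). The helper name `count_conllu` is hypothetical, and whether its totals match the record exactly depends on how the record's counts were computed.

```python
import gzip


def count_conllu(path):
    """Count sentences and word lines in a gzip-compressed CoNLL-U file.

    Assumption: standard CoNLL-U layout. Multiword-token ranges (IDs such
    as "1-2") and empty nodes (IDs such as "1.1") are not counted.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line closes the current sentence.
                in_sentence = False
                continue
            if line.startswith("#"):
                # Sentence-level comment or metadata line.
                continue
            token_id = line.split("\t", 1)[0]
            if not token_id.isdigit():
                # Skip multiword-token ranges and empty nodes.
                continue
            if not in_sentence:
                sentences += 1
                in_sentence = True
            tokens += 1
    return sentences, tokens
```

Calling `count_conllu("set.sr.plus-train.conllu.gz")` on a downloaded split returns a `(sentences, tokens)` pair; summing the pairs for the training, development, and test files gives totals to compare with the figures in this record.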


Files in this item

Name set.sr.plus.conllup
Size 8.21 MB
Format Unknown
Description CoNLL-U-Plus dataset
MD5 5e5f3c9583418bbff1cbfc5e344bb21e

Name set.sr.plus-train.conllu.gz
Size 924.97 KB
Format application/gzip
Description CoNLL-U training dataset
MD5 39bfa2630c045e4d94aa7ee33e71a9d4

Name set.sr.plus-dev.conllu.gz
Size 150.68 KB
Format application/gzip
Description CoNLL-U development dataset
MD5 f890061031a2a51e58e946ed1eefbf59

Name set.sr.plus-test.conllu.gz
Size 142.91 KB
Format application/gzip
Description CoNLL-U test dataset
MD5 96b52b7b589e6b765b5e13f1c885abd8
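
The MD5 values listed for the four files above can be used to check that a download is intact. The following is a minimal Python sketch, assuming the files were saved under their listed names in a single directory. The helper names `md5_of` and `verify_downloads` are hypothetical and not part of any CLARIN.SI tooling.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file list in this record.
EXPECTED_MD5 = {
    "set.sr.plus.conllup": "5e5f3c9583418bbff1cbfc5e344bb21e",
    "set.sr.plus-train.conllu.gz": "39bfa2630c045e4d94aa7ee33e71a9d4",
    "set.sr.plus-dev.conllu.gz": "f890061031a2a51e58e946ed1eefbf59",
    "set.sr.plus-test.conllu.gz": "96b52b7b589e6b765b5e13f1c885abd8",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch),
    or None (file not found in the given directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

MD5 is used here only because it is the checksum this record publishes; it detects accidental corruption but is not a safeguard against deliberate tampering.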
