dc.contributor.author | Batanović, Vuk |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Samardžić, Tanja |
dc.contributor.author | Erjavec, Tomaž |
dc.date.accessioned | 2023-07-22T14:28:21Z |
dc.date.available | 2023-07-22T14:28:21Z |
dc.date.issued | 2023-06-13 |
dc.identifier.uri | http://hdl.handle.net/11356/1843 |
dc.description | The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, http://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch). |
dc.language.iso | srp |
dc.publisher | Regional Linguistic Data Initiative Centre ReLDI |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | http://www.aclweb.org/anthology/W17-1407 |
dc.relation.replaces | http://hdl.handle.net/11356/1200 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/reldi-data/SETimes.SRPlus |
dc.subject | part-of-speech tagging |
dc.subject | dependency treebank |
dc.subject | parsing |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | Serbian linguistic training corpus SETimes.SR 2.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Swiss National Science Foundation 160501 ReLDI Other |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 176 texts |
size.info | 4384 sentences |
size.info | 97673 tokens |
files.count | 4 |
files.size | 9861171 |
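The size.info figures above can be sanity-checked against the CoNLL-U files in this item. The following is a minimal sketch, not part of the original record: it assumes standard CoNLL-U conventions (comment lines start with `#`, sentences are separated by blank lines, and a plain integer in the first column marks a syntactic word), and the function names `count_conllu` and `count_conllu_file` are hypothetical. Whether the published totals were computed with exactly this counting rule is an assumption.

```python
import gzip


def count_conllu(lines):
    """Count sentences and integer-ID tokens in an iterable of CoNLL-U lines."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Comment or metadata line (sent_id, text, global.columns, ...).
            continue
        token_id = line.split("\t", 1)[0]
        # Skip multiword ranges such as "1-2" and empty nodes such as "2.1".
        if token_id.isdigit():
            tokens += 1
        in_sentence = True
    if in_sentence:
        # The file ended without a trailing blank line.
        sentences += 1
    return sentences, tokens


def count_conllu_file(path):
    """Open a plain or gzip-compressed CoNLL-U file and count its contents."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return count_conllu(handle)
```

For example, `count_conllu_file("set.sr.plus-train.conllu.gz")` returns a `(sentences, tokens)` pair for the training split once the file has been downloaded to the working directory.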
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name | Size | Format | Description | MD5 |
set.sr.plus.conllup | 8.21 MB | Unknown | CoNLL-U-Plus dataset | 5e5f3c9583418bbff1cbfc5e344bb21e |
set.sr.plus-train.conllu.gz | 924.97 KB | application/gzip | CoNLL-U training dataset | 39bfa2630c045e4d94aa7ee33e71a9d4 |
set.sr.plus-dev.conllu.gz | 150.68 KB | application/gzip | CoNLL-U development dataset | f890061031a2a51e58e946ed1eefbf59 |
set.sr.plus-test.conllu.gz | 142.91 KB | application/gzip | CoNLL-U test dataset | 96b52b7b589e6b765b5e13f1c885abd8 |
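
The MD5 values listed for the four files can be used to check that a download is intact. The sketch below is illustrative only: the file names and checksums are copied from this record, while the helper names `md5_of_file` and `verify_downloads` and the assumption that the files sit together in one local directory are hypothetical.

```python
import hashlib
import os

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "set.sr.plus.conllup": "5e5f3c9583418bbff1cbfc5e344bb21e",
    "set.sr.plus-train.conllu.gz": "39bfa2630c045e4d94aa7ee33e71a9d4",
    "set.sr.plus-dev.conllu.gz": "f890061031a2a51e58e946ed1eefbf59",
    "set.sr.plus-test.conllu.gz": "96b52b7b589e6b765b5e13f1c885abd8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, status in verify_downloads().items():
        label = {True: "OK", False: "MISMATCH", None: "missing"}[status]
        print(f"{name}: {label}")
```

Run as a script from the download directory, this prints one status line per file; a `MISMATCH` suggests a truncated or corrupted download.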