dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Zupan, Katja |
dc.date.accessioned | 2017-05-15T15:30:07Z |
dc.date.available | 2017-05-15T15:30:07Z |
dc.date.issued | 2017-05-14 |
dc.identifier.uri | http://hdl.handle.net/11356/1123 |
dc.description | Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is intended as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. Because the corpus has been carefully annotated by hand, it is also suitable for detailed linguistic studies that require highly accurate and reliable annotations. Version 2.0 updates version 1.2 by correcting some minor errors and adding named entity annotation. A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40 (https://nlp.fi.muni.cz/raslan/raslan16.pdf). Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1084. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | http://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Tag |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-018-9425-z |
dc.relation.replaces | http://hdl.handle.net/11356/1085 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1238 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | tagging |
dc.subject | lemmatisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.subject | named entities |
dc.title | CMC training corpus Janes-Tag 2.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
sponsor | Swiss National Science Foundation 160501 ReLDI Other |
sponsor | ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds |
size.info | 2958 texts |
size.info | 75276 tokens |
files.count | 7 |
files.size | 4011388 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)




- Name
- Janes-Tag.zip
- Size
- 1.09 MB
- Format
- application/zip
- Description
- Corpus in TEI format
- MD5
- b56533df4d243b2d3c20282083e95118

- Name
- Janes-Tag.vert.zip
- Size
- 560.42 KB
- Format
- application/zip
- Description
- Derived corpus in vertical format
- MD5
- 2ce844ec29c48f33099c909c515cd7a0

- Name
- Janes-Tag.vert.split.zip
- Size
- 554.52 KB
- Format
- application/zip
- Description
- Corpus split into train, dev and test vertical files
- MD5
- 8c5e88dc8a2ba848a2bd0c26a4afebeb

- Name
- RASLAN16-Janes.pdf
- Size
- 210.85 KB
- Format
- Description
- RASLAN'16 paper describing the corpus
- MD5
- 7487b904191a41f8cb38bbdfe12ba14e

- Name
- Janes-smernice-v1.0.pdf
- Size
- 339.69 KB
- Format
- Description
- Annotation Guidelines - main (in Slovene)
- MD5
- 39845f938d68b0e3330259eab31c6043

- Name
- SlovenianNER-slv-v1.1.pdf
- Size
- 549.4 KB
- Format
- Description
- Annotation Guidelines - named entities (in Slovene)
- MD5
- aac3c6f343bc350e1219c00ce232dd87

- Name
- SlovenianNER-eng-v1.1.pdf
- Size
- 590.05 KB
- Format
- Description
- Annotation Guidelines - named entities (in English)
- MD5
- d3ce678280a4adbe16dfac47a16fc233
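
The MD5 values listed for each file can be used to check that a download arrived intact. The following is a minimal sketch, assuming the seven files have been saved under their listed names in a single local directory; the helper names `md5_of` and `verify_downloads` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "Janes-Tag.zip": "b56533df4d243b2d3c20282083e95118",
    "Janes-Tag.vert.zip": "2ce844ec29c48f33099c909c515cd7a0",
    "Janes-Tag.vert.split.zip": "8c5e88dc8a2ba848a2bd0c26a4afebeb",
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "SlovenianNER-slv-v1.1.pdf": "aac3c6f343bc350e1219c00ce232dd87",
    "SlovenianNER-eng-v1.1.pdf": "d3ce678280a4adbe16dfac47a16fc233",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, status in verify_downloads().items():
        print(f"{name}: {status}")
```

A status of `mismatch` suggests a corrupted or incomplete download, in which case fetching the file again is the usual remedy.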