dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.date.accessioned | 2016-12-18T16:00:18Z |
dc.date.available | 2016-12-22T10:10:03Z |
dc.date.issued | 2016-12-22 |
dc.identifier.uri | http://hdl.handle.net/11356/1079 |
dc.description | Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1080. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1081 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | tagging |
dc.subject | lemmatisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | CMC training corpus Janes-Tag 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
size.info | 2958 texts |
size.info | 75276 tokens |
files.count | 4 |
files.size | 2244018 |
Files in this item
This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Name | Size | Format | Description | MD5 |
RASLAN16-Janes.pdf | 210.85 KB | | RASLAN'16 paper describing the corpus | 7487b904191a41f8cb38bbdfe12ba14e |
Janes-smernice-v1.0.pdf | 339.69 KB | | Annotation Guidelines (in Slovene) | 39845f938d68b0e3330259eab31c6043 |
Janes-Tag.vert.zip | 536.93 KB | application/zip | Corpus in vertical format | 6919d06b5011bb6b8cb739eb11271dc1 |
Janes-Tag.zip | 1.08 MB | application/zip | Corpus in TEI format | e484a07e41606415f801e749f9efd6ff |
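
The MD5 values in the file listing above can be used to check that a download arrived intact. The following is a minimal sketch in Python, not part of the repository's own tooling: the helper names `md5_of_file` and `verify_downloads` are hypothetical, and it assumes the four files were saved under their listed names in a single local directory.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "Janes-Tag.vert.zip": "6919d06b5011bb6b8cb739eb11271dc1",
    "Janes-Tag.zip": "e484a07e41606415f801e749f9efd6ff",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (not found).

    Assumption: files keep the names shown in the listing.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Run from the download directory, the script prints one status line per file. A `MISMATCH` suggests a truncated or corrupted download, in which case fetching the file again is the simplest remedy.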