dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2016-12-30T14:02:38Z |
dc.date.available | 2016-12-30T14:02:38Z |
dc.date.issued | 2016-12-30 |
dc.identifier.uri | http://hdl.handle.net/11356/1085 |
dc.description | Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, available at https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available; cf. http://hdl.handle.net/11356/1084. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.replaces | http://hdl.handle.net/11356/1079 |
dc.relation.replaces | http://hdl.handle.net/11356/1081 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1123 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | tagging |
dc.subject | lemmatisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | CMC training corpus Janes-Tag 1.2 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
sponsor | Swiss National Science Foundation 160501 ReLDI Other |
size.info | 2958 texts |
size.info | 75276 tokens |
files.count | 5 |
files.size | 2805327 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name | Size | Format | Description | MD5 |
RASLAN16-Janes.pdf | 210.85 KB | | RASLAN'16 paper describing the corpus | 7487b904191a41f8cb38bbdfe12ba14e |
Janes-smernice-v1.0.pdf | 339.69 KB | | Annotation Guidelines (in Slovene) | 39845f938d68b0e3330259eab31c6043 |
Janes-Tag.zip | 1.08 MB | application/zip | Corpus in TEI format | 0741700ea329a7f58101009ccd74f5b0 |
Janes-Tag.vert.zip | 556.63 KB | application/zip | Derived corpus in vertical format | 3327740d9a2e6316345a4db276ccae27 |
Janes-Tag.vert.split.zip | 528.47 KB | application/zip | Corpus split into train, dev and test vertical files | 18cb1bacac4a53310e9614737259fa1e |
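
Since each file in this item is listed with an MD5 checksum, a downloaded copy can be checked against the record. The following is a minimal sketch rather than a tool provided by the repository: the function names `md5_of` and `verify_downloads` are hypothetical, and it assumes the five files were saved under their listed names in a single local directory.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "Janes-Tag.zip": "0741700ea329a7f58101009ccd74f5b0",
    "Janes-Tag.vert.zip": "3327740d9a2e6316345a4db276ccae27",
    "Janes-Tag.vert.split.zip": "18cb1bacac4a53310e9614737259fa1e",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, md5 in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == md5:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, status in verify_downloads(".").items():
        print(f"{status:8} {name}")
```

A status of `mismatch` would suggest an incomplete or corrupted download, or a file from a different version of the corpus.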