dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.date.accessioned | 2016-12-30T13:53:05Z |
dc.date.available | 2016-12-30T13:53:05Z |
dc.date.issued | 2016-12-30 |
dc.identifier.uri | http://hdl.handle.net/11356/1084 |
dc.description | Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40 (https://nlp.fi.muni.cz/raslan/raslan16.pdf). Note that the corpus is also annotated with morphosyntactic descriptions and lemmas. These annotations are manual where the texts correspond to the Janes-Tag corpus (http://hdl.handle.net/11356/1085) and automatic for the other texts. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Norm |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-018-9425-z |
dc.relation.replaces | http://hdl.handle.net/11356/1083 |
dc.relation.replaces | http://hdl.handle.net/11356/1080 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1733 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | CMC training corpus Janes-Norm 1.2 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
sponsor | Swiss National Science Foundation 160501 ReLDI Other |
size.info | 7816 texts |
size.info | 184755 tokens |
files.count | 4 |
files.size | 4203692 |
featuredService.kontext | Search|https://www.clarin.si/kontext/first_form?corpname=janes_norm |
featuredService.noske | Search|https://www.clarin.si/ske/#dashboard?corpname=janes_norm |
Files in this item
Download all files in item (4.01 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name
- RASLAN16-Janes.pdf
- Size
- 210.85 KB
- Format
- Description
- RASLAN'16 paper describing the corpus
- MD5
- 7487b904191a41f8cb38bbdfe12ba14e

- Name
- Janes-smernice-v1.0.pdf
- Size
- 339.69 KB
- Format
- Description
- Annotation Guidelines (in Slovene)
- MD5
- 39845f938d68b0e3330259eab31c6043

- Name
- Janes-Norm.zip
- Size
- 2.1 MB
- Format
- application/zip
- Description
- Corpus in TEI format
- MD5
- 2a67d77dfecdacfca5d0b9268f26648c
- Janes-Norm
  - janes.norm.xml (12 kB)
  - msd-fslib-sl.xml (461 kB)
  - janes.norm.body.xml (14 MB)
  - schema
    - tei_janes_doc.html (2 MB)
    - tei_janes.rng (399 kB)
    - tei_janes_schema.xml (2 kB)
    - tei_janes.zip (44 kB)
    - tei_janes.rnc (188 kB)

- Name
- Janes-Norm.vert.zip
- Size
- 1.37 MB
- Format
- application/zip
- Description
- Derived corpus in vertical format
- MD5
- 346a866ce28ed4bd095ba1115c3f29dd
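
The MD5 values listed for each file can be used to check that a download arrived intact. The following is a minimal sketch, assuming the four files were saved under their listed names in a single local directory. The helper names `md5_of` and `check_downloads` are illustrative and are not part of any CLARIN.SI tooling.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this item.
EXPECTED_MD5 = {
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "Janes-Norm.zip": "2a67d77dfecdacfca5d0b9268f26648c",
    "Janes-Norm.vert.zip": "346a866ce28ed4bd095ba1115c3f29dd",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        # Assumption: files keep the names shown in the listing.
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in check_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as missing or mismatched is most likely an incomplete download and should be fetched again before unpacking.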