dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.date.accessioned | 2016-12-28T11:41:07Z |
dc.date.available | 2016-12-28T11:41:07Z |
dc.date.issued | 2016-12-28 |
dc.identifier.uri | http://hdl.handle.net/11356/1083 |
dc.description | Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. The corpus is also automatically annotated with morphosyntactic descriptions and lemmas. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Tag, is also available, cf. http://hdl.handle.net/11356/1081. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1084 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | CMC training corpus Janes-Norm 1.1 |
dc.type | corpus |
dcterms.isReplacedBy | http://hdl.handle.net/11356/1084 |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
sponsor | Swiss National Science Foundation 160501 ReLDI Other |
size.info | 7816 texts |
size.info | 184766 tokens |
files.count | 4 |
files.size | 4155553 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name: RASLAN16-Janes.pdf
- Size: 210.85 KB
- Format:
- Description: RASLAN'16 paper describing the corpus
- MD5: 7487b904191a41f8cb38bbdfe12ba14e

- Name: Janes-smernice-v1.0.pdf
- Size: 339.69 KB
- Format:
- Description: Annotation Guidelines (in Slovene)
- MD5: 39845f938d68b0e3330259eab31c6043

- Name: Janes-Norm.zip
- Size: 2.05 MB
- Format: application/zip
- Description: Corpus in TEI format
- MD5: 9d94f664c34da23f364fef95b23644c9
- Archive contents:
  - Janes-Norm/
    - janes.norm.xml (12 kB)
    - msd-fslib-sl.xml (461 kB)
    - janes.norm.body.xml (14 MB)
    - schema/
      - tei_janes_doc.html (2 MB)
      - tei_janes.rng (399 kB)
      - tei_janes_schema.xml (2 kB)
      - tei_janes.zip (44 kB)
      - tei_janes.rnc (188 kB)

- Name: Janes-Norm.vert.zip
- Size: 1.38 MB
- Format: application/zip
- Description: Derived corpus in vertical format
- MD5: 631fa5809be347f0269f8cc11abb5d7d