2026-05-21T23:35:34Z
http://www.clarin.si/repository/oai/request

oai:www.clarin.si:11356/1083
2023-03-27T17:01:18Z
hdl_11356_1023
hdl_11356_1024

CMC training corpus Janes-Norm 1.1

Erjavec, Tomaž; Fišer, Darja; Čibej, Jaka; Arhar Holdt, Špela

computer-mediated communication; tokenisation; word normalisation; manual annotation; TEI

Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. The corpus is also automatically annotated with morphosyntactic descriptions and lemmas. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations.

The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf

Note that a related corpus, Janes-Tag, is also available, cf. http://hdl.handle.net/11356/1081.

2016-12-28
corpus
http://hdl.handle.net/11356/1083
slv
http://hdl.handle.net/11356/1084
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
application/pdf
application/pdf
application/zip
application/zip
text/plain; charset=utf-8
downloadable_files_count: 4
Jožef Stefan Institute
https://nl.ijs.si/janes/