dc.contributor.author | Lenardič, Jakob |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.date.accessioned | 2022-12-06T13:44:02Z |
dc.date.available | 2022-12-06T13:44:02Z |
dc.date.issued | 2022-12-06 |
dc.identifier.uri | http://hdl.handle.net/11356/1733 |
dc.description | Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs, forums and news comments. The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, and word normalisation of non-standard Slovene. The corpus is composed of three parts. One is Janes-Norm 1.2 proper (5,000 texts and 93,000 words, texts up to 2016), which has automatically assigned lemmas and morphosyntactic tags. The other two parts constitute the complete Janes-Tag 3.0 (http://hdl.handle.net/11356/1732) corpus, which has manually annotated morphosyntactic tags, lemmas and named entities (15,000 texts and 20,000 words). Janes-Tag 3.0 consists of the older Janes-Tag 2.1 (texts up to 2016) and the newer Janes-RSDO (tweets only, texts up to 2022). Both Janes-Norm and Janes-Tag (but not Janes-RSDO) have texts classified according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness. The data is available in the source TEI encoding and in the derived CoNLL-U format. All three parts contain lemmas and JOS/MULTEXT-East morphosyntactic descriptions, while Janes-Tag and Janes-RSDO also contain Universal Dependencies morphological features, and Janes-Tag also contains named entity annotations. Compared to the previous version, this one corrects some capitalisation errors in the normalised words of Janes-Norm, updates the encoding, and adds Janes-RSDO. The first version of this corpus is described in: FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation. https://rdcu.be/7RX4 |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Norm |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-018-9425-z |
dc.relation.replaces | http://hdl.handle.net/11356/1084 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | CMC training corpus Janes-Norm 3.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Jakob Lenardič; jakob.lenardic@inz.si; Institute of Contemporary History |
sponsor | ARRS (Slovenian Research Agency); J6-6842; JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); P2-103; Knowledge Technologies; nationalFunds |
sponsor | Swiss National Science Foundation; 160501; ReLDI; Other |
sponsor | Ministry of Culture; C3340-20-278001; Development of Slovene in a Digital Environment; Other |
size.info | 19771 texts |
size.info | 283258 words |
size.info | 327448 tokens |
files.count | 2 |
files.size | 12750894 |
featuredService.kontext | search|https://www.clarin.si/kontext/query?corpname=janes_norm30 |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=janes_norm30 |
Files in this item
Download all files in item (12.16 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)




- Name: Janes-Norm.3.0.TEI.zip
- Size: 3.69 MB
- Format: application/zip
- Description: Corpus in TEI format
- MD5: 39ac6cc230fc6229480eeed00e433695
- Contents of Janes-Norm.3.0.TEI:
  - janes-norm.xml (5 MB)
  - janes-rsdo.xml (14 MB)
  - Janes-Norm.3.0.xml (17 kB)
  - janes-tag.xml (7 MB)
  - schema
  - tei_clarin_example.xml (48 kB)
  - tei_clarin.rnc (321 kB)
  - tei_clarin_schema.xml (70 kB)
  - README.md (525 B)
  - tei_clarin.rng (683 kB)
  - 00README.txt (1 kB)

- Name: Janes-Norm.3.0.CoNLL-U.zip
- Size: 8.47 MB
- Format: application/zip
- Description: Corpus in CoNLL-U format
- MD5: d5f02ff0c6a7ca43ccf138760b5654d7
- Contents of Janes-Norm.3.0.CoNLL-U:
  - janes-tag.jos.conllu (7 MB)
  - janes-tag.ud.conllu (5 MB)
  - janes-rsdo.jos.connlu (14 MB)
  - janes-norm.jos.conllu (9 MB)
  - tei2conllu.xsl (26 kB)
  - janes-rsdo.ud.connlu (11 MB)
  - 00README.txt (1 kB)
  - janes-norm.ud.conllu (5 MB)
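
The MD5 values listed for the two archives can be used to check that a download arrived intact. The following is a minimal sketch, not part of the repository's own tooling: the function names `md5_of_file` and `check_downloads` are hypothetical, and it assumes both archives were saved under their listed names in a single local directory.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "Janes-Norm.3.0.TEI.zip": "39ac6cc230fc6229480eeed00e433695",
    "Janes-Norm.3.0.CoNLL-U.zip": "d5f02ff0c6a7ca43ccf138760b5654d7",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each archive name to True if it exists and its MD5 matches.

    Assumption: the archives keep the file names shown in this record.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in check_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Run from the download directory, the script reports each archive as OK or as missing or mismatched. An MD5 match only guards against transfer corruption, not against tampering.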