Lenardič, Jakob; Čibej, Jaka; Arhar Holdt, Špela; Erjavec, Tomaž; Fišer, Darja
2022-12-06T13:44:02Z 2022-12-06T13:44:02Z 2022-12-06
dc.description Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 20,000 short texts (280,000 words), mostly tweets but also blogs, forums and news comments. The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, and word normalisation of non-standard Slovene. The corpus is composed of three parts. The first is Janes-Norm 1.2 proper (5,000 texts and 93,000 words, texts up to 2016), which has automatically assigned lemmas and morphosyntactic tags. The other two parts constitute the complete Janes-Tag 3.0 corpus, which has manually annotated morphosyntactic tags, lemmas and named entities (15,000 texts and 20,000 words). Janes-Tag 3.0 consists of the older Janes-Tag 2.1 (texts up to 2016) and the newer Janes-RSDO (tweets only, texts up to 2022). Janes-Norm and Janes-Tag (but not Janes-RSDO) have texts classified according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness. The data is available in the source TEI encoding and in the derived CoNLL-U format. All three parts contain lemmas and JOS/MULTEXT-East morphosyntactic descriptions, while Janes-Tag and Janes-RSDO also contain Universal Dependencies morphological features, and Janes-Tag also contains named entity annotations. Compared to the previous version, this one corrects some capitalisation errors in normalised words of Janes-Norm, updates the encoding, and adds Janes-RSDO. The first version of this corpus is described in: FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2018. The Janes project: language resources and tools for Slovene user generated content. Language Resources & Evaluation.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.label PUB
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject manual annotation
dc.subject TEI
dc.title CMC training corpus Janes-Norm 3.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Jakob Lenardič Institute of Contemporary History
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor Swiss National Science Foundation 160501 ReLDI Other
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
19771 texts, 283258 words, 327448 tokens
files.count 2
files.size 12750894
featuredService.kontext search|
featuredService.noske search|

 Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Corpus in TEI format (3.69 MB)
  • Janes-Norm.3.0.TEI
    • janes-norm.xml (5 MB)
    • janes-rsdo.xml (14 MB)
    • Janes-Norm.3.0.xml (17 kB)
    • janes-tag.xml (7 MB)
    • schema
      • tei_clarin_example.xml (48 kB)
      • tei_clarin.rnc (321 kB)
      • tei_clarin_schema.xml (70 kB)
      • README.md (525 B)
      • tei_clarin.rng (683 kB)
    • 00README.txt (1 kB)
Corpus in CoNLL-U format (8.47 MB)
  • Janes-Norm.3.0.CoNLL-U
    • janes-tag.jos.conllu (7 MB)
    • janes-tag.ud.conllu (5 MB)
    • janes-rsdo.jos.connlu (14 MB)
    • janes-norm.jos.conllu (9 MB)
    • tei2conllu.xsl (26 kB)
    • janes-rsdo.ud.connlu (11 MB)
    • 00README.txt (1 kB)
    • janes-norm.ud.conllu (5 MB)
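
The CoNLL-U files listed above can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the distributed corpus: it relies only on the generic CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, a blank line between sentences). The names `FIELDS`, `read_conllu`, `count_words` and the two-sentence `SAMPLE` are hypothetical and were invented for illustration; the sample is not taken from the corpus, and the sketch makes no assumption about how this corpus fills the optional columns.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. sent_id, text) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def count_words(sentences: Iterable[List[Token]]) -> int:
    """Count rows with a plain integer ID (skips ranges such as 1-2)."""
    return sum(1 for sent in sentences for tok in sent if tok["id"].isdigit())


# Hypothetical two-sentence sample, invented for this sketch.
SAMPLE = (
    "# sent_id = demo.1\n"
    "1\tjst\tjaz\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tgrem\titi\tVERB\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = demo.2\n"
    "1\tok\tok\tINTJ\t_\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    sentences = list(read_conllu(SAMPLE.splitlines()))
    print(len(sentences), "sentences,", count_words(sentences), "words")
```

To run the same reader on one of the distributed files, pass an open file handle instead of the sample, for example `read_conllu(open("janes-tag.ud.conllu", encoding="utf-8"))`, assuming the archive has been extracted into the working directory.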
