dc.contributor.author | Lenardič, Jakob |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Zupan, Katja |
dc.contributor.author | Dobrovoljc, Kaja |
dc.date.accessioned | 2022-12-06T13:46:18Z |
dc.date.available | 2022-12-06T13:46:18Z |
dc.date.issued | 2022-12-06 |
dc.identifier.uri | http://hdl.handle.net/11356/1732 |
dc.description | Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC) consisting of about 15,000 short texts (190,000 words), mostly tweets but also blog posts, forum posts and news comments. The corpus is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity annotation of non-standard Slovene. Because the corpus has been carefully annotated by hand, it is also suitable for detailed linguistic explorations that require highly accurate and reliable annotations. The corpus is composed of two parts: the older (texts up to 2016) and smaller (65,000 words) Janes-Tag 2.1, and the newer, tweet-only (2022, 125,000 words) Janes-RSDO. Only the Janes-Tag 2.1 part is annotated with named entities and with a classification of the texts according to their estimated technical (T1-T3) and linguistic (L1-L3) standardness. The data is available in the source TEI encoding and in the derived CoNLL-U format. Both contain JOS/MULTEXT-East morphosyntactic descriptions as well as Universal Dependencies morphological features. Compared to the previous version, this one corrects some errors, updates the encoding, and adds Janes-RSDO. The first version of this corpus is described in: FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. 2020. The Janes project: language resources and tools for Slovene user generated content. Language Resources and Evaluation (https://doi.org/10.1007/s10579-018-9425-z). Note that a related corpus, Janes-Norm 3.0 (http://hdl.handle.net/11356/1733), is also available. It contains Janes-Tag 3.0 and an additional subcorpus with manually checked sentences, tokens and normalised words but only automatically assigned lemmas and MULTEXT-East MSDs. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Tag |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-018-9425-z |
dc.relation.replaces | http://hdl.handle.net/11356/1238 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | part-of-speech tagging |
dc.subject | lemmatisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.subject | named entities |
dc.title | CMC training corpus Janes-Tag 3.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Jakob Lenardič; jakob.lenardic@inz.si; Institute of Contemporary History |
sponsor | ARRS (Slovenian Research Agency); J6-6842; JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); P2-103; Knowledge Technologies; nationalFunds |
sponsor | Swiss National Science Foundation; 160501; ReLDI; Other |
sponsor | ARRS (Slovenian Research Agency); MR-37487; Young Researcher Programme; nationalFunds |
sponsor | Jožef Stefan Institute; CLARIN; CLARIN.SI; nationalFunds |
sponsor | Ministry of Culture; C3340-20-278001; Development of Slovene in a Digital Environment; Other |
size.info | 14913 texts |
size.info | 217774 tokens |
size.info | 190268 words |
files.count | 2 |
files.size | 9054146 |
featuredService.kontext | search|https://www.clarin.si/kontext/query?corpname=janes_tag30 |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=janes_tag30 |
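The description above states that the corpus is distributed in CoNLL-U format with morphosyntactic descriptions and Universal Dependencies features. As a minimal sketch of how such files could be read, the Python below parses the ten tab-separated CoNLL-U columns defined by the Universal Dependencies specification. The function names (`parse_feats`, `read_conllu`) and the sample sentence are invented for illustration and are not taken from the corpus; which column carries the JOS/MULTEXT-East MSD in the actual files is not stated in this record and should be checked against the bundled 00README.txt.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_feats(value: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if value == "_":
        return {}
    return dict(pair.split("=", 1) for pair in value.split("|"))


def read_conllu(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():          # a blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment/metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        sentence.append(token)
    if sentence:                      # file may lack a final blank line
        yield sentence


# Hypothetical sample, written for this sketch (NOT corpus data).
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je test.\n"
    "1\tTo\tta\tPRON\t_\tCase=Nom|Gender=Neut|Number=Sing|PronType=Dem\t_\t_\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t_\t_\t_\t_\n"
    "3\ttest\ttest\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Sing\t_\t_\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for tok in sentences[0]:
    print(tok["form"], tok["lemma"], tok["upos"], tok["feats"])
```

To read a real file, one could pass an open file handle instead of the sample, for example `read_conllu(open(path, encoding="utf-8"))`, since the function only needs an iterable of lines.
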
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name: Janes-Tag.3.0.TEI.zip
- Size: 2.74 MB
- Format: application/zip
- Description: Corpus in TEI format
- MD5: 66ff9bc7b8c1d0147a77314bfaf3f4c8
- Janes-Tag.3.0.TEI
- janes-rsdo.xml (14 MB)
- janes-tag.xml (7 MB)
- schema
- tei_clarin_example.xml (48 kB)
- tei_clarin.rnc (321 kB)
- tei_clarin_schema.xml (70 kB)
- README.md (525 B)
- tei_clarin.rng (683 kB)
- Janes-Tag.3.0.xml (16 kB)
- 00README.txt (1 kB)

- Name: Janes-Tag.3.0.CoNLL-U.zip
- Size: 5.9 MB
- Format: application/zip
- Description: Corpus in CoNLL-U format
- MD5: e1166f9047438817b558d435f55def5b
- Janes-Tag.3.0.CoNLL-U
- janes-tag.jos.conllu (7 MB)
- janes-tag.ud.conllu (5 MB)
- janes-rsdo.jos.connlu (14 MB)
- tei2conllu.xsl (26 kB)
- janes-rsdo.ud.connlu (11 MB)
- 00README.txt (1 kB)
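
The MD5 values listed for the two archives can be used to check that a download is intact. The following is a minimal sketch, assuming the archives were saved under their listed names in a local directory; the helper names (`md5_of_file`, `verify_downloads`) and the directory argument are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "Janes-Tag.3.0.TEI.zip": "66ff9bc7b8c1d0147a77314bfaf3f4c8",
    "Janes-Tag.3.0.CoNLL-U.zip": "e1166f9047438817b558d435f55def5b",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory=".") -> dict:
    """Map each expected archive name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

MD5 is adequate here only as a transfer-integrity check against the published values; it is not a guarantee against deliberate tampering.
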