Show simple item record

 
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.date.accessioned 2016-12-18T16:00:18Z
dc.date.available 2016-12-22T10:10:03Z
dc.date.issued 2016-12-22
dc.identifier.uri http://hdl.handle.net/11356/1079
dc.description Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available, cf. http://hdl.handle.net/11356/1080.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreplacedby http://hdl.handle.net/11356/1081
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://nl.ijs.si/janes/
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject tagging
dc.subject lemmatisation
dc.subject manual annotation
dc.subject TEI
dc.title CMC training corpus Janes-Tag 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
size.info 2958 texts
size.info 75276 tokens
files.count 4
files.size 2244018


 Files in this item

 Download all files in item (2.14 MB)
This item is Publicly Available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name RASLAN16-Janes.pdf
Size 210.85 KB
Format PDF
Description RASLAN'16 paper describing the corpus
MD5 7487b904191a41f8cb38bbdfe12ba14e
Name Janes-smernice-v1.0.pdf
Size 339.69 KB
Format PDF
Description Annotation Guidelines (in Slovene)
MD5 39845f938d68b0e3330259eab31c6043
Name Janes-Tag.vert.zip
Size 536.93 KB
Format application/zip
Description Corpus in vertical format
MD5 6919d06b5011bb6b8cb739eb11271dc1
File Preview
    • janes.tag.vert (2 MB)
    • janes.tag.regi (870 B)
Name Janes-Tag.zip
Size 1.08 MB
Format application/zip
Description Corpus in TEI format
MD5 e484a07e41606415f801e749f9efd6ff
File Preview
  • Janes-Tag
    • msd-fslib-sl.xml (461 kB)
    • janes.tag.xml (9 kB)
    • schema
      • tei_janes_doc.html (2 MB)
      • tei_janes.rng (399 kB)
      • tei_janes_schema.xml (2 kB)
      • tei_janes.zip (44 kB)
      • tei_janes.rnc (188 kB)
    • janes.tag.body.xml (5 MB)
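
The MD5 values listed for each file can be used to check that a download arrived intact. The following is a minimal sketch of such a check, not part of the record itself. It assumes the four files were saved under their listed names in a single local directory; the helper names `md5_of` and `verify_downloads` are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "Janes-Tag.vert.zip": "6919d06b5011bb6b8cb739eb11271dc1",
    "Janes-Tag.zip": "e484a07e41606415f801e749f9efd6ff",
}


def md5_of(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the downloads sit in the current working directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

A file reported as `MISSING OR MISMATCH` is either absent from the directory or differs from the copy described here, in which case downloading it again is the usual first step.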
