dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2016-12-30T14:02:38Z
dc.date.available 2016-12-30T14:02:38Z
dc.date.issued 2016-12-30
dc.identifier.uri http://hdl.handle.net/11356/1085
dc.description Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Slovene. As the corpus has been carefully manually annotated, it is also suitable for detailed linguistic explorations which require highly accurate and reliable annotations. A slightly older version of this corpus is described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. Note that a related corpus, Janes-Norm, is also available; cf. http://hdl.handle.net/11356/1084.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.replaces http://hdl.handle.net/11356/1079
dc.relation.replaces http://hdl.handle.net/11356/1081
dc.relation.isreplacedby http://hdl.handle.net/11356/1123
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://nl.ijs.si/janes/
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject tagging
dc.subject lemmatisation
dc.subject manual annotation
dc.subject TEI
dc.title CMC training corpus Janes-Tag 1.2
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor Swiss National Science Foundation 160501 ReLDI Other
size.info 2958 texts
size.info 75276 tokens
files.count 5
files.size 2805327


 Files in this item

Total size of all files in item: 2.68 MB
This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Name RASLAN16-Janes.pdf
Size 210.85 KB
Format PDF
Description RASLAN'16 paper describing the corpus
MD5 7487b904191a41f8cb38bbdfe12ba14e

Name Janes-smernice-v1.0.pdf
Size 339.69 KB
Format PDF
Description Annotation Guidelines (in Slovene)
MD5 39845f938d68b0e3330259eab31c6043

Name Janes-Tag.zip
Size 1.08 MB
Format application/zip
Description Corpus in TEI format
MD5 0741700ea329a7f58101009ccd74f5b0
File preview
  • Janes-Tag
    • msd-fslib-sl.xml (461 kB)
    • janes.tag.xml (9 kB)
    • schema
      • tei_janes_doc.html (2 MB)
      • tei_janes.rng (399 kB)
      • tei_janes_schema.xml (2 kB)
      • tei_janes.zip (44 kB)
      • tei_janes.rnc (188 kB)
    • janes.tag.body.xml (5 MB)

Name Janes-Tag.vert.zip
Size 556.63 KB
Format application/zip
Description Derived corpus in vertical format
MD5 3327740d9a2e6316345a4db276ccae27
File preview
  • janes.tag.vert (2 MB)
  • janes.tag.regi (1 kB)

Name Janes-Tag.vert.split.zip
Size 528.47 KB
Format application/zip
Description Corpus split into train, dev and test vertical files
MD5 18cb1bacac4a53310e9614737259fa1e
File preview
  • janes.tag.dev.vert (244 kB)
  • janes.tag.train.vert (1 MB)
  • janes.tag.test.vert (246 kB)
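
The MD5 values listed for each file can be used to check that a download arrived intact. The following is a minimal Python sketch, not part of the record itself: the file names and checksums are copied from the listing above, while the helper names `md5_of` and `verify_downloads` and the download directory `"."` are hypothetical choices made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "Janes-Tag.zip": "0741700ea329a7f58101009ccd74f5b0",
    "Janes-Tag.vert.zip": "3327740d9a2e6316345a4db276ccae27",
    "Janes-Tag.vert.split.zip": "18cb1bacac4a53310e9614737259fa1e",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each listed file name to True (match), False (mismatch) or None (file missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    # "." is a hypothetical download directory; adjust as needed.
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {labels[ok]}")
```

A file reported as MISMATCH was most likely truncated or altered in transit and should be downloaded again; MD5 is adequate for this kind of integrity check but is not a defence against deliberate tampering.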
