dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.date.accessioned 2016-12-28T11:41:07Z
dc.date.available 2016-12-28T11:41:07Z
dc.date.issued 2016-12-28
dc.identifier.uri http://hdl.handle.net/11356/1083
dc.description Janes-Norm is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation and word normalisation of non-standard Slovene. The corpus is also automatically annotated with morphosyntactic descriptions and lemmas. Because the tokenisation, segmentation and normalisation layers have been carefully annotated by hand, the corpus is also suitable for detailed linguistic explorations that require highly accurate and reliable annotations. The corpus is further described in: ERJAVEC, Tomaž, ČIBEJ, Jaka, ARHAR HOLDT, Špela, LJUBEŠIĆ, Nikola, FIŠER, Darja. Gold-standard datasets for annotation of Slovene computer-mediated communication. In Proceedings of RASLAN 2016: Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2016, pp. 29-40, https://nlp.fi.muni.cz/raslan/raslan16.pdf. A related corpus, Janes-Tag, is also available; cf. http://hdl.handle.net/11356/1081.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreplacedby http://hdl.handle.net/11356/1084
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://nl.ijs.si/janes/
dc.subject computer-mediated communication
dc.subject tokenisation
dc.subject word normalisation
dc.subject manual annotation
dc.subject TEI
dc.title CMC training corpus Janes-Norm 1.1
dc.type corpus
dcterms.isReplacedBy http://hdl.handle.net/11356/1084
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor Swiss National Science Foundation 160501 ReLDI Other
size.info 7816 texts
size.info 184766 tokens
files.count 4
files.size 4155553


 Files in this item

 Download all files in this item (3.96 MB)

Name RASLAN16-Janes.pdf
Size 210.85 KB
Format PDF
Description RASLAN'16 paper describing the corpus
MD5 7487b904191a41f8cb38bbdfe12ba14e

Name Janes-smernice-v1.0.pdf
Size 339.69 KB
Format PDF
Description Annotation Guidelines (in Slovene)
MD5 39845f938d68b0e3330259eab31c6043

Name Janes-Norm.zip
Size 2.05 MB
Format application/zip
Description Corpus in TEI format
MD5 9d94f664c34da23f364fef95b23644c9
File preview
  • Janes-Norm
    • janes.norm.xml (12 kB)
    • msd-fslib-sl.xml (461 kB)
    • janes.norm.body.xml (14 MB)
    • schema
      • tei_janes_doc.html (2 MB)
      • tei_janes.rng (399 kB)
      • tei_janes_schema.xml (2 kB)
      • tei_janes.zip (44 kB)
      • tei_janes.rnc (188 kB)

Name Janes-Norm.vert.zip
Size 1.38 MB
Format application/zip
Description Derived corpus in vertical format
MD5 631fa5809be347f0269f8cc11abb5d7d
File preview
    • janes.norm.vert (6 MB)
    • janes.norm.regi (892 B)
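
The MD5 values listed for the four files can be used to check that a download arrived intact. The following is a minimal sketch, assuming the files were saved under their listed names into a single directory; the function names `md5_of` and `verify_downloads` are illustrative and not part of the repository.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "RASLAN16-Janes.pdf": "7487b904191a41f8cb38bbdfe12ba14e",
    "Janes-smernice-v1.0.pdf": "39845f938d68b0e3330259eab31c6043",
    "Janes-Norm.zip": "9d94f664c34da23f364fef95b23644c9",
    "Janes-Norm.vert.zip": "631fa5809be347f0269f8cc11abb5d7d",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the downloads are in the current working directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as mismatched is most likely an incomplete download and should be fetched again before unpacking.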
