CMC training corpus Janes-Syn 1.0

Name: CMC training corpus Janes-Syn 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Arhar Holdt, Špela; Erjavec, Tomaž; Fišer, Darja

dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Erjavec, Tomaž
dc.contributor.author	Fišer, Darja
dc.date.accessioned	2017-01-03T11:38:46Z
dc.date.available	2017-01-03T11:38:46Z
dc.date.issued	2017-01-03
dc.identifier.uri	http://hdl.handle.net/11356/1086
dc.description	Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene computer-mediated communication and for detailed linguistic explorations which require highly accurate and reliable annotations. Words in the dataset are normalised, lemmatised, PoS-tagged and syntactically annotated with the JOS dependency model (http://eng.slovenscina.eu/tehnologije/razclenjevalnik). The annotations on all levels were manually corrected. The corpus creation and structure are described in: ARHAR HOLDT, Špela, FIŠER, Darja, ERJAVEC, Tomaž, KREK, Simon. Syntactic annotation of Slovene CMC : first steps. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 27-28 September 2016, Ljubljana, Slovenia, 2016, pp. 3-6. https://nl.ijs.si/janes/cmc-corpora2016/proceedings/ Janes-Syn was created from two larger corpora that are also available in the repository: Janes-Norm (http://hdl.handle.net/11356/1084) and Janes-Tag (http://hdl.handle.net/11356/1123).
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Syn
dc.relation.isreferencedby	https://doi.org/10.1007/s10579-018-9425-z
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://nl.ijs.si/janes/
dc.subject	computer-mediated communication
dc.subject	tokenisation
dc.subject	dependency treebank
dc.subject	syntactic annotation
dc.subject	manual annotation
dc.subject	TEI
dc.title	CMC training corpus Janes-Syn 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
size.info	168 texts
size.info	4388 tokens
size.info	3734 words
files.count	4
files.size	1705400
featuredService.kontext	Search|https://www.clarin.si/kontext/first_form?corpname=janes_syn
featuredService.noske	Search|https://www.clarin.si/ske/#dashboard?corpname=janes_syn

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: CMC-2016_Arhar_et_al_Syntactic-Annotation-of-Slovene-CMC.pdf
Size: 273.92 KB
Format: PDF
Description: CMC'16 paper describing the corpus
MD5: c316828282b4e33d3c26a7738f41f0b7

Name: Janes-skladnja-v1.0.pdf
Size: 834.5 KB
Format: PDF
Description: Annotation Guidelines (in Slovene)
MD5: 122f8dcb2043c8ef7dff3bbb5c9d51e8

Name: Janes-Syn.zip
Size: 493.23 KB
Format: application/zip
Description: Corpus in TEI format
MD5: f8ce611ea9d52037c8481a55325d2a4f

File Preview

Janes-Syn
- msd-fslib-sl.xml (461 kB)
- janes.syn.body.xml (788 kB)
- schema
  - tei_janes_doc.html (2 MB)
  - tei_janes.rng (399 kB)
  - tei_janes_schema.xml (2 kB)
  - tei_janes.zip (44 kB)
  - tei_janes.rnc (188 kB)
- janes.syn.xml (21 kB)

Name: Janes-Syn.vert.zip
Size: 63.78 KB
Format: application/zip
Description: Corpus in vertical format
MD5: 07f0fc164924709be93aadea4c6455ec

File Preview

- janes.syn.vert (374 kB)
- janes.syn.regi (1 kB)
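
The MD5 values listed for each file can be used to check that a download arrived intact. The following is a minimal sketch, not part of the repository record: the function names `md5_of` and `verify_downloads` are hypothetical, and it assumes the four files were saved under their listed names in a single local directory.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "CMC-2016_Arhar_et_al_Syntactic-Annotation-of-Slovene-CMC.pdf": "c316828282b4e33d3c26a7738f41f0b7",
    "Janes-skladnja-v1.0.pdf": "122f8dcb2043c8ef7dff3bbb5c9d51e8",
    "Janes-Syn.zip": "f8ce611ea9d52037c8481a55325d2a4f",
    "Janes-Syn.vert.zip": "07f0fc164924709be93aadea4c6455ec",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        # Assumption: files keep the names shown in the listing.
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as "missing or checksum mismatch" is either absent from the directory or differs from the version described here, in which case downloading it again is the simplest remedy.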
