dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.date.accessioned | 2017-01-03T11:38:46Z |
dc.date.available | 2017-01-03T11:38:46Z |
dc.date.issued | 2017-01-03 |
dc.identifier.uri | http://hdl.handle.net/11356/1086 |
dc.description | Janes-Syn is a syntactically annotated corpus of Slovene tweets and is meant as a gold-standard training and testing dataset for syntactic annotation of Slovene computer-mediated communication and for detailed linguistic explorations which require highly accurate and reliable annotations. Words in the dataset are normalised, lemmatised, PoS-tagged and syntactically annotated with the JOS dependency model (http://eng.slovenscina.eu/tehnologije/razclenjevalnik). The annotations on all levels were manually corrected. The corpus creation and structure are described in: ARHAR HOLDT, Špela, FIŠER, Darja, ERJAVEC, Tomaž, KREK, Simon. Syntactic annotation of Slovene CMC : first steps. Proceedings of the 4th Conference on CMC and Social Media Corpora for the Humanities, 27-28 September 2016, Ljubljana, Slovenia, 2016, pp. 3-6. https://nl.ijs.si/janes/cmc-corpora2016/proceedings/ Janes-Syn was created from two larger corpora that are also available in the repository: Janes-Norm (http://hdl.handle.net/11356/1084) and Janes-Tag (http://hdl.handle.net/11356/1123). |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://nl.ijs.si/janes/viri/rocno-oznaceni-korpusi/#Janes-Syn |
dc.relation.isreferencedby | https://doi.org/10.1007/s10579-018-9425-z |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://nl.ijs.si/janes/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | dependency treebank |
dc.subject | syntactic annotation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | CMC training corpus Janes-Syn 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
size.info | 168 texts |
size.info | 4388 tokens |
size.info | 3734 words |
files.count | 4 |
files.size | 1705400 |
featuredService.kontext | Search|https://www.clarin.si/kontext/first_form?corpname=janes_syn |
featuredService.noske | Search|https://www.clarin.si/ske/#dashboard?corpname=janes_syn |
Files in this item

Name | Size | Format | Description | MD5 |
CMC-2016_Arhar_et_al_Syntactic-Annotation-of-Slovene-CMC.pdf | 273.92 KB | | CMC'16 paper describing the corpus | c316828282b4e33d3c26a7738f41f0b7 |
Janes-skladnja-v1.0.pdf | 834.5 KB | | Annotation Guidelines (in Slovene) | 122f8dcb2043c8ef7dff3bbb5c9d51e8 |
Janes-Syn.zip | 493.23 KB | application/zip | Corpus in TEI format | f8ce611ea9d52037c8481a55325d2a4f |
Janes-Syn.vert.zip | 63.78 KB | application/zip | Corpus in vertical format | 07f0fc164924709be93aadea4c6455ec |
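
The MD5 values listed for the four files can be used to check that a download arrived intact. The following is a minimal sketch, not part of the repository's own tooling: it assumes the files were saved under their listed names in a single local directory, and the function names `md5_of_file` and `verify_downloads` are hypothetical helpers introduced here for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "CMC-2016_Arhar_et_al_Syntactic-Annotation-of-Slovene-CMC.pdf": "c316828282b4e33d3c26a7738f41f0b7",
    "Janes-skladnja-v1.0.pdf": "122f8dcb2043c8ef7dff3bbb5c9d51e8",
    "Janes-Syn.zip": "f8ce611ea9d52037c8481a55325d2a4f",
    "Janes-Syn.vert.zip": "07f0fc164924709be93aadea4c6455ec",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the downloaded files sit in the current working directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as missing or mismatched is best downloaded again before the archives are unpacked.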