dc.contributor.author Erjavec, Tomaž
dc.contributor.author Krek, Simon
dc.date.accessioned 2015-06-06T22:24:21Z
dc.date.available 2015-06-06T22:24:21Z
dc.date.issued 2010-03-07
dc.identifier.uri http://hdl.handle.net/11356/1037
dc.description The jos1M corpus contains 1 million words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated for morphosyntactic descriptions (fine-grained PoS tags) and lemmas, with about one fourth of the most problematic annotations hand-validated. The corpus is available in source TEI P5 XML and in the simpler and smaller vertical format, used by various concordancers. Note that the vertical format does not contain all of the information from the source TEI.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2010/summaries/139.html
dc.relation.isreplacedby http://hdl.handle.net/11356/1213
dc.rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc/4.0/
dc.rights.label PUB
dc.source.uri https://nl.ijs.si/jos/jos1M-en.html
dc.subject tagging
dc.subject lemmatisation
dc.subject manual annotation
dc.subject TEI
dc.title Training corpus jos1M 1.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
demo.uri https://nl.ijs.si/jos/jos1M/jos1Mv1_1_hdr-en.html
contact.person Tomaž Erjavec; tomaz.erjavec@ijs.si; Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency); J2-9180; Linguistic annotation of Slovene; nationalFunds
sponsor EU FP6; 033917; SMART “Statistical Multilingual Analysis for Retrieval and Translation”; Other
sponsor Ministry of Higher Education, Science and Technology; European Fund for Regional Development; Mobile reader for blind and sight impaired persons; nationalFunds
size.info 1000019 words
files.count 3
files.size 26762709


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
Name: jos1Mv1_1-xml.zip
Size: 12.81 MB
Format: application/zip
Description: TEI encoded texts
MD5: 8e7cd16c3d0709f455afc04ca46612f0
File preview:
  • jos1M
    • jos1M-10.xml (8 MB)
    • jos1M-08.xml (8 MB)
    • jos1Mv1_1.xml (1 MB)
    • jos1M-07.xml (8 MB)
    • jos1M-06.xml (8 MB)
    • jos1M-05.xml (8 MB)
    • jos1M-04.xml (8 MB)
    • jos1M-03.xml (8 MB)
    • jos1M-02.xml (9 MB)
    • jos1M-01.xml (8 MB)
    • jos1M-09.xml (8 MB)

Name: jos1Mv1_1-en.zip
Size: 6.34 MB
Format: application/zip
Description: Vertical format, MSDs in English
MD5: 8b28c38845ace784f46e91629697f804
File preview:
  • jos1M
    • jos1M-04-en.cqp (2 MB)
    • jos1M-09-en.cqp (2 MB)
    • josMSD-canon-en.tbl (366 kB)
    • jos1M-03-en.cqp (2 MB)
    • jos1M-10-en.cqp (2 MB)
    • jos1Mv1_1_hdr-en.html (5 MB)
    • jos1M-08-en.cqp (2 MB)
    • jos1M-02-en.cqp (2 MB)
    • jos1M-07-en.cqp (2 MB)
    • jos1M-01-en.cqp (2 MB)
    • jos1M-06-en.cqp (2 MB)
    • jos1M-05-en.cqp (2 MB)

Name: jos1Mv1_1-sl.zip
Size: 6.37 MB
Format: application/zip
Description: Vertical format, MSDs in Slovene
MD5: 43754624fe081bec907036642676fc67
File preview:
  • jos1M
    • jos1M-07-sl.cqp (2 MB)
    • jos1M-01-sl.cqp (2 MB)
    • jos1M-06-sl.cqp (2 MB)
    • jos1M-05-sl.cqp (2 MB)
    • jos1Mv1_1_hdr-sl.html (6 MB)
    • jos1M-04-sl.cqp (2 MB)
    • josMSD-canon-sl.tbl (381 kB)
    • jos1M-10-sl.cqp (2 MB)
    • jos1M-09-sl.cqp (2 MB)
    • jos1M-03-sl.cqp (2 MB)
    • jos1M-08-sl.cqp (2 MB)
    • jos1M-02-sl.cqp (2 MB)
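
Each archive in this item is listed with an MD5 checksum, so a download can be checked for corruption before unpacking. The following is a minimal sketch of such a check, not part of the record itself. The archive names and MD5 values are copied from the listing above; the helper names md5_of and check_downloads are hypothetical, and the sketch assumes the archives were saved under their original names in one local directory.

```python
import hashlib
from pathlib import Path

# Archive names and MD5 values as given in this item's file listing.
EXPECTED_MD5 = {
    "jos1Mv1_1-xml.zip": "8e7cd16c3d0709f455afc04ca46612f0",
    "jos1Mv1_1-en.zip": "8b28c38845ace784f46e91629697f804",
    "jos1Mv1_1-sl.zip": "43754624fe081bec907036642676fc67",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each archive name to True (digest matches), False (mismatch),
    or None (file not found in the directory). Hypothetical helper."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results
```

Calling check_downloads("some/download/dir") returns one entry per archive; a False value suggests the file should be downloaded again. MD5 is used here only because it is the checksum this record publishes; it guards against transfer errors, not against deliberate tampering.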
