dc.contributor.author Ljubešić, Nikola
dc.contributor.author Stojanovska, Biljana
dc.date.accessioned 2023-12-22T08:40:16Z
dc.date.available 2023-12-22T08:40:16Z
dc.date.issued 2023-12-20
dc.identifier.uri http://hdl.handle.net/11356/1886
dc.description The SETimes.MK corpus is a sample of 570 sentences from news articles on South-Eastern European topics, taken from the now unavailable setimes.com website. The sentences were manually corrected for sentence splitting and tokenisation, while the morphosyntactic labels (following the MULTEXT-East standard for Macedonian, https://nl.ijs.si/ME/V6/msd/html/msd-mk.html) and lemmas were automatically annotated with two iterations of preliminary models for Macedonian in the CLASSLA-Stanza tool (https://pypi.org/project/classla/), after which they were manually corrected. The UPOS+UFEATS morphosyntactic description was assigned with the mapper available at https://github.com/clarinsi/macedonian-tagset-mapping. The included sentences have parallel counterparts in the Croatian hr500k dataset (http://hdl.handle.net/11356/1792) and the Serbian SETimes.SR dataset (http://hdl.handle.net/11356/1843), and the sentence identifiers can be used to match corresponding sentences. Please note that the dataset does not completely follow the Universal Dependencies specifications for Macedonian (https://universaldependencies.org/mk/index.html), as the UPOS+FEATS features in the dataset are based on the MULTEXT-East specifications, which differ in certain respects from the Universal Dependencies specifications for Macedonian.
dc.language.iso mkd
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.si/info/k-centre/
dc.subject manual annotation
dc.subject morphology
dc.subject lemmatisation
dc.title Macedonian linguistic training corpus SETimes.MK 0.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 570 sentences
size.info 13310 tokens
files.count 1
files.size 1148382
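
The size figures and the sentence-identifier matching mentioned in the description can be checked with a short script. The following is a minimal sketch, not part of the record: it assumes the file is standard CoNLL-U (blank-line separated sentences, `#` comment lines, tab-separated columns), that identifiers are stored in `# sent_id = ...` comments, and that parallel sentences carry identical identifier strings across datasets. The function names are hypothetical.

```python
from pathlib import Path


def read_sentences(lines):
    """Yield (sent_id, tokens) pairs from an iterable of CoNLL-U lines."""
    sent_id, tokens = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield sent_id, tokens
            sent_id, tokens = None, []
            continue
        if line.startswith("#"):
            # Assumption: identifiers use the standard "# sent_id = ..." comment.
            if line.startswith("# sent_id"):
                sent_id = line.partition("=")[2].strip()
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append(cols)
    if tokens:
        yield sent_id, tokens


def summarise(lines):
    """Return (number of sentences, number of tokens)."""
    sentences = list(read_sentences(lines))
    return len(sentences), sum(len(toks) for _, toks in sentences)


def shared_sentence_ids(lines_a, lines_b):
    """Return identifiers present in both inputs (exact string match assumed)."""
    ids_a = {sid for sid, _ in read_sentences(lines_a) if sid}
    ids_b = {sid for sid, _ in read_sentences(lines_b) if sid}
    return ids_a & ids_b


if __name__ == "__main__":
    path = Path("setimes.mk.0.1.conllu")
    if path.exists():
        with path.open(encoding="utf-8") as handle:
            n_sentences, n_tokens = summarise(handle)
        print(f"{n_sentences} sentences, {n_tokens} tokens")
```

The printed counts can be compared against the size.info values above. To look for parallel sentences, pass the open SETimes.MK file and an open hr500k or SETimes.SR file to `shared_sentence_ids`; if the identifiers differ in format between datasets, a normalisation step would be needed first.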


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name setimes.mk.0.1.conllu
Size 1.1 MB
Format Unknown
Description CoNLL-U file
MD5 76065be35c2b2e4c66d5c0dfe48a23a3
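
The MD5 value listed for the file can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib` module; the local file path and the function name `md5_of_file` are assumptions for illustration, and the expected digest is the one given in this record.

```python
import hashlib
from pathlib import Path

# MD5 digest as listed in this record.
EXPECTED_MD5 = "76065be35c2b2e4c66d5c0dfe48a23a3"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumption: the file was saved under its original name in the working directory.
    path = Path("setimes.mk.0.1.conllu")
    if path.exists():
        actual = md5_of_file(path)
        if actual == EXPECTED_MD5:
            print("MD5 matches the value in the record")
        else:
            print(f"MD5 mismatch: {actual}")
```

MD5 is adequate here for detecting a corrupted or truncated download; it is not a safeguard against deliberate tampering.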