dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Stojanovska, Biljana |
dc.date.accessioned | 2023-12-22T08:40:16Z |
dc.date.available | 2023-12-22T08:40:16Z |
dc.date.issued | 2023-12-20 |
dc.identifier.uri | http://hdl.handle.net/11356/1886 |
dc.description | The SETimes.MK corpus is a sample of 570 sentences from the now-unavailable setimes.com website, which published news articles on topics concerning South-Eastern Europe. Sentence splitting and tokenisation were corrected manually. The morphosyntactic labels (following the MULTEXT-East standard for Macedonian, https://nl.ijs.si/ME/V6/msd/html/msd-mk.html) and the lemmas were first annotated automatically with two iterations of preliminary models for Macedonian in the CLASSLA-Stanza tool (https://pypi.org/project/classla/) and then corrected manually. The UPOS+FEATS morphosyntactic description was assigned with the mapper available at https://github.com/clarinsi/macedonian-tagset-mapping. The included sentences have parallel counterparts in the Croatian hr500k dataset (http://hdl.handle.net/11356/1792) and the Serbian SETimes.SR dataset (http://hdl.handle.net/11356/1843), and the sentence identifiers can be used to match corresponding sentences. Please note that the dataset does not completely follow the Universal Dependencies specifications for Macedonian (https://universaldependencies.org/mk/index.html): the UPOS+FEATS features in the dataset are based on the MULTEXT-East specifications, which differ in certain respects from the Universal Dependencies specifications for Macedonian. |
dc.language.iso | mkd |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.si/info/k-centre/ |
dc.subject | manual annotation |
dc.subject | morphology |
dc.subject | lemmatisation |
dc.title | Macedonian linguistic training corpus SETimes.MK 0.1 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 570 sentences |
size.info | 13310 tokens |
files.count | 1 |
files.size | 1148382 |
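
The description states that sentence identifiers can be used to match sentences with their hr500k and SETimes.SR counterparts. A minimal sketch of that matching step follows, assuming the identifiers are stored as `# sent_id = ...` comment lines, which is the usual CoNLL-U convention but is not confirmed by this record. The function names `read_conllu`, `index_by_sent_id` and `shared_ids` are illustrative and not part of any released tool.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# One sentence: (comment metadata, token rows split into CoNLL-U columns).
Sentence = Tuple[Dict[str, str], List[List[str]]]


def read_conllu(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (metadata, token_rows) for each sentence in a CoNLL-U stream."""
    meta: Dict[str, str] = {}
    rows: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if rows:
                yield meta, rows
            meta, rows = {}, []
        elif line.startswith("#"):
            # Comment lines of the form "# key = value" carry metadata.
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            rows.append(line.split("\t"))
    if rows:
        # Handle a file that does not end with a blank line.
        yield meta, rows


def index_by_sent_id(sentences: Iterable[Sentence]) -> Dict[str, List[List[str]]]:
    """Map sent_id -> token rows (assumes a '# sent_id = ...' comment exists)."""
    return {meta["sent_id"]: rows for meta, rows in sentences if "sent_id" in meta}


def shared_ids(first: Dict[str, object], second: Dict[str, object]) -> List[str]:
    """Return the sentence identifiers present in both indexes, sorted."""
    return sorted(set(first) & set(second))
```

To use it, open `setimes.mk.0.1.conllu` with `encoding="utf-8"`, pass the file object to `read_conllu`, build an index with `index_by_sent_id`, do the same for a locally downloaded copy of one of the parallel datasets, and call `shared_ids` on the two indexes. Whether the identifiers in the three datasets are string-identical or need normalising first is not stated in this record and should be checked against the files.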
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name: setimes.mk.0.1.conllu
- Size: 1.1 MB
- Format: Unknown
- Description: CoNLL-U file
- MD5: 76065be35c2b2e4c66d5c0dfe48a23a3
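
The MD5 value listed above can be used to check that a downloaded copy of the file is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_record` are illustrative, and the local path passed to them is whatever location the file was saved to.

```python
import hashlib

# MD5 checksum as listed in this record for setimes.mk.0.1.conllu.
EXPECTED_MD5 = "76065be35c2b2e4c66d5c0dfe48a23a3"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Check a local file against the checksum given in the record."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_record("setimes.mk.0.1.conllu")` returns `True` only if the local file is byte-identical to the one the checksum was computed from. MD5 is adequate for detecting accidental corruption during download, but it is not a safeguard against deliberate tampering.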