dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Erjavec, Tomaž |
dc.date.accessioned | 2021-05-13T19:23:06Z |
dc.date.available | 2021-05-13T19:23:06Z |
dc.date.issued | 2021-05-13 |
dc.identifier.uri | http://hdl.handle.net/11356/1429 |
dc.description | The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated at the paragraph level, normalised via transliteration into the Latin script, and morphosyntactically annotated, lemmatised and dependency-parsed with a prototype version of the classla pipeline (https://pypi.org/project/classla/). Each document is accompanied by URL and title metadata. The corpus is available in CoNLL-U format and as a vertical file (with the registry included) for mounting on CQP-compatible concordancers. |
dc.language.iso | cnr |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://arxiv.org/abs/2104.09243 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.si/info/k-centre/ |
dc.subject | web corpus |
dc.title | Montenegrin web corpus meWaC 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
size.info | 321573 texts |
size.info | 3654071 sentences |
size.info | 90871077 tokens |
files.count | 2 |
files.size | 2650200686 |
featuredService.kontext | search|https://www.clarin.si/kontext/first_form?corpname=mewac |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=mewac |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name | meWaC.conllu.zip |
Size | 1.16 GB |
Format | application/zip |
Description | Corpus in CoNLL-U format |
MD5 | 5acfd8433934eca65bfc1276ae48c34a |

Name | meWaC.vert.zip |
Size | 1.31 GB |
Format | application/zip |
Description | Corpus in vertical format |
MD5 | ff174cecd130c51426e714a74a1d5e17 |
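
The MD5 values listed for the two archives can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are invented for illustration, and the local file paths are assumed to match the archive names above.

```python
import hashlib

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "meWaC.conllu.zip": "5acfd8433934eca65bfc1276ae48c34a",
    "meWaC.vert.zip": "ff174cecd130c51426e714a74a1d5e17",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Example usage (assumes the archives sit in the current directory):
# for name, expected in EXPECTED_MD5.items():
#     print(name, "OK" if verify(name, expected) else "MISMATCH")
```

MD5 is adequate here for detecting transfer errors; it is not a safeguard against deliberate tampering.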