The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated at the paragraph level, normalised via transliteration into the Latin script, and morphosyntactically annotated, lemmatised and dependency-parsed with a prototype version of the classla pipeline (https://pypi.org/project/classla/). Each document is accompanied by its URL and title as metadata.
The corpus is available in CoNLL-U format and as a vertical file (with the registry file included) for mounting on CQP-compatible concordancers.
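For readers who want to process the CoNLL-U release directly, the following is a minimal Python sketch of a reader for the standard ten-column CoNLL-U layout. It is an illustration and not part of the corpus distribution: the function name `read_conllu`, the `FIELDS` list and the sample sentence are assumptions made up for this example, and the sample is not taken from meWaC. Comment lines are skipped here, so any document-level metadata stored in them would need separate handling.

```python
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated CoNLL-U columns, in their standard order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines carry metadata; this sketch ignores them.
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}")
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Hypothetical sample, invented for illustration (not from the corpus).
SAMPLE_ROWS = [
    ["1", "Ovo", "ovaj", "DET", "_", "_", "3", "nsubj", "_", "_"],
    ["2", "je", "biti", "AUX", "_", "_", "3", "cop", "_", "_"],
    ["3", "primjer", "primjer", "NOUN", "_", "_", "0", "root", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
SAMPLE = ("# text = Ovo je primjer.\n"
          + "\n".join("\t".join(row) for row in SAMPLE_ROWS)
          + "\n\n")

sentences = list(read_conllu(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

To apply the same reader to a downloaded file, an open file handle can be passed in place of `SAMPLE.splitlines()`, since the function only needs an iterable of lines.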
Jožef Stefan Institute
CLARIN, "CLARIN.SI"
ARRS (Slovenian Research Agency), P6-0411, "Language Resources and Technologies for Slovene"
ARRS (Slovenian Research Agency), N6-0099, "LiLaH: Linguistic Landscape of Hate Speech"