Montenegrin web corpus meWaC 1.0

Authors: Ljubešić, Nikola and Erjavec, Tomaž

Item identifier: http://hdl.handle.net/11356/1429

Project URL: https://www.clarin.si/info/k-centre/

Referenced by: https://arxiv.org/abs/2104.09243

Date issued: 2021-05-13

Type: corpus, text

Size: 321573 texts, 3654071 sentences, 90871077 tokens

Language(s): Montenegrin

Description: The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated at the paragraph level, normalised via transliteration into the Latin script, and morphosyntactically annotated, lemmatised and dependency-parsed with a prototype version of the classla pipeline (https://pypi.org/project/classla/). Each document is accompanied by URL and title metadata. The corpus is available in CoNLL-U format and as a vertical file (with the registry file included) for mounting on CQP-compatible concordancers.
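
As a rough illustration of working with the CoNLL-U release, the sketch below reads sentences from a text stream using only the generic CoNLL-U conventions: comment lines start with `#`, each token line has ten tab-separated fields, and a blank line ends a sentence. The function name `read_conllu` and the sample sentence are hypothetical; the sample is hand-written for illustration and is not taken from meWaC, and the comment keys that meWaC actually uses for document URL and title are not assumed here.

```python
import io

# Hand-written illustrative sample, NOT taken from the corpus.
# Only ID, FORM, LEMMA and UPOS are filled in; other fields are "_".
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = Ovo je primjer.\n"
    "1\tOvo\tovaj\tDET\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t_\t_\t_\t_\n"
    "3\tprimjer\tprimjer\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_conllu(stream):
    """Yield (comments, tokens) for each sentence in a CoNLL-U stream.

    comments is a list of comment strings without the leading '#';
    tokens is a list of 10-element field lists.
    """
    comments, tokens = [], []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            fields = line.split("\t")
            if len(fields) == 10:  # skip malformed lines
                tokens.append(fields)
    # Handle a final sentence that lacks a trailing blank line.
    if tokens:
        yield comments, tokens


sentences = list(read_conllu(io.StringIO(SAMPLE)))
forms = [token[1] for token in sentences[0][1]]
lemmas = [token[2] for token in sentences[0][1]]
```

Because the function is a generator, the same loop could be pointed at the unpacked corpus file opened with `open(path, encoding="utf-8")` without loading the whole file into memory.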

Publisher: Jožef Stefan Institute

Subject(s): web corpus

Collection(s): CLARIN.SI data & tools

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: meWaC.conllu.zip
Size: 1.16 GB
Format: application/zip
Description: Corpus in CoNLL-U format
MD5: 5acfd8433934eca65bfc1276ae48c34a

Archive contents:

- meWaC.conllu (6 GB)

Name: meWaC.vert.zip
Size: 1.31 GB
Format: application/zip
Description: Corpus in vertical format
MD5: ff174cecd130c51426e714a74a1d5e17

Archive contents:

- meWaC.vert (11 GB)
- mewac.regi (3 kB)
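
The MD5 values listed for the two archives can be checked after download. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify`, and the assumption that the archives sit in the current directory under their listed names, are illustrative only.

```python
import hashlib

# MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "meWaC.conllu.zip": "5acfd8433934eca65bfc1276ae48c34a",
    "meWaC.vert.zip": "ff174cecd130c51426e714a74a1d5e17",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so large archives do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_hex.lower()
```

A call such as `verify("meWaC.conllu.zip", EXPECTED_MD5["meWaC.conllu.zip"])` would then indicate whether the download matches the published checksum.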