dc.contributor.author Ljubešić, Nikola
dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2021-05-13T19:23:06Z
dc.date.available 2021-05-13T19:23:06Z
dc.date.issued 2021-05-13
dc.identifier.uri http://hdl.handle.net/11356/1429
dc.description The Montenegrin web corpus meWaC was built by crawling the .me top-level domain in 2019. The corpus was near-deduplicated at the paragraph level, normalised via transliteration into the Latin script, and morphosyntactically annotated, lemmatised and dependency-parsed with a prototype version of the classla pipeline (https://pypi.org/project/classla/). Each document is accompanied by URL and title metadata. The corpus is available in CoNLL-U format and as a vertical file (with the registry included) for mounting on CQP-compatible concordancers.
dc.language.iso cnr
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://arxiv.org/abs/2104.09243
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.si/info/k-centre/
dc.subject web corpus
dc.title Montenegrin web corpus meWaC 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info 321573 texts
size.info 3654071 sentences
size.info 90871077 tokens
files.count 2
files.size 2650200686
featuredService.kontext search|https://www.clarin.si/kontext/first_form?corpname=mewac
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=mewac&struct_attr_stats=1
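
The description above states that the corpus is distributed in CoNLL-U format. As a minimal sketch of how such a file could be read, the following groups the ten tab-separated CoNLL-U columns into sentences. The function name read_conllu and the two-word sample are illustrative assumptions and are not taken from meWaC; the comment keys actually present in the corpus files may differ.

```python
import io

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in an iterable of lines.

    Hypothetical helper for illustration: comment lines start with '#',
    token lines have ten tab-separated columns, and a blank line ends
    a sentence.
    """
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) == len(CONLLU_FIELDS):
                tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    # Handle a final sentence that is not followed by a blank line.
    if tokens:
        yield comments, tokens


# Invented sample, NOT an excerpt from the corpus.
SAMPLE = (
    "# sent_id = demo.1\n"
    "# text = Dobar dan.\n"
    "1\tDobar\tdobar\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tdan\tdan\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
for comments, tokens in sentences:
    print([(t["form"], t["lemma"], t["upos"]) for t in tokens])
```

Because the uncompressed file is several gigabytes, passing an open file handle rather than a loaded string keeps memory use flat: the generator holds only one sentence at a time.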


 Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name meWaC.conllu.zip
Size 1.16 GB
Format application/zip
Description Corpus in CoNLL-U format
MD5 5acfd8433934eca65bfc1276ae48c34a
Contents
    • meWaC.conllu (6 GB)

Name meWaC.vert.zip
Size 1.31 GB
Format application/zip
Description Corpus in vertical format
MD5 ff174cecd130c51426e714a74a1d5e17
Contents
    • meWaC.vert (11 GB)
    • mewac.regi (3 kB)
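
The MD5 values listed for the two archives can be used to check a download for corruption. The following is a minimal sketch, assuming the archives were saved under their original names; the helper names md5_of_file and verify are illustrative and not part of any repository tooling.

```python
import hashlib
import os

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "meWaC.conllu.zip": "5acfd8433934eca65bfc1276ae48c34a",
    "meWaC.vert.zip": "ff174cecd130c51426e714a74a1d5e17",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Compare a downloaded archive against the checksum listed for its name."""
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

For example, verify("meWaC.conllu.zip") would return True for an intact copy in the current directory (the location is an assumption). Reading in chunks avoids loading an archive of more than 1 GB into memory at once.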
