dc.contributor.author Ljubešić, Nikola
dc.contributor.author Markoski, Filip
dc.contributor.author Markoska, Elena
dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2021-05-12T13:50:22Z
dc.date.available 2021-05-12T13:50:22Z
dc.date.issued 2021-05-05
dc.identifier.uri http://hdl.handle.net/11356/1427
dc.description This comparable corpus collection consists of Wikipedia dumps of the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedias, harvested on October 17th, 2020. The text was extracted from the dumps with the process documented at https://github.com/clarinsi/classla-wikipedia. Linguistic annotation was performed with the classla package (https://pypi.org/project/classla/) on all levels available for each language; the Bosnian and Serbo-Croatian Wikipedias were processed with the standard Croatian models.
dc.language.iso bos
dc.language.iso hrv
dc.language.iso mkd
dc.language.iso bul
dc.language.iso srp
dc.language.iso hbs
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://aclanthology.org/2021.ranlp-1.104.pdf
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/clarinsi/classla-wikipedia
dc.subject comparable corpus
dc.subject Wikipedia
dc.title Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
sponsor European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other
size.info 1928450 articles
size.info 37677016 sentences
size.info 486258862 tokens
files.count 7
files.size 5409270773
featuredService.kontext Bulgarian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_bg
featuredService.kontext Bosnian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_bs
featuredService.kontext Croatian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_hr
featuredService.kontext Macedonian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_mk
featuredService.kontext Serbo-Croatian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sh
featuredService.kontext Slovenian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sl
featuredService.kontext Serbian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sr
featuredService.noske Bulgarian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_bg&struct_attr_stats=1
featuredService.noske Bosnian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_bs&struct_attr_stats=1&subcorpora=1
featuredService.noske Croatian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_hr&struct_attr_stats=1&subcorpora=1
featuredService.noske Macedonian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_mk&struct_attr_stats=1&subcorpora=1
featuredService.noske Serbo-Croatian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sh&struct_attr_stats=1&subcorpora=1
featuredService.noske Slovenian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sl&struct_attr_stats=1&subcorpora=1
featuredService.noske Serbian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sr&struct_attr_stats=1&subcorpora=1


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name                      Size       Format            Description               MD5
classlawiki-bg.conllu.gz  1.09 GB    application/gzip  Bulgarian Wikipedia       ed4f11056abc4f265e2ba995961683e0
classlawiki-bs.conllu.gz  258.13 MB  application/gzip  Bosnian Wikipedia         3f82c98d4ff85a1c41eade4ba194c585
classlawiki-hr.conllu.gz  745.75 MB  application/gzip  Croatian Wikipedia        a60256fccf9203845ab27565e0ee7362
classlawiki-mk.conllu.gz  422.35 MB  application/gzip  Macedonian Wikipedia      76320c89294557ef8115911ed1cab18f
classlawiki-sh.conllu.gz  753.29 MB  application/gzip  Serbo-Croatian Wikipedia  cb33f6f5f16ac36c97279b429d0c23e3
classlawiki-sl.conllu.gz  620.25 MB  application/gzip  Slovenian Wikipedia       43b88ff0d5d6b9b11d6a49e933819243
classlawiki-sr.conllu.gz  1.22 GB    application/gzip  Serbian Wikipedia         4229eb9ac932e95498ad9e33f0453c5a
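
The MD5 checksums listed for the files can be used to check a download, and the gzipped CoNLL-U files can be read as a stream without unpacking them first. The following is a minimal sketch using only the Python standard library. The function names md5_of_file and count_conllu are hypothetical helpers introduced here for illustration and are not part of the classla package or the corpus tooling. The token count assumes that multiword-token ranges (IDs such as 1-2) and empty nodes (IDs such as 1.1) are skipped, which may differ from how the size.info figures in this record were computed.

```python
import gzip
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_conllu(path):
    """Count sentences and tokens in a gzipped CoNLL-U file.

    Assumption: sentences are separated by blank lines, comment lines
    start with '#', and only plain integer IDs count as tokens.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                # A blank line closes the current sentence.
                in_sentence = False
            elif line.startswith("#"):
                # Sentence-level comments carry no tokens.
                continue
            else:
                if not in_sentence:
                    sentences += 1
                    in_sentence = True
                token_id = line.split("\t", 1)[0]
                # Skip multiword ranges (1-2) and empty nodes (1.1).
                if token_id.isdigit():
                    tokens += 1
    return sentences, tokens
```

For example, after downloading classlawiki-bs.conllu.gz to the working directory, md5_of_file("classlawiki-bs.conllu.gz") should equal the MD5 value listed for that file if the download is intact, and count_conllu("classlawiki-bs.conllu.gz") returns a (sentences, tokens) pair for the Bosnian Wikipedia.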
