dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Markoski, Filip |
dc.contributor.author | Markoska, Elena |
dc.contributor.author | Erjavec, Tomaž |
dc.date.accessioned | 2021-05-12T13:50:22Z |
dc.date.available | 2021-05-12T13:50:22Z |
dc.date.issued | 2021-05-05 |
dc.identifier.uri | http://hdl.handle.net/11356/1427 |
dc.description | This comparable corpus collection consists of Wikipedia dumps of the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedias, harvested on October 17th, 2020. The text was extracted from the dumps with the process documented at https://github.com/clarinsi/classla-wikipedia. Linguistic annotation was performed with the classla package (https://pypi.org/project/classla/) on all levels available for each language; the Bosnian and Serbo-Croatian Wikipedias were processed with the standard Croatian models. |
dc.language.iso | bos |
dc.language.iso | hrv |
dc.language.iso | mkd |
dc.language.iso | bul |
dc.language.iso | srp |
dc.language.iso | hbs |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://aclanthology.org/2021.ranlp-1.104.pdf |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/clarinsi/classla-wikipedia |
dc.subject | comparable corpus |
dc.subject | Wikipedia |
dc.title | Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
sponsor | European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other |
size.info | 1928450 articles |
size.info | 37677016 sentences |
size.info | 486258862 tokens |
files.count | 7 |
files.size | 5409270773 |
featuredService.kontext | Bulgarian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_bg |
featuredService.kontext | Bosnian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_bs |
featuredService.kontext | Croatian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_hr |
featuredService.kontext | Macedonian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_mk |
featuredService.kontext | Serbo-Croatian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sh |
featuredService.kontext | Slovenian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sl |
featuredService.kontext | Serbian Wikipedia|https://www.clarin.si/kontext/first_form?corpname=classlawiki_sr |
featuredService.noske | Bulgarian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_bg |
featuredService.noske | Bosnian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_bs |
featuredService.noske | Croatian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_hr |
featuredService.noske | Macedonian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_mk |
featuredService.noske | Serbo-Croatian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sh |
featuredService.noske | Slovenian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sl |
featuredService.noske | Serbian Wikipedia|https://www.clarin.si/ske/#dashboard?corpname=classlawiki_sr |
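
The corpora are distributed as gzipped CoNLL-U files, a format in which each token occupies one line of ten tab-separated columns, comment lines start with `#`, and a blank line closes a sentence. The following is a minimal sketch, not the authors' tooling, of how such a file could be streamed sentence by sentence without decompressing it to disk. The function names (`read_conllu`, `count_units`, `count_units_in_file`) and the sample text are illustrative assumptions; the counts it produces are not guaranteed to match the size.info figures, which may have been computed differently.

```python
import gzip
import io
from typing import Iterator, List, TextIO, Tuple


def read_conllu(stream: TextIO) -> Iterator[List[List[str]]]:
    """Yield one sentence at a time as a list of tab-split token rows."""
    sentence: List[List[str]] = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines (sentence ids, raw text, ...) are skipped here.
            continue
        else:
            sentence.append(line.split("\t"))
    if sentence:
        # The last sentence may lack a trailing blank line.
        yield sentence


def count_units(stream: TextIO) -> Tuple[int, int]:
    """Return (sentences, tokens) for a CoNLL-U text stream."""
    sentences = tokens = 0
    for sent in read_conllu(stream):
        sentences += 1
        # Count only plain integer ids; ranges such as "1-2" and
        # empty nodes such as "1.1" are left out.
        tokens += sum(1 for row in sent if row[0].isdigit())
    return sentences, tokens


def count_units_in_file(path: str) -> Tuple[int, int]:
    """Same as count_units, but for a .conllu.gz file on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        return count_units(stream)


# Tiny hand-made sample (not taken from the corpus) for a quick check.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tTo\t_\tDET\t_\t_\t_\t_\t_\t_\n"
    "2\tje\t_\tAUX\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tWikipedia\t_\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
print(count_units(io.StringIO(SAMPLE)))
```

For a downloaded file, a call such as `count_units_in_file("classlawiki-bs.conllu.gz")` would stream the archive directly; the path is whatever location the file was saved to.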
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

- Name
- classlawiki-bg.conllu.gz
- Size
- 1.09 GB
- Format
- application/gzip
- Description
- Bulgarian Wikipedia
- MD5
- ed4f11056abc4f265e2ba995961683e0

- Name
- classlawiki-bs.conllu.gz
- Size
- 258.13 MB
- Format
- application/gzip
- Description
- Bosnian Wikipedia
- MD5
- 3f82c98d4ff85a1c41eade4ba194c585

- Name
- classlawiki-hr.conllu.gz
- Size
- 745.75 MB
- Format
- application/gzip
- Description
- Croatian Wikipedia
- MD5
- a60256fccf9203845ab27565e0ee7362

- Name
- classlawiki-mk.conllu.gz
- Size
- 422.35 MB
- Format
- application/gzip
- Description
- Macedonian Wikipedia
- MD5
- 76320c89294557ef8115911ed1cab18f

- Name
- classlawiki-sh.conllu.gz
- Size
- 753.29 MB
- Format
- application/gzip
- Description
- Serbo-Croatian Wikipedia
- MD5
- cb33f6f5f16ac36c97279b429d0c23e3

- Name
- classlawiki-sl.conllu.gz
- Size
- 620.25 MB
- Format
- application/gzip
- Description
- Slovenian Wikipedia
- MD5
- 43b88ff0d5d6b9b11d6a49e933819243

- Name
- classlawiki-sr.conllu.gz
- Size
- 1.22 GB
- Format
- application/gzip
- Description
- Serbian Wikipedia
- MD5
- 4229eb9ac932e95498ad9e33f0453c5a
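
Each file above is listed with an MD5 checksum, which can be used to confirm that a download completed without corruption. The sketch below shows one way this check might be done with Python's standard `hashlib`; the helper names (`md5_of_stream`, `verify_download`) are illustrative assumptions, and the checksum table is copied from the listing above. It assumes the downloaded files keep their original names.

```python
import hashlib
import os
from typing import BinaryIO

# MD5 checksums as given in the file listing of this record.
EXPECTED_MD5 = {
    "classlawiki-bg.conllu.gz": "ed4f11056abc4f265e2ba995961683e0",
    "classlawiki-bs.conllu.gz": "3f82c98d4ff85a1c41eade4ba194c585",
    "classlawiki-hr.conllu.gz": "a60256fccf9203845ab27565e0ee7362",
    "classlawiki-mk.conllu.gz": "76320c89294557ef8115911ed1cab18f",
    "classlawiki-sh.conllu.gz": "cb33f6f5f16ac36c97279b429d0c23e3",
    "classlawiki-sl.conllu.gz": "43b88ff0d5d6b9b11d6a49e933819243",
    "classlawiki-sr.conllu.gz": "4229eb9ac932e95498ad9e33f0453c5a",
}


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str) -> bool:
    """Compare a local file's MD5 with the value listed for its file name.

    Raises KeyError if the file name is not one of the seven listed files.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    with open(path, "rb") as stream:
        return md5_of_stream(stream) == expected
```

A call such as `verify_download("classlawiki-sl.conllu.gz")` returns `True` when the local file matches the listed checksum. MD5 is adequate here for detecting transfer errors, though it should not be relied on as protection against deliberate tampering.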