| dc.contributor.author | Kuzman Pungeršek, Taja |
| dc.contributor.author | Rupnik, Peter |
| dc.contributor.author | Ljubešić, Nikola |
| dc.date.accessioned | 2026-01-27T11:06:09Z |
| dc.date.available | 2026-01-27T11:06:09Z |
| dc.date.issued | 2026-01-27 |
| dc.identifier.uri | http://hdl.handle.net/11356/2079 |
| dc.description | The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian. This second major CLASSLA-web release follows the methodology of the CLASSLA-web 1.0 corpus collection while providing more recent texts and additional annotation layers, including automatic topic annotation alongside genre classification. The collection comprises approximately 17 billion words across 38 million texts: 2.31B words in the Slovenian corpus, 3.01B in the Croatian corpus, 1.01B in the Bosnian corpus, 294M in the Montenegrin corpus, 3.71B in the Serbian corpus, 691M in the Macedonian corpus, and 5.99B words in the Bulgarian corpus. Detailed size statistics for each corpus are provided in the accompanying README file. Each corpus in the CLASSLA-web 2.0 collection is based on dedicated web crawls of the corresponding national top-level domains (TLDs) and connected general domains (e.g. .com), namely, .si for Slovenian, .hr for Croatian, .ba for Bosnian, .me for Montenegrin, .rs and .срб for Serbian, .mk and .мкд for Macedonian, and .bg and .бг for Bulgarian. All texts were collected in 2024. The corpora are linguistically annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). Linguistic processing included tokenization, morphosyntactic annotation, and lemmatization. Each corpus was further automatically annotated with genre labels using the X-GENRE classifier (http://doi.org/10.57967/hf/0927) and with topic labels using the IPTC news topic classifier (http://doi.org/10.57967/hf/4709). Additional details on corpus construction are available at https://clarinsi.github.io/classla-web/. The CLASSLA-web 2.0 corpora are distributed in two complementary formats. 
In the JSONL format, each web document is represented on a single line containing a complete JSON object with document-level metadata and full text, enabling efficient line-by-line processing of large datasets. This format is primarily intended for downloading, filtering, and offline processing. Two JSONL files are provided for each corpus, with the suffixes .jsonl and .anno.jsonl. The two files are otherwise identical, but the .anno.jsonl version additionally includes the linguistically annotated text in CoNLL-U format. The second format is the so-called vertical format (VERT): a vertically tokenized, XML-like representation that integrates document-, paragraph-, sentence-, and token-level information together with linguistic annotation, and can be used by the (no)Sketch Engine and CWB concordancers. The document-level metadata provided in both formats include document ID, title, URL, domain, top-level domain (tld), language, script (Latin or Cyrillic, applicable to the Bosnian, Croatian, Montenegrin, and Serbian corpora), year of crawling, and predicted genre and topic categories. Further details on metadata attributes and formats are provided in the accompanying README file. Compared to CLASSLA-web 1.0 (collected in 2021–2022), the new release provides a substantially larger and more recent snapshot of web content, with only about 20 percent textual overlap between the two versions. The new release additionally includes topic annotations alongside genre labels and is distributed in the widely used JSONL and VERT formats. The CLASSLA-web 1.0 corpora were published as separate entries, namely Bosnian (https://hdl.handle.net/11356/1927), Bulgarian (https://hdl.handle.net/11356/1928), Croatian (https://hdl.handle.net/11356/1929), Macedonian (https://hdl.handle.net/11356/1932), Montenegrin (https://hdl.handle.net/11356/1930), Serbian (https://hdl.handle.net/11356/1931), and Slovenian (https://hdl.handle.net/11356/1882). 
Notice and take down: Should you consider that our data contains material that is owned by you and should not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. |
| dc.language.iso | bos |
| dc.language.iso | bul |
| dc.language.iso | hrv |
| dc.language.iso | mkd |
| dc.language.iso | cnr |
| dc.language.iso | srp |
| dc.language.iso | slv |
| dc.publisher | Jožef Stefan Institute |
| dc.relation.isreferencedby | https://doi.org/10.48550/arXiv.2601.11170 |
| dc.rights | CC0-No Rights Reserved |
| dc.rights.uri | https://creativecommons.org/publicdomain/zero/1.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://clarinsi.github.io/classla-web/ |
| dc.subject | web corpus |
| dc.subject | automatic genre identification |
| dc.subject | genre corpus |
| dc.subject | web crawling |
| dc.subject | web |
| dc.subject | topic classification |
| dc.subject | topic |
| dc.title | South Slavic web corpus collection CLASSLA-web 2.0 |
| dc.type | corpus |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Taja Kuzman Pungeršek taja.kuzman@ijs.si Jožef Stefan Institute |
| sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| sponsor | ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds |
| size.info | 17010802368 words |
| size.info | 38057171 texts |
| files.count | 22 |
| files.size | 488068698480 |
| featuredService.noske | Bosnian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bs |
| featuredService.noske | Bulgarian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bg |
| featuredService.noske | Croatian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_hr |
| featuredService.noske | Macedonian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_mk |
| featuredService.noske | Montenegrin|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_cnr |
| featuredService.noske | Serbian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sr |
| featuredService.noske | Slovenian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sl |
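The description above notes that the JSONL files hold one JSON object per line and are meant for line-by-line filtering. A minimal Python sketch of that access pattern follows. It assumes the metadata field to filter on is a top-level key of each JSON object; the key name passed in (for example `"genre"`) is a placeholder, since the actual attribute names are defined in README.md and are not listed in this record. The function names `iter_documents` and `filter_documents` are illustrative only.

```python
import gzip
import json
from typing import Iterator, Optional


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # Text mode ("rt") decompresses lazily, so the file is never fully in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def filter_documents(
    path: str, key: str, value: str, limit: Optional[int] = None
) -> Iterator[dict]:
    """Yield documents whose top-level field `key` equals `value`.

    `key` is a placeholder: check README.md for the real attribute names.
    """
    found = 0
    for doc in iter_documents(path):
        if doc.get(key) == value:
            yield doc
            found += 1
            # Stop early once enough matches were collected.
            if limit is not None and found >= limit:
                return
```

With the smallest plain file from this item, a call might look like `filter_documents("CLASSLA-web.cnr.2.0.jsonl.gz", "genre", "News", limit=10)`, where both the key and the label value are assumptions to be checked against the README. Streaming matters here because the uncompressed files range from about 2 GB to over 670 GB.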
Files in this item
| Name | Size | Format | Description | MD5 |
| README.md | 40.77 KB | Unknown | Documentation on corpora size, format and content | efe94358cbd6e33b4152b549d13b9cac |
| CLASSLA-web.bg.2.0.anno.jsonl.gz | 86.63 GB | application/gzip | JSONL file with linguistic annotation (672.85 GB uncompressed) | 03b4664a3c8403c2fd955561de535329 |
| CLASSLA-web.bg.2.0.jsonl.gz | 18.08 GB | application/gzip | JSONL file (68.09 GB uncompressed) | cfba393ca4503831f7b956a18f43f411 |
| CLASSLA-web.bg.2.0.vert.tar.gz | 56.92 GB | application/gzip | VERT file (445.06 GB uncompressed). Archive contents: CLASSLA-web.bg.2.0.vert; CLASSLA-web.bg.2.0.registry (2 kB); CLASSLA-web.bg.2.0.vert (445 GB) | 63bf5ecb1007315c121aea1db345cad2 |
| CLASSLA-web.bs.2.0.anno.jsonl.gz | 16.09 GB | application/gzip | JSONL file with linguistic annotation (103.26 GB uncompressed) | f6bd4f43c6ab184439de6b6cadb85fc9 |
| CLASSLA-web.bs.2.0.jsonl.gz | 2.63 GB | application/gzip | JSONL file (7.02 GB uncompressed) | f92e72f4fe08362a1297f311ac20ad33 |
| CLASSLA-web.bs.2.0.vert.tar.gz | 9.04 GB | application/gzip | VERT file (64.87 GB uncompressed). Archive contents: CLASSLA-web.bs.2.0.vert; CLASSLA-web.bs.2.0.registry (2 kB); CLASSLA-web.bs.2.0.vert (64 GB) | 8aa9734b5902e542f03969e0dc188570 |
| CLASSLA-web.cnr.2.0.anno.jsonl.gz | 4.72 GB | application/gzip | JSONL file with linguistic annotation (30.30 GB uncompressed) | f705697eb15ace8b24756a5666386890 |
| CLASSLA-web.cnr.2.0.jsonl.gz | 802.65 MB | application/gzip | JSONL file (2.10 GB uncompressed) | fdcf171ea122710559840f75d841b923 |
| CLASSLA-web.cnr.2.0.vert.tar.gz | 2.65 GB | application/gzip | VERT file (19.07 GB uncompressed). Archive contents: CLASSLA-web.cnr.2.0.vert; CLASSLA-web.cnr.2.0.vert (19 GB); CLASSLA-web.cnr.2.0.registry (2 kB) | a0717d03aa86ba09236712bf547304b5 |
| CLASSLA-web.hr.2.0.anno.jsonl.gz | 48.32 GB | application/gzip | JSONL file with linguistic annotation (306.42 GB uncompressed) | e276b57005d0d419fd727b38db303ee0 |
| CLASSLA-web.hr.2.0.jsonl.gz | 7.87 GB | application/gzip | JSONL file (20.67 GB uncompressed) | 6d2af5fe8e3448de5e2598652048be00 |
| CLASSLA-web.hr.2.0.vert.tar.gz | 26.77 GB | application/gzip | VERT file (192.32 GB uncompressed). Archive contents: CLASSLA-web.hr.2.0.vert; CLASSLA-web.hr.2.0.registry (2 kB); CLASSLA-web.hr.2.0.vert (192 GB) | b1456e27ffe6c9267f53bf634dfb78f6 |
| CLASSLA-web.mk.2.0.anno.jsonl.gz | 9.15 GB | application/gzip | JSONL file with linguistic annotation (77.39 GB uncompressed) | 00d09c3e21237329c980141f2afa1582 |
| CLASSLA-web.mk.2.0.jsonl.gz | 2.01 GB | application/gzip | JSONL file (8.17 GB uncompressed) | 16f7cbd13503723684556984549506c6 |
| CLASSLA-web.mk.2.0.vert.tar.gz | 6.17 GB | application/gzip | VERT file (51.15 GB uncompressed). Archive contents: CLASSLA-web.mk.2.0.vert; CLASSLA-web.mk.2.0.vert (51 GB); CLASSLA-web.mk.2.0.registry (2 kB) | c8332aa148efb664e3d97dcf46750e9c |
| CLASSLA-web.sl.2.0.anno.jsonl.gz | 37.49 GB | application/gzip | JSONL file with linguistic annotation (239.04 GB uncompressed) | 97d6133551f47c14e36228c3c82a0983 |
| CLASSLA-web.sl.2.0.jsonl.gz | 5.89 GB | application/gzip | JSONL file (15.62 GB uncompressed) | ffbdc907e3fb96d20befb93e1b2ec0b2 |
| CLASSLA-web.sl.2.0.vert.tar.gz | 20.64 GB | application/gzip | VERT file (148.49 GB uncompressed). Archive contents: CLASSLA-web.sl.2.0.vert; CLASSLA-web.sl.2.0.vert (148 GB); CLASSLA-web.sl.2.0.registry (2 kB) | 7be8fa7621d088c2bb69f08e11288eb4 |
| CLASSLA-web.sr.2.0.anno.jsonl.gz | 50.08 GB | application/gzip | JSONL file with linguistic annotation (346.54 GB uncompressed) | d03c12e6f940413962efac58da3a107f |
| CLASSLA-web.sr.2.0.jsonl.gz | 9.66 GB | application/gzip | JSONL file (24.71 GB uncompressed) | b794cac545b9f1ba28166f45a1f920eb |
| CLASSLA-web.sr.2.0.vert.tar.gz | 32.94 GB | application/gzip | VERT file (236.24 GB uncompressed). Archive contents: CLASSLA-web.sr.2.0.vert; CLASSLA-web.sr.2.0.registry (2 kB); CLASSLA-web.sr.2.0.vert (236 GB) | 733e77a3a12f6469e0c4eb53e4269b90 |
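Each file in this item is listed with an MD5 checksum, which can be used to confirm that a multi-gigabyte download completed intact. The following is a minimal Python sketch using only the standard library; the function names `md5_of_file` and `verify_md5` and the 1 MiB chunk size are illustrative choices, not part of the resource.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF),
        # so memory use stays at one chunk regardless of file size.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with the value published in the file listing."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `verify_md5("README.md", "efe94358cbd6e33b4152b549d13b9cac")` should return `True` for an unmodified copy of the README from this item, assuming it was saved under that name in the working directory.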