South Slavic web corpus collection CLASSLA-web 2.0

Kuzman Pungeršek, Taja; Rupnik, Peter; Ljubešić, Nikola

dc.contributor.author	Kuzman Pungeršek, Taja
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2026-01-27T11:06:09Z
dc.date.available	2026-01-27T11:06:09Z
dc.date.issued	2026-01-27
dc.identifier.uri	http://hdl.handle.net/11356/2079
dc.description	The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian. This second major CLASSLA-web release follows the methodology of the CLASSLA-web 1.0 corpus collection while providing more recent texts and additional annotation layers, including automatic topic annotation alongside genre classification. The collection comprises approximately 17 billion words across 38 million texts: 2.31B words in the Slovenian corpus, 3.01B in the Croatian corpus, 1.01B in the Bosnian corpus, 294M in the Montenegrin corpus, 3.71B in the Serbian corpus, 691M in the Macedonian corpus, and 5.99B words in the Bulgarian corpus. Detailed size statistics for each corpus are provided in the accompanying README file. Each corpus in the CLASSLA-web 2.0 collection is based on dedicated web crawls of the corresponding national top-level domains (TLDs) and connected general domains (e.g. .com), namely, .si for Slovenian, .hr for Croatian, .ba for Bosnian, .me for Montenegrin, .rs and .срб for Serbian, .mk and .мкд for Macedonian, and .bg and .бг for Bulgarian. All texts were collected in 2024. The corpora are linguistically annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). Linguistic processing included tokenization, morphosyntactic annotation, and lemmatization. Each corpus was further automatically annotated with genre labels using the X-GENRE classifier (http://doi.org/10.57967/hf/0927) and with topic labels using the IPTC news topic classifier (http://doi.org/10.57967/hf/4709). Additional details on corpus construction are available at https://clarinsi.github.io/classla-web/. The CLASSLA-web 2.0 corpora are distributed in two complementary formats. 
In JSONL format, each web document is represented on a single line containing a complete JSON object with document-level metadata and full text, enabling efficient line-by-line processing of large datasets. This format is primarily intended for downloading, filtering, and offline processing. Two JSONL files are provided for each corpus, with the suffixes .jsonl and .anno.jsonl. The two files contain the same documents and metadata; the .anno.jsonl version additionally includes the linguistically annotated text in CoNLL-U format. The second format is the so-called vertical format (VERT): a vertically tokenized, XML-like representation that integrates document-, paragraph-, sentence-, and token-level information together with linguistic annotation, and can be used by (no)Sketch Engine and CWB concordancers. The provided document-level metadata in both formats include document ID, title, URL, domain, top-level domain (tld), language, script (Latin or Cyrillic, applicable to the Bosnian, Croatian, Montenegrin, and Serbian corpora), year of crawling, and predicted genre and topic categories. Further details on metadata attributes and formats are provided in the accompanying README file. In addition, compressed lists of full URLs for each web corpus are available, offering a concise overview of the corpora’s content. Compared to CLASSLA-web 1.0 (collected in 2021–2022), the new release provides a substantially larger and more recent snapshot of web content, with only about 20 percent textual overlap between the two versions. The new release additionally includes topic annotations alongside genre labels and is distributed in the widely used JSONL and VERT formats.
The CLASSLA-web 1.0 corpora were published as separate entries, namely Bosnian (https://hdl.handle.net/11356/1927), Bulgarian (https://hdl.handle.net/11356/1928), Croatian (https://hdl.handle.net/11356/1929), Macedonian (https://hdl.handle.net/11356/1932), Montenegrin (https://hdl.handle.net/11356/1930), Serbian (https://hdl.handle.net/11356/1931) and Slovenian (https://hdl.handle.net/11356/1882). Notice and take down: Should you consider that our data contains material that is owned by you and should not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
dc.language.iso	bos
dc.language.iso	bul
dc.language.iso	hrv
dc.language.iso	mkd
dc.language.iso	cnr
dc.language.iso	srp
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://doi.org/10.48550/arXiv.2601.11170
dc.rights	CC0-No Rights Reserved
dc.rights.uri	https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label	PUB
dc.source.uri	https://clarinsi.github.io/classla-web/
dc.subject	web corpus
dc.subject	automatic genre identification
dc.subject	genre corpus
dc.subject	web crawling
dc.subject	web
dc.subject	topic classification
dc.subject	topic
dc.title	South Slavic web corpus collection CLASSLA-web 2.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Taja Kuzman Pungeršek taja.kuzman@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
size.info	17010802368 words
size.info	38057171 texts
files.count	29
files.size	488844235541
featuredService.noske	Bosnian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bs
featuredService.noske	Bulgarian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bg
featuredService.noske	Croatian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_hr
featuredService.noske	Macedonian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_mk
featuredService.noske	Montenegrin|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_cnr
featuredService.noske	Serbian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sr
featuredService.noske	Slovenian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sl
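
The description above notes that each JSONL line holds one complete JSON object, which allows line-by-line processing without loading a whole corpus into memory. A minimal Python sketch of that idea follows. The function names are illustrative, and the metadata key names are not assumed here: they are defined in the accompanying README, so the filter is passed in as a caller-supplied predicate.

```python
import gzip
import json
from typing import Callable, Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one JSON object per line from a gzip-compressed JSONL file."""
    # Text mode with explicit UTF-8 so both Latin and Cyrillic text decode correctly.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


def filter_documents(path: str, keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Stream documents and yield only those for which keep(doc) is true."""
    for doc in iter_documents(path):
        if keep(doc):
            yield doc
```

For example, `filter_documents("CLASSLA-web.sl.2.0.jsonl.gz", lambda d: d.get("genre") == "News")` would stream only matching documents, where the key `"genre"` and the value `"News"` are hypothetical placeholders to be replaced with the attribute names and labels listed in the README.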

Files in this item

This item is Publicly Available and licensed under: CC0-No Rights Reserved

Name: README.md
Size: 40.77 KB
Format: Unknown
Description: Documentation on corpora size, format and content
MD5: efe94358cbd6e33b4152b549d13b9cac
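
Each file in this listing comes with an MD5 checksum, which can be used to confirm that a download completed without corruption. The following is a minimal Python sketch, assuming the file has already been downloaded to a local path. The function names are illustrative, and the file is read in chunks because the larger archives run to tens of gigabytes.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed_md5: str) -> bool:
    """Compare a file's MD5 digest with the checksum given in the listing."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

A call such as `matches_listed_md5("README.md", "efe94358cbd6e33b4152b549d13b9cac")` returns `True` only if the local copy matches the checksum listed for that file.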

Name: CLASSLA-web.bg.2.0.anno.jsonl.gz
Size: 86.63 GB
Format: application/gzip
Description: CLASSLA-web.bg 2.0 JSONL file with linguistic annotation (672.85 GB uncompressed)
MD5: 03b4664a3c8403c2fd955561de535329
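
The .anno.jsonl files carry the linguistic annotation as CoNLL-U text. As a hedged illustration of working with that layer, the sketch below parses a CoNLL-U string into sentences of (form, lemma, UPOS) tuples, relying only on the standard ten-column CoNLL-U layout. The function name is illustrative, and the JSON key under which the CoNLL-U string is stored is not assumed here; it is documented in the README.

```python
from typing import Iterator, List, Tuple


def iter_sentences(conllu: str) -> Iterator[List[Tuple[str, str, str]]]:
    """Yield each sentence as a list of (form, lemma, upos) tuples."""
    sentence: List[Tuple[str, str, str]] = []
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# sent_id = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence
```

Collecting the second element of each tuple gives a lemma sequence per sentence, which is often enough for frequency counts without a full dependency parser library.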

Name: CLASSLA-web.bg.2.0.jsonl.gz
Size: 18.08 GB
Format: application/gzip
Description: CLASSLA-web.bg 2.0 JSONL file (68.09 GB uncompressed)
MD5: cfba393ca4503831f7b956a18f43f411

Name: CLASSLA-web.bg.2.0.vert.tar.gz
Size: 56.92 GB
Format: application/gzip
Description: CLASSLA-web.bg 2.0 VERT file (445.06 GB uncompressed)
MD5: 63bf5ecb1007315c121aea1db345cad2

Archive contents (CLASSLA-web.bg.2.0.vert):
- CLASSLA-web.bg.2.0.registry (2 kB)
- CLASSLA-web.bg.2.0.vert (445 GB)
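
The VERT files are described as a vertically tokenized, XML-like representation. As a rough sketch of how such a file could be scanned without a concordancer, the code below assumes the usual vertical-format convention: structural tags sit on their own lines, and every other non-empty line is one token with tab-separated attributes. That convention, the function name, and the heuristic for recognizing tag lines are assumptions; the actual structure and attribute names are defined in the README and the registry file.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_vert(lines: Iterable[str]) -> Tuple[int, Counter]:
    """Return (token_count, opening_tag_counts) for a vertical-format stream."""
    tokens = 0
    tags: Counter = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Heuristic: a line wrapped in angle brackets is a structural tag.
        if line.startswith("<") and line.endswith(">"):
            if line.startswith("</"):
                continue  # closing tag, already counted at its opening
            parts = line[1:-1].split()
            if not parts:
                tokens += 1  # a bare "<>" is treated as a token
                continue
            tags[parts[0].rstrip("/")] += 1  # "g/" from "<g/>" becomes "g"
        else:
            tokens += 1
    return tokens, tags
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, so a file of several hundred gigabytes is processed as a stream rather than read into memory.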

Name: CLASSLA-web.bg.2.0.urls.zip
Size: 279.1 MB
Format: application/zip
Description: List of full URLs from CLASSLA-web.bg 2.0
MD5: 7cb8a0148b744b5bfbcb448cf4a88300

Archive contents:
- CLASSLA-web.bg.2.0.urls.json (1 GB)

Name: CLASSLA-web.bs.2.0.anno.jsonl.gz
Size: 16.09 GB
Format: application/gzip
Description: CLASSLA-web.bs 2.0 JSONL file with linguistic annotation (103.26 GB uncompressed)
MD5: f6bd4f43c6ab184439de6b6cadb85fc9

Name: CLASSLA-web.bs.2.0.jsonl.gz
Size: 2.63 GB
Format: application/gzip
Description: CLASSLA-web.bs 2.0 JSONL file (7.02 GB uncompressed)
MD5: f92e72f4fe08362a1297f311ac20ad33

Name: CLASSLA-web.bs.2.0.vert.tar.gz
Size: 9.04 GB
Format: application/gzip
Description: CLASSLA-web.bs 2.0 VERT file (64.87 GB uncompressed)
MD5: 8aa9734b5902e542f03969e0dc188570

Archive contents (CLASSLA-web.bs.2.0.vert):
- CLASSLA-web.bs.2.0.registry (2 kB)
- CLASSLA-web.bs.2.0.vert (64 GB)

Name: CLASSLA-web.bs.2.0.urls.zip
Size: 54.45 MB
Format: application/zip
Description: List of full URLs from CLASSLA-web.bs 2.0
MD5: 4ed52f29cd837607525ddf597d45f495

Archive contents:
- CLASSLA-web.bs.2.0.urls.json (226 MB)

Name: CLASSLA-web.cnr.2.0.anno.jsonl.gz
Size: 4.72 GB
Format: application/gzip
Description: CLASSLA-web.cnr 2.0 JSONL file with linguistic annotation (30.30 GB uncompressed)
MD5: f705697eb15ace8b24756a5666386890

Name: CLASSLA-web.cnr.2.0.jsonl.gz
Size: 802.65 MB
Format: application/gzip
Description: CLASSLA-web.cnr 2.0 JSONL file (2.10 GB uncompressed)
MD5: fdcf171ea122710559840f75d841b923

Name: CLASSLA-web.cnr.2.0.vert.tar.gz
Size: 2.65 GB
Format: application/gzip
Description: CLASSLA-web.cnr 2.0 VERT file (19.07 GB uncompressed)
MD5: a0717d03aa86ba09236712bf547304b5

Archive contents (CLASSLA-web.cnr.2.0.vert):
- CLASSLA-web.cnr.2.0.vert (19 GB)
- CLASSLA-web.cnr.2.0.registry (2 kB)

Name: CLASSLA-web.cnr.2.0.urls.zip
Size: 17.36 MB
Format: application/zip
Description: List of full URLs from CLASSLA-web.cnr 2.0
MD5: 9166d0caa7468f786bb7d8b890cc2a76

Archive contents:
- CLASSLA-web.cnr.2.0.urls.json (73 MB)

Name: CLASSLA-web.hr.2.0.anno.jsonl.gz
Size: 48.32 GB
Format: application/gzip
Description: CLASSLA-web.hr 2.0 JSONL file with linguistic annotation (306.42 GB uncompressed)
MD5: e276b57005d0d419fd727b38db303ee0

Name: CLASSLA-web.hr.2.0.jsonl.gz
Size: 7.87 GB
Format: application/gzip
Description: CLASSLA-web.hr 2.0 JSONL file (20.67 GB uncompressed)
MD5: 6d2af5fe8e3448de5e2598652048be00

Name: CLASSLA-web.hr.2.0.vert.tar.gz
Size: 26.77 GB
Format: application/gzip
Description: CLASSLA-web.hr 2.0 VERT file (192.32 GB uncompressed)
MD5: b1456e27ffe6c9267f53bf634dfb78f6

Archive contents (CLASSLA-web.hr.2.0.vert):
- CLASSLA-web.hr.2.0.registry (2 kB)
- CLASSLA-web.hr.2.0.vert (192 GB)

Name: CLASSLA-web.hr.2.0.urls.zip
Size: 115.44 MB
Format: application/zip
Description: List of full URLs from CLASSLA-web.hr 2.0
MD5: 66eaa380f124e837fca7e5f4ebc5ce04

Archive contents:
- CLASSLA-web.hr.2.0.urls.json (530 MB)

Name: CLASSLA-web.mk.2.0.anno.jsonl.gz
Size: 9.15 GB
Format: application/gzip
Description: CLASSLA-web.mk 2.0 JSONL file with linguistic annotation (77.39 GB uncompressed)
MD5: 00d09c3e21237329c980141f2afa1582

Name: CLASSLA-web.mk.2.0.jsonl.gz
Size: 2.01 GB
Format: application/gzip
Description: CLASSLA-web.mk 2.0 JSONL file (8.17 GB uncompressed)
MD5: 16f7cbd13503723684556984549506c6

Name: CLASSLA-web.mk.2.0.vert.tar.gz
Size: 6.17 GB
Format: application/gzip
Description: CLASSLA-web.mk 2.0 VERT file (51.15 GB uncompressed)
MD5: c8332aa148efb664e3d97dcf46750e9c

Archive contents (CLASSLA-web.mk.2.0.vert):
- CLASSLA-web.mk.2.0.vert (51 GB)
- CLASSLA-web.mk.2.0.registry (2 kB)

Name: CLASSLA-web.mk.2.0.urls.zip
Size: 49.53 MB
Format: application/zip
Description: List of full URLs from CLASSLA-web.mk 2.0
MD5: 14fd8d6b42b409b51634ff79fd73423d

Archive contents:
- CLASSLA-web.mk.2.0.urls.json (238 MB)

Name: CLASSLA-web.sl.2.0.anno.jsonl.gz
Size: 37.49 GB
Format: application/gzip
Description: CLASSLA-web.sl 2.0 JSONL file with linguistic annotation (239.04 GB uncompressed)
MD5: 97d6133551f47c14e36228c3c82a0983

Name: CLASSLA-web.sl.2.0.jsonl.gz
Size: 5.89 GB
Format: application/gzip
Description: CLASSLA-web.sl 2.0 JSONL file (15.62 GB uncompressed)
MD5: ffbdc907e3fb96d20befb93e1b2ec0b2

Name: CLASSLA-web.sl.2.0.vert.tar.gz
Size: 20.64 GB
Format: application/gzip
Description: CLASSLA-web.sl 2.0 VERT file (148.49 GB uncompressed)
MD5: 7be8fa7621d088c2bb69f08e11288eb4

Archive contents (CLASSLA-web.sl.2.0.vert):
- CLASSLA-web.sl.2.0.vert (148 GB)
- CLASSLA-web.sl.2.0.registry (2 kB)

Name: CLASSLA-web.sl.2.0.urls.zip
Size: 78.87 MB
Format: application/zip
Description: List of full URLs from CLASSLA-web.sl 2.0
MD5: 089d17863ef4431de4204e0ec1aacf60

Archive contents:
- CLASSLA-web.sl.2.0.urls.json (414 MB)

Name: CLASSLA-web.sr.2.0.anno.jsonl.gz
Size: 50.08 GB
Format: application/gzip
Description: CLASSLA-web.sr 2.0 JSONL file with linguistic annotation (346.54 GB uncompressed)
MD5: d03c12e6f940413962efac58da3a107f
