dc.contributor.author Kuzman Pungeršek, Taja
dc.contributor.author Rupnik, Peter
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2026-01-27T11:06:09Z
dc.date.available 2026-01-27T11:06:09Z
dc.date.issued 2026-01-27
dc.identifier.uri http://hdl.handle.net/11356/2079
dc.description The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian. This second major CLASSLA-web release follows the methodology of the CLASSLA-web 1.0 corpus collection while providing more recent texts and additional annotation layers, including automatic topic annotation alongside genre classification. The collection comprises approximately 17 billion words across 38 million texts: 2.31B words in the Slovenian corpus, 3.01B in the Croatian corpus, 1.01B in the Bosnian corpus, 294M in the Montenegrin corpus, 3.71B in the Serbian corpus, 691M in the Macedonian corpus, and 5.99B words in the Bulgarian corpus. Detailed size statistics for each corpus are provided in the accompanying README file. Each corpus in the CLASSLA-web 2.0 collection is based on dedicated web crawls of the corresponding national top-level domains (TLDs) and connected general domains (e.g. .com), namely, .si for Slovenian, .hr for Croatian, .ba for Bosnian, .me for Montenegrin, .rs and .срб for Serbian, .mk and .мкд for Macedonian, and .bg and .бг for Bulgarian. All texts were collected in 2024. The corpora are linguistically annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). Linguistic processing included tokenization, morphosyntactic annotation, and lemmatization. Each corpus was further automatically annotated with genre labels using the X-GENRE classifier (http://doi.org/10.57967/hf/0927) and with topic labels using the IPTC news topic classifier (http://doi.org/10.57967/hf/4709). Additional details on corpus construction are available at https://clarinsi.github.io/classla-web/. The CLASSLA-web 2.0 corpora are distributed in two complementary formats. 
In JSONL format, each web document is represented on a single line containing a complete JSON object with document-level metadata and full text, enabling efficient line-by-line processing of large datasets. This format is primarily intended for downloading, filtering, and offline processing. Two JSONL files are provided for each corpus, with the suffixes .jsonl and .anno.jsonl. The two files contain the same documents and metadata; the .anno.jsonl version additionally includes the linguistically annotated text in CoNLL-U format. The second format is the so-called vertical format (VERT): a vertically tokenized, XML-like representation that integrates document-, paragraph-, sentence-, and token-level information together with linguistic annotation, and can be used by the (no)Sketch Engine and CWB concordancers. The document-level metadata provided in both formats include document ID, title, URL, domain, top-level domain (tld), language, script (Latin or Cyrillic, applicable to the Bosnian, Croatian, Montenegrin, and Serbian corpora), year of crawling, and predicted genre and topic categories. Further details on metadata attributes and formats are provided in the accompanying README file. Compared to CLASSLA-web 1.0 (collected in 2021–2022), the new release provides a substantially larger and more recent snapshot of web content, with only about 20 percent textual overlap between the two versions. The new release additionally includes topic annotations alongside genre labels and is distributed in the widely used JSONL and VERT formats. The CLASSLA-web 1.0 corpora were published as separate entries, namely Bosnian (https://hdl.handle.net/11356/1927), Bulgarian (https://hdl.handle.net/11356/1928), Croatian (https://hdl.handle.net/11356/1929), Macedonian (https://hdl.handle.net/11356/1932), Montenegrin (https://hdl.handle.net/11356/1930), Serbian (https://hdl.handle.net/11356/1931) and Slovenian (https://hdl.handle.net/11356/1882). 
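The line-by-line processing that the JSONL format is meant to enable can be sketched as follows. This is a minimal illustration, not part of the official tooling: the function names `iter_documents` and `filter_documents` are made up here, and the key name `genre` and value `News` in the usage comment are assumptions, since the actual attribute names and label sets are documented only in the README file.

```python
import gzip
import json
from typing import Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a gzip-compressed JSONL file."""
    # Text mode lets gzip decompress on the fly, so the file is never
    # fully loaded into memory or unpacked to disk.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def filter_documents(path: str, key: str, value: str) -> Iterator[dict]:
    """Yield only documents whose top-level `key` equals `value`."""
    for doc in iter_documents(path):
        if doc.get(key) == value:
            yield doc


# Hypothetical usage; "genre" and "News" are assumed names, check the README:
# for doc in filter_documents("CLASSLA-web.sl.2.0.jsonl.gz", "genre", "News"):
#     print(doc)
```

Because both functions are generators, memory use stays roughly constant regardless of corpus size, which matters for files that reach tens or hundreds of gigabytes when uncompressed.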
Notice and take down: Should you consider that our data contains material that is owned by you and should not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
dc.language.iso bos
dc.language.iso bul
dc.language.iso hrv
dc.language.iso mkd
dc.language.iso cnr
dc.language.iso srp
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://doi.org/10.48550/arXiv.2601.11170
dc.rights CC0-No Rights Reserved
dc.rights.uri https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label PUB
dc.source.uri https://clarinsi.github.io/classla-web/
dc.subject web corpus
dc.subject automatic genre identification
dc.subject genre corpus
dc.subject web crawling
dc.subject web
dc.subject topic classification
dc.subject topic
dc.title South Slavic web corpus collection CLASSLA-web 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Taja Kuzman Pungeršek taja.kuzman@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
size.info 17010802368 words
size.info 38057171 texts
files.count 22
files.size 488068698480
featuredService.noske Bosnian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bs
featuredService.noske Bulgarian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_bg
featuredService.noske Croatian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_hr
featuredService.noske Macedonian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_mk
featuredService.noske Montenegrin|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_cnr
featuredService.noske Serbian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sr
featuredService.noske Slovenian|https://www.clarin.si/ske/#dashboard?corpname=classlaweb2_sl
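
The size figures in this record can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the record (the rounded per-corpus word counts from dc.description, the two size.info values, and files.size); the variable names are illustrative, and reading files.size as a byte count is an assumption.

```python
# Per-corpus word counts as rounded in dc.description.
WORDS_BY_CORPUS = {
    "Slovenian": 2_310_000_000,
    "Croatian": 3_010_000_000,
    "Bosnian": 1_010_000_000,
    "Montenegrin": 294_000_000,
    "Serbian": 3_710_000_000,
    "Macedonian": 691_000_000,
    "Bulgarian": 5_990_000_000,
}

TOTAL_WORDS = 17_010_802_368   # size.info, words
TOTAL_TEXTS = 38_057_171       # size.info, texts
TOTAL_BYTES = 488_068_698_480  # files.size, assumed to be bytes

# The rounded per-corpus figures should add up to roughly the exact total.
rounded_sum = sum(WORDS_BY_CORPUS.values())
relative_gap = (rounded_sum - TOTAL_WORDS) / TOTAL_WORDS

# Average document length across the whole collection.
words_per_text = TOTAL_WORDS / TOTAL_TEXTS

# Total download size in decimal (GB) and binary (GiB) units.
size_gb = TOTAL_BYTES / 10**9
size_gib = TOTAL_BYTES / 2**30

print(f"sum of rounded counts: {rounded_sum:,} ({relative_gap:.4%} above total)")
print(f"average words per text: {words_per_text:.2f}")
print(f"total size: {size_gb:.2f} GB = {size_gib:.2f} GiB")
```

The rounded counts sum to 17.015 billion words, within a small fraction of a percent of the exact total, and the average text is about 447 words long. The binary-unit total is close to the sum of the per-file sizes shown in the file listing, which suggests that the listing reports sizes in binary units.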


Files in this item

This item is Publicly Available and licensed under: CC0-No Rights Reserved
| Name | Size | Format | Description | MD5 |
|---|---|---|---|---|
| README.md | 40.77 KB | Unknown | Documentation on corpora size, format and content | efe94358cbd6e33b4152b549d13b9cac |
| CLASSLA-web.bg.2.0.anno.jsonl.gz | 86.63 GB | application/gzip | JSONL file with linguistic annotation (672.85 GB uncompressed) | 03b4664a3c8403c2fd955561de535329 |
| CLASSLA-web.bg.2.0.jsonl.gz | 18.08 GB | application/gzip | JSONL file (68.09 GB uncompressed) | cfba393ca4503831f7b956a18f43f411 |
| CLASSLA-web.bg.2.0.vert.tar.gz | 56.92 GB | application/gzip | VERT file (445.06 GB uncompressed) | 63bf5ecb1007315c121aea1db345cad2 |
| CLASSLA-web.bs.2.0.anno.jsonl.gz | 16.09 GB | application/gzip | JSONL file with linguistic annotation (103.26 GB uncompressed) | f6bd4f43c6ab184439de6b6cadb85fc9 |
| CLASSLA-web.bs.2.0.jsonl.gz | 2.63 GB | application/gzip | JSONL file (7.02 GB uncompressed) | f92e72f4fe08362a1297f311ac20ad33 |
| CLASSLA-web.bs.2.0.vert.tar.gz | 9.04 GB | application/gzip | VERT file (64.87 GB uncompressed) | 8aa9734b5902e542f03969e0dc188570 |
| CLASSLA-web.cnr.2.0.anno.jsonl.gz | 4.72 GB | application/gzip | JSONL file with linguistic annotation (30.30 GB uncompressed) | f705697eb15ace8b24756a5666386890 |
| CLASSLA-web.cnr.2.0.jsonl.gz | 802.65 MB | application/gzip | JSONL file (2.10 GB uncompressed) | fdcf171ea122710559840f75d841b923 |
| CLASSLA-web.cnr.2.0.vert.tar.gz | 2.65 GB | application/gzip | VERT file (19.07 GB uncompressed) | a0717d03aa86ba09236712bf547304b5 |
| CLASSLA-web.hr.2.0.anno.jsonl.gz | 48.32 GB | application/gzip | JSONL file with linguistic annotation (306.42 GB uncompressed) | e276b57005d0d419fd727b38db303ee0 |
| CLASSLA-web.hr.2.0.jsonl.gz | 7.87 GB | application/gzip | JSONL file (20.67 GB uncompressed) | 6d2af5fe8e3448de5e2598652048be00 |
| CLASSLA-web.hr.2.0.vert.tar.gz | 26.77 GB | application/gzip | VERT file (192.32 GB uncompressed) | b1456e27ffe6c9267f53bf634dfb78f6 |
| CLASSLA-web.mk.2.0.anno.jsonl.gz | 9.15 GB | application/gzip | JSONL file with linguistic annotation (77.39 GB uncompressed) | 00d09c3e21237329c980141f2afa1582 |
| CLASSLA-web.mk.2.0.jsonl.gz | 2.01 GB | application/gzip | JSONL file (8.17 GB uncompressed) | 16f7cbd13503723684556984549506c6 |
| CLASSLA-web.mk.2.0.vert.tar.gz | 6.17 GB | application/gzip | VERT file (51.15 GB uncompressed) | c8332aa148efb664e3d97dcf46750e9c |
| CLASSLA-web.sl.2.0.anno.jsonl.gz | 37.49 GB | application/gzip | JSONL file with linguistic annotation (239.04 GB uncompressed) | 97d6133551f47c14e36228c3c82a0983 |
| CLASSLA-web.sl.2.0.jsonl.gz | 5.89 GB | application/gzip | JSONL file (15.62 GB uncompressed) | ffbdc907e3fb96d20befb93e1b2ec0b2 |
| CLASSLA-web.sl.2.0.vert.tar.gz | 20.64 GB | application/gzip | VERT file (148.49 GB uncompressed) | 7be8fa7621d088c2bb69f08e11288eb4 |
| CLASSLA-web.sr.2.0.anno.jsonl.gz | 50.08 GB | application/gzip | JSONL file with linguistic annotation (346.54 GB uncompressed) | d03c12e6f940413962efac58da3a107f |
| CLASSLA-web.sr.2.0.jsonl.gz | 9.66 GB | application/gzip | JSONL file (24.71 GB uncompressed) | b794cac545b9f1ba28166f45a1f920eb |
| CLASSLA-web.sr.2.0.vert.tar.gz | 32.94 GB | application/gzip | VERT file (236.24 GB uncompressed) | 733e77a3a12f6469e0c4eb53e4269b90 |
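
Since each file in the listing comes with an MD5 checksum, a download can be verified before it is unpacked. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are illustrative, and the chunk size is an arbitrary choice.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the checksum published in the listing."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example with the README checksum from the listing (assumes the file was
# saved under its original name in the current directory):
# verify_download("README.md", "efe94358cbd6e33b4152b549d13b9cac")
```

A mismatch usually points to an incomplete or corrupted transfer, in which case downloading the file again is the simplest remedy.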
