Bulgarian web corpus CLASSLA-web.bg 1.0

Name: Bulgarian web corpus CLASSLA-web.bg 1.0
License: https://creativecommons.org/publicdomain/zero/1.0/

Ljubešić, Nikola; Rupnik, Peter; Kuzman, Taja

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Kuzman, Taja
dc.date.accessioned	2024-03-26T10:15:12Z
dc.date.available	2024-03-26T10:15:12Z
dc.date.issued	2024-03-26
dc.identifier.uri	http://hdl.handle.net/11356/1928
dc.description	The Bulgarian web corpus CLASSLA-web.bg 1.0 is based on the MaCoCu-bg 2.0 web corpus crawl (http://hdl.handle.net/11356/1800), which was additionally cleaned and enriched with linguistic and genre information. The CLASSLA-web.bg corpus is part of the South Slavic CLASSLA-web corpus collection, which is the first collection of comparable corpora that encompasses the entire South Slavic language group. The MaCoCu-bg 2.0 crawl was built by crawling the ".bg" and ".бг" internet top-level domains in 2021 and by extending the crawl dynamically to other domains. During the development of CLASSLA-web corpora, the MaCoCu web crawls were cleaned by removing paragraphs that are not in the target language, and by removing very short texts (fewer than 75 words or consisting only of paragraphs shorter than 70 characters). The corpus was also linguistically annotated with the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla). The linguistic processing involved tokenization, morphosyntactic annotation, and lemmatization. Additionally, the corpus was automatically annotated with genres using the Transformer-based X-GENRE classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier). The following genre categories are used: News, Information/Explanation, Promotion, Opinion/Argumentation, Instruction, Legal, Prose/Lyrical, Forum, Other and Mix. The corpus is available in vertical format, as used by Sketch Engine and CWB concordancers. Information is provided at the text, paragraph, sentence and token levels. Each text is accompanied by the following metadata: text id, title, url, domain, top-level domain (tld, e.g., "com"), and predicted genre category. Each text is divided into paragraphs that are accompanied by the following metadata: paragraph id, the automatically identified language of the text in the paragraph, and paragraph quality.
Quality labels such as "short" or "good" are assigned based on paragraph length, URL and stopword density via the jusText tool (https://corpus.tools/wiki/Justext). Paragraphs are further divided into sentences, whose only metadata is the sentence id. Inside sentences, tokens are provided in tabular format with their linguistic annotation. Details about the structural and positional attributes are also given in the accompanying registry file, which was used to install the corpus on the CLARIN.SI concordancers. In addition, a compressed list of the full URLs from the corpus is available, providing a concise overview of its content. Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material. (4) Write to the contact person for this resource, whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. A JSONL version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The MaCoCu-Genre version comprises texts and metadata at the text level, including genre information, and is not linguistically annotated. Please note that this is an earlier version of the CLASSLA-web corpus collection. A newer collection of web texts is available as the CLASSLA-web 2.0 corpus, accessible here: http://hdl.handle.net/11356/2079.
The 2.0 release consists of texts collected from the web in 2024 and introduces several enhancements: topic annotation has been added alongside genre annotation, and the corpora are provided in the widely used JSONL and VERT formats. Content overlap analyses indicate that CLASSLA-web 1.0 and CLASSLA-web 2.0 share only around 20% of their content. Consequently, CLASSLA-web 2.0 should be viewed not as a replacement for version 1.0, but as a complementary resource. The two corpus collections can be used together to maximize the amount of available text data.
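The short-text cleaning rule from the description (drop texts with fewer than 75 words, or texts consisting only of paragraphs shorter than 70 characters) can be illustrated with a minimal sketch. This is not the code used to build the corpus: the function name `keep_text` is hypothetical, and counting words by whitespace splitting is an assumption, since the description does not state how words were counted.

```python
MIN_WORDS = 75       # threshold stated in the description
MIN_PARA_CHARS = 70  # threshold stated in the description


def keep_text(paragraphs: list[str]) -> bool:
    """Return True if a text would survive the short-text filter."""
    # Assumption: a "word" is a whitespace-separated chunk.
    n_words = sum(len(p.split()) for p in paragraphs)
    if n_words < MIN_WORDS:
        return False
    # Reject texts in which every paragraph is shorter than 70 characters.
    return any(len(p) >= MIN_PARA_CHARS for p in paragraphs)


if __name__ == "__main__":
    long_text = ["дума " * 80]          # one long paragraph with 80 words
    fragments = ["a b c d e"] * 20      # 100 words, but only tiny paragraphs
    print(keep_text(long_text), keep_text(fragments))
```

Under these assumptions the first sample text is kept and the second is dropped, because none of its paragraphs reaches 70 characters even though it has enough words.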
dc.language.iso	bul
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://aclanthology.org/2024.lrec-main.291/
dc.rights	CC0-No Rights Reserved
dc.rights.uri	https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label	PUB
dc.source.uri	https://clarinsi.github.io/classla-web/
dc.subject	web corpus
dc.subject	automatic genre identification
dc.subject	genre corpus
dc.title	Bulgarian web corpus CLASSLA-web.bg 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://www.clarin.si/info/k-centre/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers/
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info	3917596458 tokens
size.info	3249563781 words
size.info	7456727 texts
files.count	3
files.size	34600133507
featuredService.noske	search\|https://www.clarin.si/ske/#dashboard?corpname=classlaweb_bg
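
The vertical format mentioned in the description stores one token per line with tab-separated annotation columns, and marks texts, paragraphs and sentences with XML-like tags. The following sketch shows one way such a file could be read text by text. The function name `iter_texts`, the sample attribute names and ids, and the two-column token layout are illustrative assumptions; the actual structural and positional attributes are listed in the registry file that accompanies the corpus.

```python
import re
from typing import Iterable, Iterator

# Matches key="value" pairs inside a structural tag such as <text ...>.
ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')


def iter_texts(lines: Iterable[str]) -> Iterator[tuple[dict[str, str], list[list[str]]]]:
    """Yield (text attributes, token rows) for each <text> element."""
    attrs, tokens = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<text"):
            attrs, tokens = dict(ATTR_RE.findall(line)), []
        elif line.startswith("</text>"):
            if attrs is not None:
                yield attrs, tokens
            attrs = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # other structural tags, e.g. <p>, <s>, <g/>
        elif line and attrs is not None:
            tokens.append(line.split("\t"))  # one row of positional attributes


if __name__ == "__main__":
    # Hypothetical sample; real files carry more attributes and columns.
    sample = [
        '<text id="doc-1" genre="News">',
        '<p id="doc-1.1">',
        '<s id="doc-1.1.1">',
        "Това\tтова",
        "е\tсъм",
        "</s>",
        "</p>",
        "</text>",
    ]
    for meta, rows in iter_texts(sample):
        print(meta["genre"], len(rows))
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, which avoids loading a multi-gigabyte corpus file into memory.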