
 
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2021-05-06T07:44:43Z
dc.date.available 2021-05-06T07:44:43Z
dc.date.issued 2021-05-05
dc.identifier.uri http://hdl.handle.net/11356/1426
dc.description The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls from before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The text collection is formatted with one sentence per line and an empty line as the document boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
dc.language.iso bos
dc.language.iso hrv
dc.language.iso cnr
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://arxiv.org/abs/2104.09243
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://huggingface.co/classla/bcms-bertic
dc.subject web corpus
dc.subject language model
dc.title Text collection for training the BERTić transformer model BERTić-data
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/classla/bcms-bertic-ner
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-8280 FRENK: Resources, methods, and tools for the understanding, identification, and classification of various forms of socially unacceptable discourse in the information society nationalFunds
size.info 8387681518 words
files.count 10
files.size 22694988939
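
The layout stated in dc.description (one sentence per line, an empty line between documents) can be read as a stream without decompressing a file to disk. The following is a minimal sketch, not part of the original record: the function name `iter_documents` is hypothetical, and UTF-8 encoding is an assumption, since the record does not state the encoding.

```python
import gzip
from typing import Iterator, List


def iter_documents(path: str) -> Iterator[List[str]]:
    """Yield each document as a list of sentences.

    Assumes the layout described in dc.description: one sentence per
    line, with an empty line marking a document boundary.
    UTF-8 is an assumption; the record does not state the encoding.
    """
    sentences: List[str] = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:
                # Non-empty line: one sentence of the current document.
                sentences.append(line)
            elif sentences:
                # Empty line: the current document is complete.
                yield sentences
                sentences = []
    # Emit the last document if the file does not end with an empty line.
    if sentences:
        yield sentences
```

A call such as `for doc in iter_documents("riznica.gz"): ...` would then visit one document at a time, which keeps memory use low even for the larger files.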


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name           Size       Format            Description                                 MD5
bswac.gz       654.71 MB  application/gzip  Web-as-Corpus 2014 Bosnian texts            f40da84f9e1a4049f589480b2c6f647c
hrwac.gz       3.1 GB     application/gzip  Web-as-Corpus 2011 and 2014 Croatian texts  7306c700626c5b9a803879fd96da2d21
cnrwac.gz      214.97 MB  application/gzip  Web-as-Corpus 2019 Montenegrin texts        3169e200a089b9dd7214096bec8c0376
srwac.gz       1.21 GB    application/gzip  Web-as-Corpus 2014 Serbian texts            66f3a5e31b2c38215c8f4eddd7df6903
cc100-hr.gz    7.49 GB    application/gzip  Common Crawl Croatian texts                 a9702d6d1f6dd2bb34de5acb0f45f3aa
cc100-sr.gz    1.79 GB    application/gzip  Common Crawl Serbian texts                  c7e9b916c22a5ecc13f65d657b6815f2
classla-bs.gz  1.31 GB    application/gzip  Web-as-Corpus 2020 Bosnian texts            ddffa3382d25409006b20338bb8c8e7b
classla-hr.gz  3.32 GB    application/gzip  Web-as-Corpus 2020 Croatian texts           967c44708a26cda5003696192793aab4
classla-sr.gz  1.85 GB    application/gzip  Web-as-Corpus 2020 Serbian texts            4ecf85bd53ba955ee5d2bbbf73a7d0f6
riznica.gz     217.57 MB  application/gzip  Croatian newspaper and literary texts       df60a9569317ce390be0424604a446f1
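
Because the files are large, it may be worth checking a download against the MD5 value listed for it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify` are hypothetical, and the checksums are copied from the file listing in this record.

```python
import hashlib

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "bswac.gz": "f40da84f9e1a4049f589480b2c6f647c",
    "hrwac.gz": "7306c700626c5b9a803879fd96da2d21",
    "cnrwac.gz": "3169e200a089b9dd7214096bec8c0376",
    "srwac.gz": "66f3a5e31b2c38215c8f4eddd7df6903",
    "cc100-hr.gz": "a9702d6d1f6dd2bb34de5acb0f45f3aa",
    "cc100-sr.gz": "c7e9b916c22a5ecc13f65d657b6815f2",
    "classla-bs.gz": "ddffa3382d25409006b20338bb8c8e7b",
    "classla-hr.gz": "967c44708a26cda5003696192793aab4",
    "classla-sr.gz": "4ecf85bd53ba955ee5d2bbbf73a7d0f6",
    "riznica.gz": "df60a9569317ce390be0424604a446f1",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, name: str) -> bool:
    """Compare the digest of the file at path with the listed MD5 for name."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("downloads/bswac.gz", "bswac.gz")` would return `True` only if the local copy (the path here is illustrative) matches the listed checksum. Reading in chunks avoids loading a multi-gigabyte file into memory.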
