dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2021-05-06T07:44:43Z
dc.date.available 2021-05-06T07:44:43Z
dc.date.issued 2021-05-05
dc.identifier.uri http://hdl.handle.net/11356/1426
dc.description The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
dc.language.iso bos
dc.language.iso hrv
dc.language.iso cnr
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://arxiv.org/abs/2104.09243
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://huggingface.co/classla/bcms-bertic
dc.subject web corpus
dc.subject language model
dc.title Text collection for training the BERTić transformer model BERTić-data
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/classla/bcms-bertic-ner
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-8280 FRENK: Resources, methods, and tools for the understanding, identification, and classification of various forms of socially unacceptable discourse in the information society nationalFunds
size.info 8387681518 words
files.count 10
files.size 22694988939
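
The description gives the file format as one sentence per line, with an empty line marking a document boundary. A minimal sketch of reading that layout could look like the following. The function names `iter_documents` and `read_documents` are illustrative, and UTF-8 encoding is an assumption that the record does not state.

```python
import gzip
import io
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on empty lines."""
    sentences: List[str] = []
    for line in lines:
        sentence = line.strip()
        if sentence:
            sentences.append(sentence)
        elif sentences:
            # An empty line closes the current document.
            yield sentences
            sentences = []
    if sentences:
        # The last document may not be followed by an empty line.
        yield sentences


def read_documents(path: str) -> Iterator[List[str]]:
    """Stream documents from a gzip file (UTF-8 is assumed, not confirmed)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_documents(handle)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for one of the .gz files.
    sample = "Prva rečenica.\nDruga rečenica.\n\nNovi dokument.\n"
    packed = gzip.compress(sample.encode("utf-8"))
    with gzip.open(io.BytesIO(packed), "rt", encoding="utf-8") as handle:
        for document in iter_documents(handle):
            print(len(document), document)
```

Streaming line by line keeps memory use flat, which matters here because the collection holds several gigabytes of compressed text per file.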


Files in this item

Name           Size       Format            Description                                 MD5
bswac.gz       654.71 MB  application/gzip  Web-as-Corpus 2014 Bosnian texts            f40da84f9e1a4049f589480b2c6f647c
hrwac.gz       3.1 GB     application/gzip  Web-as-Corpus 2011 and 2014 Croatian texts  7306c700626c5b9a803879fd96da2d21
cnrwac.gz      214.97 MB  application/gzip  Web-as-Corpus 2019 Montenegrin texts        3169e200a089b9dd7214096bec8c0376
srwac.gz       1.21 GB    application/gzip  Web-as-Corpus 2014 Serbian texts            66f3a5e31b2c38215c8f4eddd7df6903
cc100-hr.gz    7.49 GB    application/gzip  Common Crawl Croatian texts                 a9702d6d1f6dd2bb34de5acb0f45f3aa
cc100-sr.gz    1.79 GB    application/gzip  Common Crawl Serbian texts                  c7e9b916c22a5ecc13f65d657b6815f2
classla-bs.gz  1.31 GB    application/gzip  Web-as-Corpus 2020 Bosnian texts            ddffa3382d25409006b20338bb8c8e7b
classla-hr.gz  3.32 GB    application/gzip  Web-as-Corpus 2020 Croatian texts           967c44708a26cda5003696192793aab4
classla-sr.gz  1.85 GB    application/gzip  Web-as-Corpus 2020 Serbian texts            4ecf85bd53ba955ee5d2bbbf73a7d0f6
riznica.gz     217.57 MB  application/gzip  Croatian newspaper and literary texts       df60a9569317ce390be0424604a446f1
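
The MD5 checksums listed for each file can be used to confirm that a download completed intact. The sketch below is one way to do that, assuming the ten files were saved under their listed names in a single local directory. The helper names `md5_of_stream` and `verify_directory` are illustrative and not part of any CLARIN.SI tooling.

```python
import hashlib
import io
import os
from typing import BinaryIO, Dict

# Checksums copied from the file listing of this record.
EXPECTED_MD5: Dict[str, str] = {
    "bswac.gz": "f40da84f9e1a4049f589480b2c6f647c",
    "hrwac.gz": "7306c700626c5b9a803879fd96da2d21",
    "cnrwac.gz": "3169e200a089b9dd7214096bec8c0376",
    "srwac.gz": "66f3a5e31b2c38215c8f4eddd7df6903",
    "cc100-hr.gz": "a9702d6d1f6dd2bb34de5acb0f45f3aa",
    "cc100-sr.gz": "c7e9b916c22a5ecc13f65d657b6815f2",
    "classla-bs.gz": "ddffa3382d25409006b20338bb8c8e7b",
    "classla-hr.gz": "967c44708a26cda5003696192793aab4",
    "classla-sr.gz": "4ecf85bd53ba955ee5d2bbbf73a7d0f6",
    "riznica.gz": "df60a9569317ce390be0424604a446f1",
}


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory: str) -> Dict[str, str]:
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    report: Dict[str, str] = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            report[name] = "missing"
            continue
        with open(path, "rb") as handle:
            actual = md5_of_stream(handle)
        report[name] = "ok" if actual == expected else "mismatch"
    return report


if __name__ == "__main__":
    # Digest of an in-memory sample, so the helper can be tried without a download.
    print(md5_of_stream(io.BytesIO(b"hello")))
    # Hypothetical download location: the current directory.
    for name, status in verify_directory(".").items():
        print(name, status)
```

Reading in 1 MiB chunks avoids loading multi-gigabyte archives into memory. MD5 is suitable here only as an integrity check against accidental corruption, which is how the record uses it.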
