dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2021-05-06T07:44:43Z |
dc.date.available | 2021-05-06T07:44:43Z |
dc.date.issued | 2021-05-05 |
dc.identifier.uri | http://hdl.handle.net/11356/1426 |
dc.description | The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf). |
dc.language.iso | bos |
dc.language.iso | hrv |
dc.language.iso | cnr |
dc.language.iso | srp |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://arxiv.org/abs/2104.09243 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://huggingface.co/classla/bcms-bertic |
dc.subject | web corpus |
dc.subject | language model |
dc.title | Text collection for training the BERTić transformer model BERTić-data |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/classla/bcms-bertic-ner |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other |
sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-8280 FRENK: Resources, methods, and tools for the understanding, identification, and classification of various forms of socially unacceptable discourse in the information society nationalFunds |
size.info | 8387681518 words |
files.count | 10 |
files.size | 22694988939 |
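The description states that the collection is stored as one sentence per line, with an empty line marking a document boundary. A minimal sketch of reading that layout from one of the gzip files might look as follows. The function names `iter_documents` and `iter_gzip_documents` are hypothetical, and UTF-8 encoding is an assumption, since the record does not state the encoding.

```python
import gzip
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on empty lines."""
    sentences: List[str] = []
    for line in lines:
        sentence = line.rstrip("\n")
        if sentence.strip():
            # Non-empty line: one sentence of the current document.
            sentences.append(sentence)
        elif sentences:
            # Empty line: the current document is complete.
            yield sentences
            sentences = []
    if sentences:
        # The last document may not be followed by an empty line.
        yield sentences


def iter_gzip_documents(path: str) -> Iterator[List[str]]:
    """Stream documents from a gzip file without unpacking it to disk.

    Assumption: the text is UTF-8 encoded.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_documents(handle)
```

Because both functions are generators, a file such as `cc100-hr.gz` can be processed document by document without loading it into memory, for example with `for doc in iter_gzip_documents("cc100-hr.gz"): ...`.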
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)




Name | Size | Format | Description | MD5 |
bswac.gz | 654.71 MB | application/gzip | Web-as-Corpus 2014 Bosnian texts | f40da84f9e1a4049f589480b2c6f647c |
hrwac.gz | 3.1 GB | application/gzip | Web-as-Corpus 2011 and 2014 Croatian texts | 7306c700626c5b9a803879fd96da2d21 |
cnrwac.gz | 214.97 MB | application/gzip | Web-as-Corpus 2019 Montenegrin texts | 3169e200a089b9dd7214096bec8c0376 |
srwac.gz | 1.21 GB | application/gzip | Web-as-Corpus 2014 Serbian texts | 66f3a5e31b2c38215c8f4eddd7df6903 |
cc100-hr.gz | 7.49 GB | application/gzip | Common Crawl Croatian texts | a9702d6d1f6dd2bb34de5acb0f45f3aa |
cc100-sr.gz | 1.79 GB | application/gzip | Common Crawl Serbian texts | c7e9b916c22a5ecc13f65d657b6815f2 |
classla-bs.gz | 1.31 GB | application/gzip | Web-as-Corpus 2020 Bosnian texts | ddffa3382d25409006b20338bb8c8e7b |
classla-hr.gz | 3.32 GB | application/gzip | Web-as-Corpus 2020 Croatian texts | 967c44708a26cda5003696192793aab4 |
classla-sr.gz | 1.85 GB | application/gzip | Web-as-Corpus 2020 Serbian texts | 4ecf85bd53ba955ee5d2bbbf73a7d0f6 |
riznica.gz | 217.57 MB | application/gzip | Croatian newspaper and literary texts | df60a9569317ce390be0424604a446f1 |
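
The MD5 checksums listed for each file above can be used to check that a download is complete. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the files keep their listed names after download.

```python
import hashlib
import os

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "bswac.gz": "f40da84f9e1a4049f589480b2c6f647c",
    "hrwac.gz": "7306c700626c5b9a803879fd96da2d21",
    "cnrwac.gz": "3169e200a089b9dd7214096bec8c0376",
    "srwac.gz": "66f3a5e31b2c38215c8f4eddd7df6903",
    "cc100-hr.gz": "a9702d6d1f6dd2bb34de5acb0f45f3aa",
    "cc100-sr.gz": "c7e9b916c22a5ecc13f65d657b6815f2",
    "classla-bs.gz": "ddffa3382d25409006b20338bb8c8e7b",
    "classla-hr.gz": "967c44708a26cda5003696192793aab4",
    "classla-sr.gz": "4ecf85bd53ba955ee5d2bbbf73a7d0f6",
    "riznica.gz": "df60a9569317ce390be0424604a446f1",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str) -> bool:
    """Compare a downloaded file against the checksum listed for its name.

    Raises KeyError if the file name is not one of the ten listed files.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

Reading in fixed-size chunks keeps memory use constant, which matters for the multi-gigabyte files in this collection.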