dc.contributor.author | Ulčar, Matej |
dc.contributor.author | Robnik-Šikonja, Marko |
dc.date.accessioned | 2020-12-29T16:55:33Z |
dc.date.available | 2020-12-29T16:55:33Z |
dc.date.issued | 2020-12-29 |
dc.identifier.uri | http://hdl.handle.net/11356/1387 |
dc.description | The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model contain 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available in the Slovene-BERT-Tool repository (https://github.com/clarinsi/Slovene-BERT-Tool). The model released here is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers). |
dc.language.iso | slv |
dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1397 |
dc.rights | The MIT License (MIT) |
dc.rights.uri | https://opensource.org/licenses/mit-license.php |
dc.rights.label | PUB |
dc.source.uri | https://rsdo.slovenscina.eu/en/semantic-resources-and-technologies |
dc.subject | BERT |
dc.subject | RoBERTa |
dc.subject | word embeddings |
dc.subject | language model |
dc.subject | contextual embeddings |
dc.title | Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0 |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | other |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
sponsor | Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other |
files.count | 4 |
files.size | 444073197 |
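The description says that embeddings can be extracted for every word occurrence using the transformers library. The following is a minimal sketch of that step, not a confirmed recipe from this record. It assumes the files listed under "Files in this item" have been downloaded into a local directory (the path `./sloberta` is hypothetical), that the `torch`, `transformers` and `sentencepiece` packages are installed, and that the CamemBERT classes of transformers can read this checkpoint. The last point is an assumption based only on the stated relation to CamemBERT.

```python
from pathlib import Path

# File names taken from the "Files in this item" listing of this record.
REQUIRED_FILES = ("config.json", "pytorch_model.bin", "sentencepiece.bpe.model")


def missing_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    return [name for name in REQUIRED_FILES if not (Path(model_dir) / name).is_file()]


def embed(text, model_dir):
    """Return one contextual vector per subword token of `text`."""
    # Imported lazily so the file check above works without these packages.
    import torch
    from transformers import CamembertModel, CamembertTokenizer

    absent = missing_files(model_dir)
    if absent:
        raise FileNotFoundError(f"missing from {model_dir}: {absent}")

    # Assumption: the CamemBERT classes match this checkpoint's layout.
    tokenizer = CamembertTokenizer.from_pretrained(model_dir)
    model = CamembertModel.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (number of subword tokens including special tokens, hidden size).
    return outputs.last_hidden_state[0]


# Example call (hypothetical path, requires the downloaded files):
# vectors = embed("Ljubljana je glavno mesto Slovenije.", "./sloberta")
```

The returned vectors are per subword, not per word. To get one vector per word, the subword vectors belonging to that word would need to be pooled, for example by averaging.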
Files in this item
Name | Size | Format | Description | MD5 |
config.json | 520 bytes | Unknown | Configuration file, describing the model's architecture | 00189fa49a298a01e689fafbffff9fb5 |
pytorch_model.bin | 422.32 MB | Unknown | SloBERTa model | e84c8114553e35003c8b5777534c8d4a |
sentencepiece.bpe.model | 781.26 KB | Unknown | Sentencepiece model tokenizer | e4acd9c398b3fc68e1b2bc0f27d0d409 |
dict.txt | 424.06 KB | Text file | Sentencepiece subword token vocabulary | c3988aad6b209cf7f61667bbf2aba173 |
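The MD5 values listed for the four files can be used to check a download for corruption. The sketch below is one way to do that with the Python standard library. The digests are copied from the listing above. The function names and the directory argument are illustrative and not part of this record.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "config.json": "00189fa49a298a01e689fafbffff9fb5",
    "pytorch_model.bin": "e84c8114553e35003c8b5777534c8d4a",
    "sentencepiece.bpe.model": "e4acd9c398b3fc68e1b2bc0f27d0d409",
    "dict.txt": "c3988aad6b209cf7f61667bbf2aba173",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == want) if path.is_file() else None
    return results
```

Reading in chunks keeps memory use low for `pytorch_model.bin`, which is over 400 MB. MD5 is adequate for detecting accidental corruption but does not protect against deliberate tampering.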
<unk> 999 #fairseq:overwrite <s> 999 #fairseq:overwrite </s> 999 #fairseq:overwrite ▁p 999 ▁s 999 je 999 na 999 ni 999 ra 999 ▁v 999 re 999 ▁d 999 st 999 ▁i 999 ne 999 ▁z 999 ko 999 no 999 ▁po 999 ▁o 999 ▁t 999 li 999 ja 999 ri 999 ▁na 999 la 999 ▁k 999 lo 999 me 999 ▁in 999 le 999 ro 999 va 999 ▁za 999 ▁je 999 ve 999 te 999 ▁b 999 ti 999 mo 999 ga 999 vi 999 di 999 ka 999 ma 999 ▁se 999 jo 999 ji 999 vo 999 nje 999 ci 999 da 999 to 999 go 999 ▁pre 999 po 999 ta 999 mi 999 ▁pri 999 se 999 ▁u 999 ke 999 ki 999 ▁da 999 ▁ko 999 de 999 ▁do 999 ▁te 999 ▁so 999 če 999 ▁ne 999 ce 999 ▁iz 999 či 999 ju 999 sti 999 ▁bi 999 ▁a 999 ▁pa 999 ▁ra 999 ▁P 999 ▁ki 999 ▁ka 999 ▁od 999 ▁ob 999 nih 999 ▁de 999 ▁S 999 ča 999 ru 999 ▁ve 999 ▁bo 999 do 999 sta 999 ▁š 999 nja 999 ▁( 999 ▁me 999 ▁1 999 že 999 ▁mo 999 ▁raz 999 ku 999 ▁tu 999 bi 999 lja 999 ▁e 999 ▁V 999 ▁2 999 ▁pro 999 ▁le 999 lje 999 ns 999 be 999 sa 999 sto 999 zi 999 za 999 ▁N 999 ▁T 999 ▁ni 999 pa 999 bo 999 ▁ta 999 pi 999 ▁K 999 ▁tudi 999 . . .
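The preview above shows the contents of `dict.txt` run together on a single line. As a hedged illustration, the sketch below parses such entries under the assumption that the actual file follows the fairseq dictionary convention of one `token count` pair per line, optionally followed by the `#fairseq:overwrite` flag seen on the first three entries. The function name and the sample lines are illustrative. The sample reuses tokens from the preview.

```python
OVERWRITE_FLAG = "#fairseq:overwrite"


def parse_dict_lines(lines):
    """Parse dictionary lines into (token, count, overwrite) tuples."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        overwrite = line.endswith(OVERWRITE_FLAG)
        if overwrite:
            line = line[: -len(OVERWRITE_FLAG)].rstrip()
        # Split on the last space so the token itself is kept intact.
        token, count = line.rsplit(" ", 1)
        entries.append((token, int(count), overwrite))
    return entries


# Sample lines built from the first entries of the preview above.
sample = ["<unk> 999 #fairseq:overwrite", "▁p 999", "je 999"]
entries = parse_dict_lines(sample)
```

In the preview every count is 999, so the counts look like placeholders and not corpus frequencies. The `▁` prefix is the SentencePiece marker for a token that starts a new word.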