dc.contributor.author Ulčar, Matej
dc.contributor.author Robnik-Šikonja, Marko
dc.date.accessioned 2020-12-29T16:55:33Z
dc.date.available 2020-12-29T16:55:33Z
dc.date.issued 2020-12-29
dc.identifier.uri http://hdl.handle.net/11356/1387
dc.description The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, which are used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model contain 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available at https://github.com/clarinsi/Slovene-BERT-Tool. The model released here is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers).
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreplacedby http://hdl.handle.net/11356/1397
dc.rights The MIT License (MIT)
dc.rights.uri https://opensource.org/licenses/mit-license.php
dc.rights.label PUB
dc.source.uri https://rsdo.slovenscina.eu/en/semantic-resources-and-technologies
dc.subject BERT
dc.subject RoBERTa
dc.subject word embeddings
dc.subject language model
dc.subject contextual embeddings
dc.title Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
files.count 4
files.size 444073197
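
The description states that word embeddings can be extracted for every word occurrence and that the model is intended for use with the transformers library. The following is a minimal sketch of that extraction, not the authors' own procedure. It assumes the four files of this record were downloaded into a local folder (the name `sloberta-1.0` is hypothetical) and that the CamemBERT classes of transformers are appropriate, which is inferred only from the statement that SloBERTa is closely related to CamemBERT. The function name `embed_tokens` is illustrative.

```python
from pathlib import Path

# Hypothetical local folder holding the files listed in this record.
MODEL_DIR = Path("sloberta-1.0")


def embed_tokens(text, model_dir=MODEL_DIR):
    """Return the subword tokens of `text` and one contextual vector per subword."""
    # Imported lazily so this file can be loaded without torch/transformers installed.
    import torch
    from transformers import CamembertModel, CamembertTokenizer

    # Assumption: the CamemBERT tokenizer and model classes fit this release.
    tokenizer = CamembertTokenizer.from_pretrained(str(model_dir))
    model = CamembertModel.from_pretrained(str(model_dir))
    model.eval()  # inference only, disables dropout

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    # last_hidden_state has shape (batch, sequence_length, hidden_size);
    # index 0 selects the single sentence in the batch.
    return tokens, outputs.last_hidden_state[0]


# Runs only when the model folder is actually present.
if __name__ == "__main__" and MODEL_DIR.is_dir():
    tokens, vectors = embed_tokens("Ljubljana je glavno mesto Slovenije.")
    print(len(tokens), tuple(vectors.shape))
```

The vectors returned here are per subword, so a word split into several pieces would need its piece vectors combined (for example averaged) before being used as a word embedding. For end tasks, the description notes that fine-tuning the whole model is the more typical route.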


 Files in this item

 Download all files in item (423.5 MB)
This item is publicly available and licensed under the MIT License (MIT).
Name: config.json
Size: 520 bytes
Format: Unknown
Description: Configuration file, describing the model's architecture
MD5: 00189fa49a298a01e689fafbffff9fb5

Name: pytorch_model.bin
Size: 422.32 MB
Format: Unknown
Description: SloBERTa model
MD5: e84c8114553e35003c8b5777534c8d4a

Name: sentencepiece.bpe.model
Size: 781.26 KB
Format: Unknown
Description: Sentencepiece model tokenizer
MD5: e4acd9c398b3fc68e1b2bc0f27d0d409

Name: dict.txt
Size: 424.06 KB
Format: Text file
Description: Sentencepiece subword token vocabulary
MD5: c3988aad6b209cf7f61667bbf2aba173
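
Each file in the list above is published with an MD5 checksum, which can be used to confirm that a download is complete and uncorrupted. The sketch below shows one way to do that with the Python standard library. The checksums are copied from this record; the function names and the idea of keeping all files in one folder are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file list in this record.
EXPECTED_MD5 = {
    "config.json": "00189fa49a298a01e689fafbffff9fb5",
    "pytorch_model.bin": "e84c8114553e35003c8b5777534c8d4a",
    "sentencepiece.bpe.model": "e4acd9c398b3fc68e1b2bc0f27d0d409",
    "dict.txt": "c3988aad6b209cf7f61667bbf2aba173",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(folder, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(folder) / name
        results[name] = (md5_of(path) == checksum) if path.is_file() else None
    return results
```

Calling `verify_downloads("sloberta-1.0")` (a hypothetical folder name) returns a dictionary with one entry per file. Reading in chunks keeps memory use low for the 422.32 MB model file.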
File preview (dict.txt)
<unk> 999 #fairseq:overwrite
<s> 999 #fairseq:overwrite
</s> 999 #fairseq:overwrite
▁p 999
▁s 999
je 999
na 999
ni 999
ra 999
▁v 999
re 999
▁d 999
st 999
▁i 999
ne 999
▁z 999
ko 999
no 999
▁po 999
▁o 999
▁t 999
li 999
ja 999
ri 999
▁na 999
la 999
▁k 999
lo 999
me 999
▁in 999
le 999
ro 999
va 999
▁za 999
▁je 999
ve 999
te 999
▁b 999
ti 999
mo 999
ga 999
vi 999
di 999
ka 999
ma 999
▁se 999
jo 999
ji 999
vo 999
nje 999
ci 999
da 999
to 999
go 999
▁pre 999
po 999
ta 999
mi 999
▁pri 999
se 999
▁u 999
ke 999
ki 999
▁da 999
▁ko 999
de 999
▁do 999
▁te 999
▁so 999
če 999
▁ne 999
ce 999
▁iz 999
či 999
ju 999
sti 999
▁bi 999
▁a 999
▁pa 999
▁ra 999
▁P 999
▁ki 999
▁ka 999
▁od 999
▁ob 999
nih 999
▁de 999
▁S 999
ča 999
ru 999
▁ve 999
▁bo 999
do 999
sta 999
▁š 999
nja 999
▁( 999
▁me 999
▁1 999
že 999
▁mo 999
▁raz 999
ku 999
▁tu 999
bi 999
lja 999
▁e 999
▁V 999
▁2 999
▁pro 999
▁le 999
lje 999
ns 999
be 999
sa 999
sto 999
zi 999
za 999
▁N 999
▁T 999
▁ni 999
pa 999
bo 999
▁ta 999
pi 999
▁K 999
▁tudi 999 . . .
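
The preview above suggests a simple line format for dict.txt: a subword token, a number, and on some lines a trailing `#fairseq:overwrite` flag. The sketch below parses lines of that shape. It is based only on the lines visible in the preview; the record does not say what the number (999 on every previewed line) represents, so it is stored without interpretation. The `▁` character (U+2581) is the SentencePiece marker for a piece that starts a new word. The names `VocabEntry` and `parse_dict_lines` are illustrative.

```python
from typing import Iterable, List, NamedTuple

WORD_START = "\u2581"  # SentencePiece marker for a word-initial piece
OVERWRITE_FLAG = " #fairseq:overwrite"


class VocabEntry(NamedTuple):
    token: str
    count: int        # second column; its meaning is not stated in the record
    overwrite: bool   # True when the line carries the overwrite flag


def parse_dict_lines(lines: Iterable[str]) -> List[VocabEntry]:
    """Parse 'token number [#fairseq:overwrite]' lines, skipping blank ones."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        overwrite = line.endswith(OVERWRITE_FLAG)
        if overwrite:
            line = line[: -len(OVERWRITE_FLAG)]
        # Split from the right so the token itself is kept intact.
        token, count = line.rsplit(" ", 1)
        entries.append(VocabEntry(token, int(count), overwrite))
    return entries


# A few lines taken from the preview.
sample = [
    "<unk> 999 #fairseq:overwrite",
    WORD_START + "p 999",
    "je 999",
    WORD_START + "tudi 999",
]
entries = parse_dict_lines(sample)
word_initial = [e.token for e in entries if e.token.startswith(WORD_START)]
print(word_initial)
```

For the full file, the same function can be applied to an open file handle, for example `parse_dict_lines(open("dict.txt", encoding="utf-8"))`, assuming the file is UTF-8 encoded.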
                                            
