Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

Name: Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0
License: https://opensource.org/licenses/mit-license.php

Ulčar, Matej; Robnik-Šikonja, Marko

dc.contributor.author	Ulčar, Matej
dc.contributor.author	Robnik-Šikonja, Marko
dc.date.accessioned	2020-12-29T16:55:33Z
dc.date.available	2020-12-29T16:55:33Z
dc.date.issued	2020-12-29
dc.identifier.uri	http://hdl.handle.net/11356/1387
dc.description	The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, which are used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model contain 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available on GitHub (https://github.com/clarinsi/Slovene-BERT-Tool). The released model is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers).
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.relation	info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreplacedby	http://hdl.handle.net/11356/1397
dc.rights	The MIT License (MIT)
dc.rights.uri	https://opensource.org/licenses/mit-license.php
dc.rights.label	PUB
dc.source.uri	https://rsdo.slovenscina.eu/en/semantic-resources-and-technologies
dc.subject	BERT
dc.subject	RoBERTa
dc.subject	word embeddings
dc.subject	language model
dc.subject	contextual embeddings
dc.title	Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	other
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
hidden	hidden
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
sponsor	Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
files.count	4
files.size	444073197
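
The description notes that the released files are intended for the transformers library. The following is a minimal sketch of extracting contextual embeddings, not the authors' own code. It assumes that the files listed on this page have been downloaded into a local directory (the name `sloberta` is hypothetical) and that, given the stated relation to CamemBERT, the CamemBERT classes in transformers match the architecture declared in config.json. It needs the torch, transformers and sentencepiece packages; `embed_tokens` is an illustrative helper name.

```python
from pathlib import Path


def embed_tokens(sentence, model_dir="sloberta"):
    """Return (tokens, embeddings) for one sentence.

    model_dir is a hypothetical local folder holding config.json,
    pytorch_model.bin and sentencepiece.bpe.model from this item.
    """
    # Imported lazily so the function can be defined without the heavy dependencies.
    import torch
    from transformers import CamembertModel, CamembertTokenizer

    model_dir = Path(model_dir)
    # Assumption: the CamemBERT classes fit this checkpoint.
    tokenizer = CamembertTokenizer.from_pretrained(str(model_dir))
    model = CamembertModel.from_pretrained(str(model_dir))
    model.eval()  # inference mode: disables dropout

    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    # One vector per subword token, including the <s> and </s> markers.
    return tokens, outputs.last_hidden_state[0]
```

Calling `embed_tokens("Ljubljana je glavno mesto Slovenije.")` would return the subword tokens and one embedding vector per token. For fine-tuning end-to-end, as the description suggests is typical, a task-specific head would be trained on top of the same model instead.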

Files in this item

Download all files in item (423.5 MB)

This item is publicly available and licensed under The MIT License (MIT).

Name: config.json
Size: 520 bytes
Format: Unknown
Description: Configuration file, describing the model's architecture
MD5: 00189fa49a298a01e689fafbffff9fb5

Name: pytorch_model.bin
Size: 422.32 MB
Format: Unknown
Description: SloBERTa model
MD5: e84c8114553e35003c8b5777534c8d4a

Name: sentencepiece.bpe.model
Size: 781.26 KB
Format: Unknown
Description: Sentencepiece model tokenizer
MD5: e4acd9c398b3fc68e1b2bc0f27d0d409

Name: dict.txt
Size: 424.06 KB
Format: Text file
Description: Sentencepiece subword token vocabulary
MD5: c3988aad6b209cf7f61667bbf2aba173
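
The MD5 values listed for the four files can be used to check a download. A minimal sketch using only the Python standard library; the checksums are copied from this page, while the helper names `md5_of` and `verify` and the idea of a single local download directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed on this page.
EXPECTED_MD5 = {
    "config.json": "00189fa49a298a01e689fafbffff9fb5",
    "pytorch_model.bin": "e84c8114553e35003c8b5777534c8d4a",
    "sentencepiece.bpe.model": "e4acd9c398b3fc68e1b2bc0f27d0d409",
    "dict.txt": "c3988aad6b209cf7f61667bbf2aba173",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so the large model file is never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = md5_of(path) == checksum if path.is_file() else None
    return results
```

For example, `verify("sloberta")` would report, per file, whether the copy in a hypothetical `sloberta` directory matches the published checksum.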

File Preview

<unk> 999 #fairseq:overwrite
<s> 999 #fairseq:overwrite
</s> 999 #fairseq:overwrite
▁p 999
▁s 999
je 999
na 999
ni 999
ra 999
▁v 999
re 999
▁d 999
st 999
▁i 999
ne 999
▁z 999
ko 999
no 999
▁po 999
▁o 999
▁t 999
li 999
ja 999
ri 999
▁na 999
la 999
▁k 999
lo 999
me 999
▁in 999
le 999
ro 999
va 999
▁za 999
▁je 999
ve 999
te 999
▁b 999
ti 999
mo 999
ga 999
vi 999
di 999
ka 999
ma 999
▁se 999
jo 999
ji 999
vo 999
nje 999
ci 999
da 999
to 999
go 999
▁pre 999
po 999
ta 999
mi 999
▁pri 999
se 999
▁u 999
ke 999
ki 999
▁da 999
▁ko 999
de 999
▁do 999
▁te 999
▁so 999
če 999
▁ne 999
ce 999
▁iz 999
či 999
ju 999
sti 999
▁bi 999
▁a 999
▁pa 999
▁ra 999
▁P 999
▁ki 999
▁ka 999
▁od 999
▁ob 999
nih 999
▁de 999
▁S 999
ča 999
ru 999
▁ve 999
▁bo 999
do 999
sta 999
▁š 999
nja 999
▁( 999
▁me 999
▁1 999
že 999
▁mo 999
▁raz 999
ku 999
▁tu 999
bi 999
lja 999
▁e 999
▁V 999
▁2 999
▁pro 999
▁le 999
lje 999
ns 999
be 999
sa 999
sto 999
zi 999
za 999
▁N 999
▁T 999
▁ni 999
pa 999
bo 999
▁ta 999
pi 999
▁K 999
▁tudi 999 . . .
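
The preview shows one entry per line: a subword token, a count (999 on every previewed line) and, for the three special symbols at the top, the marker `#fairseq:overwrite`. The `▁` character (U+2581) is the SentencePiece marker for a piece that starts a word. A minimal sketch of reading this layout follows; it is inferred from the preview alone, and `parse_dict_line` and `load_vocab` are hypothetical helper names, not fairseq functions.

```python
OVERWRITE_FLAG = "#fairseq:overwrite"
WORD_START = "\u2581"  # the "▁" marker SentencePiece puts before word-initial pieces


def parse_dict_line(line):
    """Split one line into (token, count, overwrite_flag)."""
    fields = line.split()
    overwrite = fields[-1] == OVERWRITE_FLAG
    if overwrite:
        fields = fields[:-1]
    token, count = fields[0], int(fields[1])
    return token, count, overwrite


def load_vocab(lines):
    """Return tokens in file order, skipping blank lines."""
    return [parse_dict_line(line)[0] for line in lines if line.strip()]


# A few lines taken from the preview above.
sample = ["<unk> 999 #fairseq:overwrite", "▁p 999", "je 999", "▁tudi 999"]
vocab = load_vocab(sample)
word_initial = [t for t in vocab if t.startswith(WORD_START)]
print(vocab)
print(word_initial)
```

With the full file, `load_vocab(open("dict.txt", encoding="utf-8"))` would yield the tokens in file order, so a token's position in the list can serve as a simple index.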
