| dc.contributor.author | Ulčar, Matej |
| dc.contributor.author | Robnik-Šikonja, Marko |
| dc.date.accessioned | 2020-12-29T16:55:33Z |
| dc.date.available | 2020-12-29T16:55:33Z |
| dc.date.issued | 2020-12-29 |
| dc.identifier.uri | http://hdl.handle.net/11356/1387 |
| dc.description | The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, which are used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model https://camembert-model.fr/. The corpora used for training the model contain 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available at https://github.com/clarinsi/Slovene-BERT-Tool. The released model is a PyTorch neural network model, intended for use with the transformers library https://github.com/huggingface/transformers. |
| dc.language.iso | slv |
| dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
| dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
| dc.relation.isreplacedby | http://hdl.handle.net/11356/1397 |
| dc.rights | The MIT License (MIT) |
| dc.rights.uri | https://opensource.org/licenses/mit-license.php |
| dc.rights.label | PUB |
| dc.source.uri | https://rsdo.slovenscina.eu/en/semantic-resources-and-technologies |
| dc.subject | BERT |
| dc.subject | RoBERTa |
| dc.subject | word embeddings |
| dc.subject | language model |
| dc.subject | contextual embeddings |
| dc.title | Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0 |
| dc.type | toolService |
| metashare.ResourceInfo#ContentInfo.detailedType | other |
| metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
| hidden | hidden |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
| sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
| sponsor | Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other |
| files.count | 4 |
| files.size | 444073197 |
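The description above says the released model is intended for use with the transformers library. The following is a minimal sketch of how the downloaded files might be loaded, not a documented procedure for this record. It assumes that the files have been saved unchanged into one local directory and that `config.json` describes a CamemBERT-compatible architecture, which the description suggests but does not state. The helper names `missing_files`, `load_sloberta` and `embed` are hypothetical.

```python
from pathlib import Path

# Files from this item that the CamemBERT classes look for by name.
# dict.txt is not listed because this loading path does not read it.
REQUIRED_FILES = ("config.json", "pytorch_model.bin", "sentencepiece.bpe.model")


def missing_files(model_dir):
    """Return the names of required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def load_sloberta(model_dir):
    """Load tokenizer and encoder from a local directory (hypothetical helper).

    Assumption: config.json is compatible with the CamemBERT classes.
    """
    missing = missing_files(model_dir)
    if missing:
        raise FileNotFoundError(
            f"missing from {model_dir}: {', '.join(missing)}"
        )
    # Imported lazily so the file check works without transformers installed.
    from transformers import CamembertModel, CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained(str(model_dir))
    model = CamembertModel.from_pretrained(str(model_dir))
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return one contextual vector per subword token of `text`."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (number of subword tokens incl. special tokens, hidden size)
    return outputs.last_hidden_state[0]
```

With the files in a directory such as `./sloberta` (a placeholder path), `tokenizer, model = load_sloberta("./sloberta")` followed by `embed("Ljubljana je glavno mesto Slovenije.", tokenizer, model)` would yield the per-token embeddings that the description mentions. Fine-tuning end-to-end would instead use a task-specific model class.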
Files in this item
Download all files in item (423.5 MB)
| Name | Size | Format | Description | MD5 |
| config.json | 520 bytes | Unknown | Configuration file, describing the model's architecture | 00189fa49a298a01e689fafbffff9fb5 |
| pytorch_model.bin | 422.32 MB | Unknown | SloBERTa model | e84c8114553e35003c8b5777534c8d4a |
| sentencepiece.bpe.model | 781.26 KB | Unknown | Sentencepiece model tokenizer | e4acd9c398b3fc68e1b2bc0f27d0d409 |
| dict.txt | 424.06 KB | Text file | Sentencepiece subword token vocabulary | c3988aad6b209cf7f61667bbf2aba173 |
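The MD5 values listed for the four files can be used to check a download for corruption. The sketch below shows one way to do this with Python's standard `hashlib` module. The checksums are copied from the listing above; the function names `md5_of` and `verify` and the directory layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksums as published in the file listing of this item.
EXPECTED_MD5 = {
    "config.json": "00189fa49a298a01e689fafbffff9fb5",
    "pytorch_model.bin": "e84c8114553e35003c8b5777534c8d4a",
    "sentencepiece.bpe.model": "e4acd9c398b3fc68e1b2bc0f27d0d409",
    "dict.txt": "c3988aad6b209cf7f61667bbf2aba173",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    root = Path(directory)
    return {
        name: (root / name).is_file() and md5_of(root / name) == checksum
        for name, checksum in expected.items()
    }
```

Calling `verify("./sloberta")` on the download directory (a placeholder path) returns a dictionary with one boolean per file. Reading in chunks keeps memory use low for the 422.32 MB model file.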
<unk> 999 #fairseq:overwrite
<s> 999 #fairseq:overwrite
</s> 999 #fairseq:overwrite
▁p 999
▁s 999
je 999
na 999
ni 999
ra 999
▁v 999
re 999
▁d 999
st 999
▁i 999
ne 999
▁z 999
ko 999
no 999
▁po 999
▁o 999
▁t 999
li 999
ja 999
ri 999
▁na 999
la 999
▁k 999
lo 999
me 999
▁in 999
le 999
ro 999
va 999
▁za 999
▁je 999
ve 999
te 999
▁b 999
ti 999
mo 999
ga 999
vi 999
di 999
ka 999
ma 999
▁se 999
jo 999
ji 999
vo 999
nje 999
ci 999
da 999
to 999
go 999
▁pre 999
po 999
ta 999
mi 999
▁pri 999
se 999
▁u 999
ke 999
ki 999
▁da 999
▁ko 999
de 999
▁do 999
▁te 999
▁so 999
če 999
▁ne 999
ce 999
▁iz 999
či 999
ju 999
sti 999
▁bi 999
▁a 999
▁pa 999
▁ra 999
▁P 999
▁ki 999
▁ka 999
▁od 999
▁ob 999
nih 999
▁de 999
▁S 999
ča 999
ru 999
▁ve 999
▁bo 999
do 999
sta 999
▁š 999
nja 999
▁( 999
▁me 999
▁1 999
že 999
▁mo 999
▁raz 999
ku 999
▁tu 999
bi 999
lja 999
▁e 999
▁V 999
▁2 999
▁pro 999
▁le 999
lje 999
ns 999
be 999
sa 999
sto 999
zi 999
za 999
▁N 999
▁T 999
▁ni 999
pa 999
bo 999
▁ta 999
pi 999
▁K 999
▁tudi 999 . . .
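Each line of the `dict.txt` preview above holds a subword token, a count, and in the first three lines a `#fairseq:overwrite` flag. The following sketch parses lines of that shape. It is an illustration based only on the excerpt shown here, not the parser used by the authors; the names `DictEntry`, `parse_dict_lines` and `starts_word` are hypothetical. The `▁` character is the SentencePiece marker for a token that begins a new word.

```python
from typing import Iterable, List, NamedTuple

OVERWRITE_FLAG = "#fairseq:overwrite"
WORD_START = "\u2581"  # the "▁" marker SentencePiece puts before word-initial pieces


class DictEntry(NamedTuple):
    token: str
    count: int
    overwrite: bool


def parse_dict_lines(lines: Iterable[str]) -> List[DictEntry]:
    """Parse 'token count [#fairseq:overwrite]' lines in file order."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        overwrite = line.endswith(OVERWRITE_FLAG)
        if overwrite:
            line = line[: -len(OVERWRITE_FLAG)].rstrip()
        # Split on the last space so the token itself is kept intact.
        token, count = line.rsplit(" ", 1)
        entries.append(DictEntry(token, int(count), overwrite))
    return entries


def starts_word(token: str) -> bool:
    """True if the subword token begins a new word."""
    return token.startswith(WORD_START)


# A few lines copied from the preview above.
sample = [
    "<unk> 999 #fairseq:overwrite",
    "\u2581p 999",
    "je 999",
    "\u2581na 999",
]
for entry in parse_dict_lines(sample):
    kind = "word-initial" if starts_word(entry.token) else "continuation or special"
    print(entry.token, entry.count, entry.overwrite, kind)
```

To process the whole file, the same function can be given an open file handle, for example `parse_dict_lines(open("dict.txt", encoding="utf-8"))`. The truncation marker at the end of the preview is not part of the file format and would not parse.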