dc.contributor.author Ulčar, Matej
dc.contributor.author Robnik-Šikonja, Marko
dc.date.accessioned 2021-02-17T16:58:47Z
dc.date.available 2021-02-17T16:58:47Z
dc.date.issued 2021-01-17
dc.identifier.uri http://hdl.handle.net/11356/1397
dc.description The monolingual Slovene RoBERTa (Robustly Optimized BERT Pretraining Approach) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model contain 3.47 billion tokens in total; the subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and for training the model are available at https://github.com/clarinsi/Slovene-BERT-Tool. Compared with the previous version (1.0), this version was trained for a further 61 epochs (v1.0: 37 epochs, v2.0: 98 epochs), for a total of 200,000 iterations/updates. The released model is a PyTorch neural network model, intended for use with the transformers library, https://github.com/huggingface/transformers (sloberta.2.0.transformers.tar.gz), or the fairseq library, https://github.com/pytorch/fairseq (sloberta.2.0.fairseq.tar.gz); usage sketches for both follow the record below.
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.replaces http://hdl.handle.net/11356/1387
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://rsdo.slovenscina.eu/en/semantic-resources-and-technologies
dc.subject BERT
dc.subject RoBERTa
dc.subject word embeddings
dc.subject language model
dc.subject contextual embeddings
dc.title Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN.SI data & tools
contact.person Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
files.count 2
files.size 1387435323
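
The following is a minimal usage sketch for the transformers release, showing masked-token prediction. It assumes sloberta.2.0.transformers.tar.gz has been extracted to a hypothetical local directory ./sloberta-2.0 and that, since SloBERTa follows the CamemBERT architecture, the CamemBERT classes in transformers apply; the example sentence and the top-5 readout are illustrative only.

    import torch
    from transformers import CamembertForMaskedLM, CamembertTokenizer

    # Hypothetical path where sloberta.2.0.transformers.tar.gz was extracted;
    # the tokenizer picks up sentencepiece.bpe.model from this directory.
    MODEL_DIR = "./sloberta-2.0"

    tokenizer = CamembertTokenizer.from_pretrained(MODEL_DIR)
    model = CamembertForMaskedLM.from_pretrained(MODEL_DIR)
    model.eval()

    # "Ljubljana is the capital of <mask>." -- predict the masked token.
    sentence = f"Ljubljana je glavno mesto {tokenizer.mask_token}."
    inputs = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Top 5 candidate subword tokens for the masked position.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    top_ids = logits[0, mask_pos[0]].topk(5).indices
    print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))

The explicit CamemBERT classes are used here rather than AutoTokenizer/AutoModelForMaskedLM so the sketch does not depend on the extracted config resolving the architecture automatically.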


Files in this item

Name: sloberta.2.0.transformers.tar.gz
Size: 249.44 MB
Format: application/gzip
Description: SloBERTa 2.0 model for use with the transformers toolset.
MD5: 0afe61f4cdd7f2977db2a077bc3d4091
File preview:
    • config.json (520 B)
    • dict.txt (424 kB)
    • pytorch_model.bin (422 MB)
    • sentencepiece.bpe.model (781 kB)

Name: sloberta.2.0.fairseq.tar.gz
Size: 1.05 GB
Format: application/gzip
Description: SloBERTa 2.0 model for use with the fairseq toolset/library.
MD5: e0f9d421e2fd33a524fbed193c0f1dae
File preview:
    • dict.txt (424 kB)
    • model.pt (1 GB)
    • sl_spm.model (781 kB)
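
A corresponding sketch for the fairseq release, extracting contextual embeddings for each subword token as the description mentions. It assumes the archive is extracted to a hypothetical ./sloberta-2.0-fairseq directory; because the sentencepiece model ships as sl_spm.model rather than under fairseq's default name, it is passed explicitly, and the sentencepiece_model keyword assumes a recent fairseq release.

    from fairseq.models.roberta import RobertaModel

    # Hypothetical path where sloberta.2.0.fairseq.tar.gz was extracted.
    MODEL_DIR = "./sloberta-2.0-fairseq"

    # The archive names its sentencepiece model sl_spm.model, so point
    # fairseq at it explicitly instead of relying on auto-detection.
    sloberta = RobertaModel.from_pretrained(
        MODEL_DIR,
        checkpoint_file="model.pt",
        bpe="sentencepiece",
        sentencepiece_model=f"{MODEL_DIR}/sl_spm.model",
    )
    sloberta.eval()

    # Contextual embeddings for every subword token of the sentence.
    tokens = sloberta.encode("Ljubljana je glavno mesto Slovenije.")
    features = sloberta.extract_features(tokens)  # (1, seq_len, hidden_dim)
    print(features.shape)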
