dc.contributor.author | Ulčar, Matej |
dc.contributor.author | Robnik-Šikonja, Marko |
dc.date.accessioned | 2021-02-17T16:58:47Z |
dc.date.available | 2021-02-17T16:58:47Z |
dc.date.issued | 2021-01-17 |
dc.identifier.uri | http://hdl.handle.net/11356/1397 |
dc.description | The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. The SloBERTa model is closely related to the French CamemBERT model https://camembert-model.fr/. The corpora used for training the model contain 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and for training the model are available at https://github.com/clarinsi/Slovene-BERT-Tool. Compared with the previous version (1.0), this version was trained for a further 61 epochs (v1.0: 37 epochs, v2.0: 98 epochs), for a total of 200,000 iterations/updates. The model released here is a PyTorch neural network model, intended for use with the transformers library https://github.com/huggingface/transformers (sloberta.2.0.transformers.tar.gz) or the fairseq library https://github.com/pytorch/fairseq (sloberta.2.0.fairseq.tar.gz) |
dc.language.iso | slv |
dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.relation.replaces | http://hdl.handle.net/11356/1387 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://rsdo.slovenscina.eu/en/semantic-resources-and-technologies |
dc.subject | BERT |
dc.subject | RoBERTa |
dc.subject | word embeddings |
dc.subject | language model |
dc.subject | contextual embeddings |
dc.title | Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0 |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | other |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
sponsor | Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other |
files.count | 2 |
files.size | 1387435323 |
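The description above notes that the released model is intended for use with the Hugging Face transformers library. A minimal loading sketch, assuming the sloberta.2.0.transformers.tar.gz archive has been extracted to a local directory (the `./sloberta-2.0` path below is a hypothetical example, not part of the record):

```python
# Sketch: loading SloBERTa 2.0 with the Hugging Face transformers library.
# Assumes sloberta.2.0.transformers.tar.gz has been extracted to model_dir;
# the directory name is a hypothetical placeholder.
from transformers import AutoModelForMaskedLM, AutoTokenizer


def load_sloberta(model_dir="./sloberta-2.0"):
    """Load the SloBERTa tokenizer and masked-LM model from a local directory."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForMaskedLM.from_pretrained(model_dir)
    return tokenizer, model
```

For fine-tuning on an end task, the same directory can be passed to the task-specific `Auto*` classes (e.g. `AutoModelForSequenceClassification`) instead of the masked-LM head.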
Files in this item
- Name
- sloberta.2.0.transformers.tar.gz
- Size
- 249.44 MB
- Format
- application/gzip
- Description
- SloBERTa 2.0 model for use with the transformers library.
- MD5
- 0afe61f4cdd7f2977db2a077bc3d4091

- Name
- sloberta.2.0.fairseq.tar.gz
- Size
- 1.05 GB
- Format
- application/gzip
- Description
- SloBERTa 2.0 model for use with the fairseq library.
- MD5
- e0f9d421e2fd33a524fbed193c0f1dae
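The record lists an MD5 checksum for each archive. A standard-library sketch for verifying a downloaded archive against these checksums before extracting it (the local file path passed to `verify` is up to the downloader):

```python
# Sketch: verifying a downloaded SloBERTa archive against the MD5
# checksums listed in this record, using only the Python standard library.
import hashlib

# Checksums as listed in the file entries above.
EXPECTED_MD5 = {
    "sloberta.2.0.transformers.tar.gz": "0afe61f4cdd7f2977db2a077bc3d4091",
    "sloberta.2.0.fairseq.tar.gz": "e0f9d421e2fd33a524fbed193c0f1dae",
}


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, archive_name):
    """Return True if the file at `path` matches the listed checksum."""
    return md5sum(path) == EXPECTED_MD5[archive_name]
```

Chunked reading keeps memory use constant, which matters for the 1.05 GB fairseq archive.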