dc.contributor.author | Ulčar, Matej |
dc.contributor.author | Robnik-Šikonja, Marko |
dc.date.accessioned | 2020-07-09T12:32:41Z |
dc.date.available | 2020-07-09T12:32:41Z |
dc.date.issued | 2020-07-09 |
dc.identifier.uri | http://hdl.handle.net/11356/1330 |
dc.description | Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. A state-of-the-art tool that represents words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT is distributed as neural network weights and configuration files in PyTorch format (i.e., to be used with the PyTorch library). Changes in version 1.1: fixed the vocab.txt file, as the previous version had an error causing very bad results during fine-tuning and/or evaluation. |
dc.language.iso | hrv |
dc.language.iso | slv |
dc.language.iso | eng |
dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.relation.isreferencedby | https://arxiv.org/abs/2006.07890 |
dc.relation.replaces | http://hdl.handle.net/11356/1317 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://embeddia.eu |
dc.subject | word embeddings |
dc.subject | multilingual |
dc.subject | contextual embeddings |
dc.subject | BERT |
dc.subject | language model |
dc.title | CroSloEngual BERT 1.1 |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
files.count | 3 |
files.size | 499491056 |
Files in this item
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name | Size | Format | Description | MD5 |
config.json | 520 bytes | Unknown | Configuration file, describing the model's architecture | db3bdd5c4db6ffffa9bf3edab2e7be70 |
pytorch_model.bin | 476.04 MB | Unknown | CroSloEngual BERT model | 6b26401118943bf61b66a70d2ae68b9d |
vocab.txt | 321.42 KB | Text file | Subword token (WordPiece) vocabulary | 08ab5bc48cb5a041611ed062eb368790 |
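The MD5 values in the file list above can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_download`, and the local download folder passed to them, are hypothetical and not part of this record.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file list of this record.
EXPECTED_MD5 = {
    "config.json": "db3bdd5c4db6ffffa9bf3edab2e7be70",
    "pytorch_model.bin": "6b26401118943bf61b66a70d2ae68b9d",
    "vocab.txt": "08ab5bc48cb5a041611ed062eb368790",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'.

    `directory` is a hypothetical local folder holding the downloaded files.
    """
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_download("path/to/download")` on the folder that holds the three files should report `ok` for each one if the files match the published checksums; a `mismatch` suggests a truncated or corrupted download.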
[PAD] [EOS] [unused00] [unused0] [unused1] [unused2] [unused3] [unused4] [unused5] [unused6] [unused7] [unused8] [unused9] [unused10] [unused11] [unused12] [unused13] [unused14] [unused15] [unused16] [unused17] [unused18] [unused19] [unused20] [unused21] [unused22] [unused23] [unused24] [unused25] [unused26] [unused27] [unused28] [unused29] [unused30] [unused31] [unused32] [unused33] [unused34] [unused35] [unused36] [unused37] [unused38] [unused39] [unused40] [unused41] [unused42] [unused43] [unused44] [unused45] [unused46] [unused47] [unused48] [unused49] [unused50] [unused51] [unused52] [unused53] [unused54] [unused55] [unused56] [unused57] [unused58] [unused59] [unused60] [unused61] [unused62] [unused63] [unused64] [unused65] [unused66] [unused67] [unused68] [unused69] [unused70] [unused71] [unused72] [unused73] [unused74] [unused75] [unused76] [unused77] [unused78] [unused79] [unused80] [unused81] [unused82] [unused83] [unused84] [unused85] [unused86] [unused87] [unused88] [unused8 . . .
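The preview above shows the first entries of vocab.txt. As a rough illustration of how such a WordPiece vocabulary is typically consumed, the sketch below reads the file into a token-to-id mapping. It assumes the conventional BERT layout of one token per line, with the id given by the zero-based line index; that layout is an assumption here, since the preview shows the tokens flattened onto one line. The function name `load_vocab` is hypothetical.

```python
def load_vocab(path):
    """Read a WordPiece vocabulary file into a token -> id dictionary.

    Assumption: one token per line, id equal to the zero-based line index.
    """
    vocab = {}
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            # Strip only the line terminator so tokens keep any other characters.
            token = line.rstrip("\r\n")
            vocab[token] = index
    return vocab


def invert_vocab(vocab):
    """Build the reverse id -> token mapping, e.g. for decoding model output."""
    return {index: token for token, index in vocab.items()}
```

If the file follows the order shown in the preview, `load_vocab("vocab.txt")["[PAD]"]` would be 0. In practice a tokenizer from a BERT-compatible library would normally read this file itself; the sketch only makes the file's structure explicit.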