| dc.contributor.author | Ulčar, Matej |
| dc.contributor.author | Robnik-Šikonja, Marko |
| dc.date.accessioned | 2020-07-09T12:32:41Z |
| dc.date.available | 2020-07-09T12:32:41Z |
| dc.date.issued | 2020-07-09 |
| dc.identifier.uri | http://hdl.handle.net/11356/1330 |
| dc.description | Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. A state-of-the-art tool that represents words/tokens as contextually dependent word embeddings and is used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e., to be used with the PyTorch library). Changes in version 1.1: fixed the vocab.txt file, as the previous version had an error causing very bad results during fine-tuning and/or evaluation. |
| dc.language.iso | hrv |
| dc.language.iso | slv |
| dc.language.iso | eng |
| dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
| dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
| dc.relation.isreferencedby | https://arxiv.org/abs/2006.07890 |
| dc.relation.replaces | http://hdl.handle.net/11356/1317 |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | http://embeddia.eu |
| dc.subject | word embeddings |
| dc.subject | multilingual |
| dc.subject | contextual embeddings |
| dc.subject | BERT |
| dc.subject | language model |
| dc.title | CroSloEngual BERT 1.1 |
| dc.type | toolService |
| metashare.ResourceInfo#ContentInfo.detailedType | tool |
| metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Matej Ulčar; matej.ulcar@fri.uni-lj.si; Faculty of Computer and Information Science, University of Ljubljana |
| sponsor | European Union; EC/H2020/825153; EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media; euFunds; info:eu-repo/grantAgreement/EC/H2020/825153 |
| files.count | 3 |
| files.size | 499491056 |
Files in this item
Download all files in item (476.35 MB)
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
| Name | Size | Format | Description | MD5 |
| config.json | 520 bytes | Unknown | Configuration file, describing the model's architecture | db3bdd5c4db6ffffa9bf3edab2e7be70 |
| pytorch_model.bin | 476.04 MB | Unknown | CroSloEngual BERT model | 6b26401118943bf61b66a70d2ae68b9d |
| vocab.txt | 321.42 KB | Text file | Subword token (WordPiece) vocabulary | 08ab5bc48cb5a041611ed062eb368790 |
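
The MD5 values listed for the three files can be used to check that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_download` and the download directory are assumptions made for illustration; they are not part of this item.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "config.json": "db3bdd5c4db6ffffa9bf3edab2e7be70",
    "pytorch_model.bin": "6b26401118943bf61b66a70d2ae68b9d",
    "vocab.txt": "08ab5bc48cb5a041611ed062eb368790",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory use low for the ~476 MB weights file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == checksum if path.is_file() else None
    return results
```

Calling `verify_download("path/to/download")` (a hypothetical directory) returns one entry per listed file. A `False` for vocab.txt could, for example, indicate that the file from the replaced version 1.0 item is still in use.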
[PAD]
[EOS]
[unused00]
[unused0]
[unused1]
[unused2]
[unused3]
[unused4]
[unused5]
[unused6]
[unused7]
[unused8]
[unused9]
[unused10]
[unused11]
[unused12]
[unused13]
[unused14]
[unused15]
[unused16]
[unused17]
[unused18]
[unused19]
[unused20]
[unused21]
[unused22]
[unused23]
[unused24]
[unused25]
[unused26]
[unused27]
[unused28]
[unused29]
[unused30]
[unused31]
[unused32]
[unused33]
[unused34]
[unused35]
[unused36]
[unused37]
[unused38]
[unused39]
[unused40]
[unused41]
[unused42]
[unused43]
[unused44]
[unused45]
[unused46]
[unused47]
[unused48]
[unused49]
[unused50]
[unused51]
[unused52]
[unused53]
[unused54]
[unused55]
[unused56]
[unused57]
[unused58]
[unused59]
[unused60]
[unused61]
[unused62]
[unused63]
[unused64]
[unused65]
[unused66]
[unused67]
[unused68]
[unused69]
[unused70]
[unused71]
[unused72]
[unused73]
[unused74]
[unused75]
[unused76]
[unused77]
[unused78]
[unused79]
[unused80]
[unused81]
[unused82]
[unused83]
[unused84]
[unused85]
[unused86]
[unused87]
[unused88]
[unused8 . . .
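
The preview above shows the layout of vocab.txt: one token per line. In BERT-style WordPiece vocabularies, a token's ID is conventionally its zero-based line index. The sketch below illustrates that mapping, assuming the preview starts at the first line of the file. The function names `load_vocab` and `load_vocab_file` are hypothetical, and the sample holds only the first four lines of the preview.

```python
def load_vocab(lines):
    """Map each token to its zero-based line index (its token ID)."""
    vocab = {}
    for index, line in enumerate(lines):
        # Strip only the line terminator so the token text stays intact.
        token = line.rstrip("\n")
        vocab[token] = index
    return vocab


def load_vocab_file(path):
    """Read a vocab.txt-style file (UTF-8, one token per line)."""
    with open(path, encoding="utf-8") as handle:
        return load_vocab(handle)


# First four lines of the preview, used here as in-memory sample data.
sample = ["[PAD]\n", "[EOS]\n", "[unused00]\n", "[unused0]\n"]
vocab = load_vocab(sample)
print(vocab["[unused0]"])  # prints 3
```

With the downloaded file, `load_vocab_file("vocab.txt")` would return the full mapping. In practice, a tokenizer library would normally read this file itself.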