Show simple item record
dc.contributor.author |
Ulčar, Matej |
dc.date.accessioned |
2019-11-25T14:34:36Z |
dc.date.available |
2019-11-25T14:34:36Z |
dc.date.issued |
2019-11-25 |
dc.identifier.uri |
http://hdl.handle.net/11356/1277 |
dc.description |
ELMo language models (https://github.com/allenai/bilm-tf) for producing contextual word embeddings, trained on large monolingual corpora for seven languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish.
Each language's model was trained for approximately 10 epochs. Training corpus sizes range from over 270 M tokens for Latvian to almost 2 B tokens for Croatian. For each language model, roughly the 1 million most common tokens were provided as the vocabulary during training. The models can also produce embeddings for out-of-vocabulary (OOV) words, because the neural network input is at the character level.
Each model is distributed in its own .tar.gz archive containing two files: the PyTorch weights (.hdf5) and the options (.json). Both are needed for model inference with the allennlp Python library (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md). |
dc.language.iso |
slv |
dc.language.iso |
hrv |
dc.language.iso |
fin |
dc.language.iso |
est |
dc.language.iso |
lav |
dc.language.iso |
lit |
dc.language.iso |
swe |
dc.publisher |
Faculty of Computer and Information Science, University of Ljubljana |
dc.relation |
info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.relation.isreferencedby |
https://arxiv.org/abs/1911.10049 |
dc.relation.replaces |
http://hdl.handle.net/11356/1257 |
dc.rights |
Apache License 2.0 |
dc.rights.uri |
https://opensource.org/licenses/Apache-2.0 |
dc.rights.label |
PUB |
dc.source.uri |
http://embeddia.eu |
dc.subject |
ELMo |
dc.subject |
contextual embeddings |
dc.subject |
word embeddings |
dc.title |
ELMo embeddings models for seven languages |
dc.type |
toolService |
metashare.ResourceInfo#ContentInfo.detailedType |
other |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
yes |
branding |
CLARIN.SI data & tools |
contact.person |
Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
sponsor |
European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
size.info |
7 files |
size.info |
1.4 GB |
files.count |
7 |
files.size |
1450271450 |
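The description in dc.description says that each archive holds an options file (.json) and a weights file (.hdf5), and that both are needed for inference with allennlp. The following is a minimal sketch of that workflow, not the authors' own procedure. It assumes an allennlp release older than 1.0, which still provides `allennlp.commands.elmo.ElmoEmbedder`; the function names `unpack_model` and `embed_sentence` and the archive name in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path


def unpack_model(archive_path, target_dir):
    """Extract one model archive and return (options_path, weights_path)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    # The record states that each archive contains exactly two files:
    # the options (.json) and the weights (.hdf5).
    options = sorted(target.rglob("*.json"))
    weights = sorted(target.rglob("*.hdf5"))
    if len(options) != 1 or len(weights) != 1:
        raise ValueError("expected exactly one .json and one .hdf5 file")
    return options[0], weights[0]


def embed_sentence(options_path, weights_path, tokens):
    """Return the ELMo vectors for one pre-tokenized sentence."""
    # Assumption: allennlp < 1.0 is installed; later releases no longer
    # ship ElmoEmbedder. The import is local so that unpacking works
    # even when allennlp is not available.
    from allennlp.commands.elmo import ElmoEmbedder

    embedder = ElmoEmbedder(str(options_path), str(weights_path))
    # The result is a NumPy array indexed by layer, token and dimension.
    return embedder.embed_sentence(tokens)
```

With a downloaded archive, hypothetically named `slovenian-elmo.tar.gz`, a call might look like `options, weights = unpack_model("slovenian-elmo.tar.gz", "models/sl")` followed by `embed_sentence(options, weights, ["Ljubljana", "je", "glavno", "mesto", "."])`. Because the input is processed at the character level, tokens outside the training vocabulary also receive vectors.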