ELMo embeddings models for seven languages

Name: ELMo embeddings models for seven languages
License: https://opensource.org/licenses/Apache-2.0

Ulčar, Matej

dc.contributor.author	Ulčar, Matej
dc.date.accessioned	2019-11-25T14:34:36Z
dc.date.available	2019-11-25T14:34:36Z
dc.date.issued	2019-11-25
dc.identifier.uri	http://hdl.handle.net/11356/1277
dc.description	ELMo language models (https://github.com/allenai/bilm-tf) for producing contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish. Each language's model was trained for approximately 10 epochs. The training corpora range in size from over 270 M tokens for Latvian to almost 2 B tokens for Croatian. For each language model, roughly the 1 million most common tokens were provided as the vocabulary during training. The models can also produce embeddings for out-of-vocabulary (OOV) words, since the neural network input is at the character level. Each model is in its own .tar.gz archive, which contains two files: PyTorch weights (.hdf5) and options (.json). Both files are needed for model inference with the allennlp Python library (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md).
dc.language.iso	slv
dc.language.iso	hrv
dc.language.iso	fin
dc.language.iso	est
dc.language.iso	lav
dc.language.iso	lit
dc.language.iso	swe
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.relation	info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby	https://arxiv.org/abs/1911.10049
dc.relation.replaces	http://hdl.handle.net/11356/1257
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/licenses/Apache-2.0
dc.rights.label	PUB
dc.source.uri	http://embeddia.eu
dc.subject	ELMo
dc.subject	contextual embeddings
dc.subject	word embeddings
dc.title	ELMo embeddings models for seven languages
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	other
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Matej Ulčar matej.ulcar@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info	7 files
size.info	1.4 GB
files.count	7
files.size	1450271450
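
The description above says that each archive holds an options file (.json) and a weights file (.hdf5), and that both are needed for inference with allennlp. The following is a minimal sketch of that workflow, not the author's own procedure. The archive name `slovenian-elmo.tar.gz`, the output directory and the helper names `extract_model_files` and `embed` are hypothetical, and the file names inside the archives are not stated in this record, so the files are located by extension. The `Elmo` and `batch_to_ids` calls are those of the `allennlp.modules.elmo` module; the sketch assumes an allennlp version that provides them is installed.

```python
import tarfile
from pathlib import Path


def extract_model_files(archive_path, dest_dir):
    """Extract the options (.json) and weights (.hdf5) files from one model archive.

    Returns a tuple (options_path, weights_path). Files are matched by extension
    because the names used inside the archives are an unknown here.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    found = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            suffix = Path(member.name).suffix.lower()
            if suffix in (".json", ".hdf5"):
                tar.extract(member, path=dest)
                found[suffix] = dest / member.name
    missing = {".json", ".hdf5"} - set(found)
    if missing:
        raise FileNotFoundError(f"archive lacks files with extensions: {sorted(missing)}")
    return found[".json"], found[".hdf5"]


def embed(options_file, weight_file, sentences):
    """Return contextual embeddings for a batch of pre-tokenized sentences."""
    # Imported lazily so the extraction helper works without allennlp installed.
    from allennlp.modules.elmo import Elmo, batch_to_ids

    elmo = Elmo(str(options_file), str(weight_file),
                num_output_representations=1, dropout=0.0)
    # Tokens are converted to character ids, which is why OOV words are handled.
    character_ids = batch_to_ids(sentences)
    output = elmo(character_ids)
    # Tensor of shape (batch_size, max_sentence_length, embedding_dim).
    return output["elmo_representations"][0]


if __name__ == "__main__":
    # Hypothetical archive name and target directory.
    options, weights = extract_model_files("slovenian-elmo.tar.gz", "models/sl")
    vectors = embed(options, weights, [["To", "je", "stavek", "."]])
    print(vectors.shape)
```

With `num_output_representations=1`, `Elmo` returns a single learned weighted combination of the model's layers; a downstream task that needs the individual layers would have to use a different entry point.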