dc.contributor.author | Terčon, Luka |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Erjavec, Tomaž |
dc.date.accessioned | 2023-04-14T13:19:44Z |
dc.date.available | 2023-04-14T13:19:44Z |
dc.date.issued | 2023-04-11 |
dc.identifier.uri | http://hdl.handle.net/11356/1791 |
dc.description | CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC, and MaCoCu-sl. The embeddings are based on the skip-gram model of fastText, trained on 5,791,405,942 tokens of running text, and cover 3,471,054 lowercased surface forms. This version differs from the previous one in that the original training dataset was expanded with the MaCoCu-sl web crawl corpus (http://hdl.handle.net/11356/1517). |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.replaces | http://hdl.handle.net/11356/1204 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | word embeddings |
dc.title | Word embeddings CLARIN.SI-embed.sl 2.0 |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | computationalLexicon |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
contact.person | Luka Terčon luka.tercon@gmail.com Faculty of Computer and Information Science, University of Ljubljana |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other |
sponsor | Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 3471054 entries |
files.count | 2 |
files.size | 4531095604 |
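
The text-format file can be read with a few lines of plain Python. The sketch below assumes the usual fastText `.vec` layout (a header line with the entry count and the vector dimension, then one surface form per line followed by its vector components); the helper names `read_vec` and `cosine` and the three toy vectors are hypothetical and are not taken from the published files, whose dimension is not stated in this record.

```python
import io
import math


def read_vec(stream, limit=None):
    """Parse fastText text-format embeddings from an open text stream.

    Assumed layout: first line "<entry count> <dimension>", then one line
    per entry: "<surface form> <v1> <v2> ... <v_dimension>".
    """
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for index, line in enumerate(stream):
        if limit is not None and index >= limit:
            break
        parts = line.rstrip().split(" ")
        # The last `dim` fields are the vector; everything before is the form.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(value) for value in parts[-dim:]]
    return count, dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


# Toy stand-in for the real file (hypothetical values, dimension 2).
sample = io.StringIO("3 2\nhiša 1.0 0.0\ndom 0.8 0.6\npes 0.0 1.0\n")
count, dim, vectors = read_vec(sample)
similarity = cosine(vectors["hiša"], vectors["dom"])
```

For the real file, the same function could be given `open(path, encoding="utf-8")` on the unzipped `.vec` file, with `limit` set to keep only the first entries in memory; the header count would then be expected to match the 3471054 entries listed above.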
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name | all-token-prelim.ft.sg.vec.zip |
Size | 1.1 GB |
Format | application/zip |
Description | Token embeddings, text format |
MD5 | 456d31cc1463242fbd242b4598515554 |

Name | all-token-prelim.ft.sg.bin.zip |
Size | 3.12 GB |
Format | application/zip |
Description | Token embeddings, binary |
MD5 | d2de7b3d5a1c90a5c8de702a6322fc06 |
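
The MD5 values listed for the two archives can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` and the temporary demo file are hypothetical, and the archive names in `PUBLISHED_MD5` are assumed to be the local file names after download.

```python
import hashlib
import os
import tempfile

# Checksums as published in this record.
PUBLISHED_MD5 = {
    "all-token-prelim.ft.sg.vec.zip": "456d31cc1463242fbd242b4598515554",
    "all-token-prelim.ft.sg.bin.zip": "d2de7b3d5a1c90a5c8de702a6322fc06",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """True if the file's MD5 equals the expected hex digest (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()


# Demonstration on a small temporary file instead of the multi-gigabyte archives.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
demo_ok = matches_checksum(demo_path, hashlib.md5(b"hello").hexdigest())
os.remove(demo_path)
```

For a real download, a call such as `matches_checksum("all-token-prelim.ft.sg.vec.zip", PUBLISHED_MD5["all-token-prelim.ft.sg.vec.zip"])` would be expected to return `True` for an intact file.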