
 
dc.contributor.author Terčon, Luka
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2023-04-14T13:19:44Z
dc.date.available 2023-04-14T13:19:44Z
dc.date.issued 2023-04-11
dc.identifier.uri http://hdl.handle.net/11356/1791
dc.description CLARIN.SI-embed.sl contains word embeddings induced from a large collection of Slovene texts composed of existing corpora of Slovene, e.g. GigaFida, Janes, KAS, slWaC, and MaCoCu-sl. The embeddings are based on the skip-gram model of fastText, trained on 5,791,405,942 tokens of running text, and cover 3,471,054 lowercased surface forms. Unlike the previous version of the embeddings, this version was trained on the original dataset expanded with the MaCoCu-sl web crawl corpus (http://hdl.handle.net/11356/1517).
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.replaces http://hdl.handle.net/11356/1204
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.subject word embeddings
dc.title Word embeddings CLARIN.SI-embed.sl 2.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType computationalLexicon
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
contact.person Luka Terčon luka.tercon@gmail.com Faculty of Computer and Information Science, University of Ljubljana
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 3471054 entries
files.count 2
files.size 4531095604
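
The text-format file can be read without the fastText library. The sketch below assumes the unpacked file follows the usual fastText `.vec` layout (a header line with the entry count and the dimensionality, then one line per surface form followed by its vector components); the record itself does not state the layout or the dimensionality. The function name `read_vec` and the sample vectors are hypothetical and only illustrate the parsing.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def read_vec(
    lines: Iterable[str], limit: Optional[int] = None
) -> Tuple[int, Dict[str, List[float]]]:
    """Parse fastText text-format embeddings from an iterable of lines.

    Assumes the conventional layout: "<count> <dim>" on the first line,
    then "<word> <v1> ... <vdim>" on each following line.
    Returns the declared dimensionality and a word -> vector mapping.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header[0] is the declared number of entries
    vectors: Dict[str, List[float]] = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break  # stop early; the full file holds millions of entries
        # Split from the right so that only the last `dim` fields are numbers.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimensionality
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Tiny in-memory sample with made-up values (not taken from the dataset).
sample = ["2 3", "hiša 0.1 0.2 0.3 \n", "pes 0.0 -1.0 0.5\n"]
dim, vecs = read_vec(sample)
```

With the real data, the same function could be called on an open file handle, for example `read_vec(open("all-token-prelim.ft.sg.vec", encoding="utf-8"), limit=10000)`, where the unpacked file name is an assumption based on the archive name. The `limit` argument keeps memory use small when only the first entries are needed.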


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name all-token-prelim.ft.sg.vec.zip
Size 1.1 GB
Format application/zip
Description Token embeddings, text format
MD5 456d31cc1463242fbd242b4598515554

Name all-token-prelim.ft.sg.bin.zip
Size 3.12 GB
Format application/zip
Description Token embeddings, binary
MD5 d2de7b3d5a1c90a5c8de702a6322fc06
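
The MD5 values listed for the two archives can be used to check a download. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify`, and the assumption that the archives sit in the current directory under their listed names, are illustrative and not part of the record.

```python
import hashlib

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "all-token-prelim.ft.sg.vec.zip": "456d31cc1463242fbd242b4598515554",
    "all-token-prelim.ft.sg.bin.zip": "d2de7b3d5a1c90a5c8de702a6322fc06",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use constant for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify("all-token-prelim.ft.sg.vec.zip", EXPECTED_MD5["all-token-prelim.ft.sg.vec.zip"])` returns `True` only when the local file matches the listed checksum. A mismatch usually indicates an incomplete or corrupted download.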
