dc.contributor.author | Terčon, Luka |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2023-04-14T08:39:17Z |
dc.date.available | 2023-04-14T08:39:17Z |
dc.date.issued | 2023-04-11 |
dc.identifier.uri | http://hdl.handle.net/11356/1796 |
dc.description | CLARIN.SI-embed.bg contains word embeddings for Bulgarian induced from the MaCoCu-bg web crawl corpus (http://hdl.handle.net/11356/1515). The embeddings are based on the skip-gram model of fastText trained on 4,120,343,820 tokens of running text for 2,746,640 lowercased surface forms. |
dc.language.iso | bul |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | word embeddings |
dc.title | Word embeddings CLARIN.SI-embed.bg 1.0 |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | computationalLexicon |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Luka Terčon luka.tercon@gmail.com Faculty of Computer and Information Science, University of Ljubljana |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 2746640 entries |
files.count | 2 |
files.size | 3751647207 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name
- all-token-bg-prelim.ft.sg.vec.zip
- Size
- 893.45 MB
- Format
- application/zip
- Description
- Token embeddings, text format
- MD5
- 1f4b18ae186532369ba43e40a80eb6bf

- Name
- all-token-bg-prelim.ft.sg.bin.zip
- Size
- 2.62 GB
- Format
- application/zip
- Description
- Token embeddings, binary
- MD5
- 5a93e54d5276e9141e8467f7c36d2cfc
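
Since both archives are large and listed with MD5 checksums, a downloaded copy can be checked against the values above before unpacking. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_download` are illustrative and not part of this record, and it assumes the downloaded files keep the names shown in the listing.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "all-token-bg-prelim.ft.sg.vec.zip": "1f4b18ae186532369ba43e40a80eb6bf",
    "all-token-bg-prelim.ft.sg.bin.zip": "5a93e54d5276e9141e8467f7c36d2cfc",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name.

    Hypothetical helper: raises KeyError if the file name is not one of
    the two archives listed in this record.
    """
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"No listed checksum for {path.name}")
    return md5_of_file(path) == expected
```

Calling `verify_download("all-token-bg-prelim.ft.sg.vec.zip")` in the download directory would return `True` only when the local file's digest matches the listed one; a `False` result usually points to an incomplete or corrupted download.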