dc.contributor.author Terčon, Luka
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Osenova, Petya
dc.contributor.author Simov, Kiril
dc.date.accessioned 2023-06-29T05:52:16Z
dc.date.available 2023-06-29T05:52:16Z
dc.date.issued 2023-06-27
dc.identifier.uri http://hdl.handle.net/11356/1850
dc.description This model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank training corpus (https://clarino.uib.no/korpuskel/corpora) and using the Bulgarian inflectional lexicon (Popov, Simov, and Vidinska 1998). The estimated F1 of the lemma annotations is ~98.93. It differs from the previous version of the lemmatizer in that it was trained using the new version of the Bulgarian word embeddings.
dc.language.iso bul
dc.publisher Jožef Stefan Institute
dc.publisher IICT-BAS
dc.relation.isreferencedby http://dx.doi.org/10.18653/v1/W19-3704
dc.relation.replaces http://hdl.handle.net/11356/1353
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/clarinsi/classla
dc.subject lemmatisation
dc.subject language model
dc.title The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
contact.person Luka Terčon luka.tercon@gmail.com Faculty of Computer and Information Science, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Ministry of Education and Science Republic of Bulgaria DO01-272/16.12.2019 Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies CLaDA-BG nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
files.count 1
files.size 55523053


Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Name baseline_lemma_lemmatizer.zip
Size 52.95 MB
Format application/zip
Description Language model
MD5 c410dcb4106f9db6908a37620b939d6b
Contents baseline_lemma_lemmatizer.pt
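
The record lists both a byte count (files.size 55523053) and an MD5 checksum for the archive, so a downloaded copy can be checked before use. The following is a minimal sketch using only the Python standard library. The function names and the local file path are hypothetical and not part of the repository or of CLASSLA-Stanza; only the checksum and size values are taken from this record.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 field and files.size).
EXPECTED_MD5 = "c410dcb4106f9db6908a37620b939d6b"
EXPECTED_SIZE = 55523053  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Hypothetical helper: compare size first (cheap), then the MD5 digest."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

Assuming the archive was saved locally under its listed name, `verify_download("baseline_lemma_lemmatizer.zip")` should return `True` for an intact copy. The displayed size of 52.95 MB corresponds to 55523053 bytes divided by 2^20.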
