dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2020-09-11T11:50:00Z |
dc.date.available | 2020-09-11T11:50:00Z |
dc.date.issued | 2020-09-11 |
dc.identifier.uri | http://hdl.handle.net/11356/1348 |
dc.description | The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.1. The difference from the previous version of the model is that the whole XPOS tag is now predicted, rather than its individual characters as was the case in stanfordnlp; the character-wise prediction resulted in illegal XPOS tags (and slightly decreased performance). |
dc.language.iso | hrv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | http://dx.doi.org/10.18653/v1/W19-3704 |
dc.relation.replaces | http://hdl.handle.net/11356/1252 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1393 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/clarinsi/classla-stanfordnlp |
dc.subject | language model |
dc.subject | part-of-speech tagging |
dc.title | The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croatian 1.1 |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-8280 FRENK: Resources, methods, and tools for the understanding, identification, and classification of various forms of socially unacceptable discourse in the information society nationalFunds |
sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
files.count | 2 |
files.size | 854266899 |
Files in this item

- Name: hr500k
- Size: 78.41 MB
- Format: Unknown
- Description: Language model
- MD5: 61e7e97c217ff5063eebaf7cb0b6b86d

- Name: hr500k.pretrain.pt.zip
- Size: 736.28 MB
- Format: application/zip
- Description: Pretrained word embeddings
- MD5: 72bb027749ddd63d1b59c5b1951215f7