dc.contributor.author Kosem, Iztok
dc.contributor.author Gantar, Polona
dc.contributor.author Roblek, Rebeka
dc.contributor.author Zgaga, Karolina
dc.date.accessioned 2023-11-25T15:51:15Z
dc.date.available 2023-11-25T15:51:15Z
dc.date.issued 2023-11-24
dc.identifier.uri http://hdl.handle.net/11356/1903
dc.description This resource contains 713,310 collocation candidates that were automatically extracted from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320) and manually annotated as to whether they are legitimate collocations. The candidates belong to three syntactic structures that are among the most common and semantically most informative collocational structures in Slovenian:
- Verb + Noun in accusative (Structure_ID = 23; Structure_name = gg-s4;#_1_#-1_2_dve). Contains 163,229 annotated collocation candidates.
- Adjective + Noun (Structure_ID = 34; Structure_name = p0-s0;2_1_dol-#_2_#). Contains 342,714 annotated collocation candidates.
- Noun + Noun in genitive (Structure_ID = 53; Structure_name = s0-s2;#_1_#-1_2_dol). Contains 207,367 annotated collocation candidates.
Structure IDs and structure names are given as used in the Digital Dictionary Database at the Centre for Language Resources and Technologies at the University of Ljubljana (https://www.cjvt.si/en/).
Three annotation decisions were possible:
a) YES. The collocation candidate is a legitimate collocation, i.e., it is statistically relevant, represents the right syntactic structure, and is a meaningful but semantically transparent word combination.
b) EXTENDED. The collocation candidate may be considered a collocation, but it always or in most cases requires a third element.
c) NO. The collocation candidate is not a collocation, for example because of a problem in lemmatisation or morphosyntactic annotation, or because the candidate is a compound, a phrase, or some other multiword unit.
The annotation did not consider collocation relevance, e.g., which collocations would make it into a dictionary or a related resource; we see this as a next step in using the data. However, relevance was partly built into the selection method: the candidates were selected using noun, adjective and verb headwords from the Collocation Dictionary of Modern Slovene 1.0 (http://hdl.handle.net/11356/1250), taking up to the top 30 collocations with a minimum frequency of 4 for each headword per syntactic structure.
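The selection rule described above (up to the top 30 collocations with a minimum frequency of 4 for each headword per syntactic structure) can be sketched as follows. This is only an illustration under stated assumptions: the record does not say which measure was used to rank the "top" collocations, so the `score` field, the `Candidate` type and the `select_candidates` function are hypothetical names and do not come from the resource or from the tools used to build it.

```python
from collections import defaultdict
from typing import NamedTuple


class Candidate(NamedTuple):
    """One extracted candidate (hypothetical layout, not the resource's TSV schema)."""
    headword: str
    structure_id: int   # e.g. 23, 34 or 53
    collocate: str
    frequency: int      # corpus frequency of the candidate
    score: float        # assumed ranking score; the actual measure is not stated


def select_candidates(candidates, top_n=30, min_frequency=4):
    """Keep up to `top_n` best-scored candidates per (headword, structure) pair,
    discarding any candidate whose frequency is below `min_frequency`."""
    groups = defaultdict(list)
    for cand in candidates:
        if cand.frequency >= min_frequency:
            groups[(cand.headword, cand.structure_id)].append(cand)

    selected = []
    for key in sorted(groups):
        # Rank within each group by the assumed score, best first.
        ranked = sorted(groups[key], key=lambda c: c.score, reverse=True)
        selected.extend(ranked[:top_n])
    return selected
```

Grouping by the pair (headword, structure) rather than by headword alone reflects the "per syntactic structure" wording: the same headword may contribute up to 30 candidates in each of the three structures.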
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.si/info/services/projects/#CLARINSI_project_reports_2023
dc.subject collocations
dc.subject manual annotation
dc.subject syntactic structures
dc.title Annotated collocation candidates for three common syntactic structures in Slovene
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Iztok Kosem iztok.kosem@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
size.info 713310 collocations
files.count 1
files.size 10175210


 Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name annotated-collocations.zip
Size 9.7 MB
Format application/zip
Description Annotated collocations as TSV files
MD5 a936e9919fecaaeea8df14a8404d530d
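The MD5 value and file size listed in this record can be used to check a downloaded copy of the archive. The sketch below is a minimal example, assuming that files.size is a byte count and that the archive has been saved locally; the function names and the local path passed in are hypothetical.

```python
import hashlib
from pathlib import Path

# Values copied from this record.
EXPECTED_MD5 = "a936e9919fecaaeea8df14a8404d530d"
EXPECTED_SIZE = 10175210  # files.size, assumed to be in bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file at `path` matches the expected size and MD5."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5
```

For example, `verify_download("annotated-collocations.zip")` would return True only for a byte-identical copy of the published archive.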