Show simple item record

 
dc.contributor.author Jemec Tomazin, Mateja
dc.contributor.author Podpečan, Vid
dc.contributor.author Pollak, Senja
dc.contributor.author Thi Hong Tran, Hanh
dc.contributor.author Fajfar, Tanja
dc.contributor.author Atelšek, Simon
dc.contributor.author Sitar, Jera
dc.contributor.author Žagar Karer, Mojca
dc.date.accessioned 2023-05-19T11:31:58Z
dc.date.available 2023-05-19T11:31:58Z
dc.date.issued 2023-05-19
dc.identifier.uri http://hdl.handle.net/11356/1841
dc.description The Slovene Definition Extraction evaluation datasets RSDO-def contain sentences extracted from the Corpus of term-annotated texts RSDO5 1.1 (http://hdl.handle.net/11356/1470), which contains texts with annotated terms from four domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO5 corpus.

The labels added to the sentences denote:
0: Non-definition
1: Weak definition
2: Definition

The dataset consists of two parts:
1. RSDO-def-random was built by random sampling and contains 14 definitions, 98 weak definitions and 849 non-definitions.
2. RSDO-def-larger extends the random sample with sentences found by the pattern-based definition extraction presented in Pollak (2014). It contains 169 definitions, 214 weak definitions and 872 non-definitions.

Both parts were manually annotated by five terminographers. Where annotators disagreed, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed in both parts.

The annotation criteria are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which describes a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". The Weak definition label was assigned if the extracted sentence contained a term and at least one delimiting feature but no superordinate concept, or if it contained a superordinate concept without delimiting features but with some typical examples. A sentence was labeled Non-definition if it did not contain any information about the extracted concept or its delimiting features.

The dataset is described in more detail in Tran et al. (2023), where it was used for evaluating definition extraction approaches.
If you use this resource, please cite:
Tran, T.H.H., Podpečan, V., Jemec Tomazin, M., Pollak, S. (2023). Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT. Proceedings of the ELEX 2023: Electronic lexicography in the 21st century. Invisible lexicography: everywhere lexical data is used without users realizing they make use of a “dictionary” (accepted).

Reference to the pattern-based definition extraction method used for creating RSDO-def-larger:
Pollak, S. (2014). Extracting definition candidates from specialized corpora. Slovenščina 2.0: empirical, applied and interdisciplinary research, 2(1), pp. 1–40. https://doi.org/10.4312/slo2.0.2014.1.1-40

Related resources:
- Jemec Tomazin, M. et al. (2021). Corpus of term-annotated texts RSDO5 1.1, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1470.
- Podpečan et al. (2023). DF_NDF_wiki_slo: Definition extraction training sets from Wikipedia, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1840.
dc.language.iso slv
dc.publisher ZRC SAZU
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://rsdo.slovenscina.eu/
dc.subject definitions
dc.subject definition extraction
dc.title Slovenian Definition Extraction evaluation datasets RSDO-def 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Mateja Jemec Tomazin; mateja.jemec-tomazin@zrc-sazu.si; Znanstvenoraziskovalni center SAZU
sponsor Ministry of Culture; C3340-20-278001; Development of Slovene in a Digital Environment; Other
sponsor ARRS (Slovenian Research Agency); P2-103; Knowledge Technologies; nationalFunds
size.info 2216 sentences
files.count 2
files.size 502996
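
The class counts given in the description can be cross-checked against the downloaded files. The following is a minimal sketch, not part of the original record: the record does not document the CSV layout, so the header row and the column name `label` are assumptions, and `count_labels` and `total_sentences` are illustrative helper names.

```python
import csv
import io
from collections import Counter

# Label meanings as listed in the dataset description.
LABEL_NAMES = {0: "Non-definition", 1: "Weak definition", 2: "Definition"}

# Class counts stated in the description for each part of the dataset.
EXPECTED_COUNTS = {
    "rsdo_def_random.csv": {2: 14, 1: 98, 0: 849},
    "rsdo_def_larger.csv": {2: 169, 1: 214, 0: 872},
}


def count_labels(csv_file, label_column="label"):
    """Count the labels found in an open CSV file object.

    ASSUMPTION: the file has a header row and a column called
    `label_column`; the real column names are not given in this record.
    """
    reader = csv.DictReader(csv_file)
    return Counter(int(row[label_column]) for row in reader)


def total_sentences(expected=EXPECTED_COUNTS):
    """Sum the stated class counts over both parts of the dataset."""
    return sum(sum(counts.values()) for counts in expected.values())
```

The stated counts add up to the 2216 sentences reported in `size.info`. To check a real file, open it with `open(path, newline="", encoding="utf-8")` (the encoding is also an assumption), pass it to `count_labels` with the actual column name, and compare the result with the matching entry in `EXPECTED_COUNTS`.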


Files in this item (491.21 KB in total)

Name: rsdo_def_random.csv
Size: 218.96 KB
Format: CSV file
Description: Random sampling, CSV format
MD5: f9d686371e5fe6a1f6a12082adff29fa

Name: rsdo_def_larger.csv
Size: 272.25 KB
Format: CSV file
Description: Pattern-based sample, CSV format
MD5: 05f3590dd8211919081450cc856a0d10
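
The MD5 values listed for the two files can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `matches_record` are illustrative, and the code assumes the files keep their original names on disk.

```python
import hashlib
import os

# MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "rsdo_def_random.csv": "f9d686371e5fe6a1f6a12082adff29fa",
    "rsdo_def_larger.csv": "05f3590dd8211919081450cc856a0d10",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Compare a local file's MD5 with the value listed for its file name.

    ASSUMPTION: the local file still has its original name, which is
    used as the lookup key; unknown names raise KeyError.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

For example, `matches_record("rsdo_def_random.csv")` should return `True` for an unmodified download placed in the current directory.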
