dc.contributor.author | Jemec Tomazin, Mateja |
dc.contributor.author | Podpečan, Vid |
dc.contributor.author | Pollak, Senja |
dc.contributor.author | Thi Hong Tran, Hanh |
dc.contributor.author | Fajfar, Tanja |
dc.contributor.author | Atelšek, Simon |
dc.contributor.author | Sitar, Jera |
dc.contributor.author | Žagar Karer, Mojca |
dc.date.accessioned | 2023-05-19T11:31:58Z |
dc.date.available | 2023-05-19T11:31:58Z |
dc.date.issued | 2023-05-19 |
dc.identifier.uri | http://hdl.handle.net/11356/1841 |
dc.description | The Slovene Definition Extraction evaluation datasets RSDO-def contain sentences extracted from the Corpus of term-annotated texts RSDO5 1.1 (http://hdl.handle.net/11356/1470), which contains texts with annotated terms from four domains: biomechanics, linguistics, chemistry, and veterinary science. The file and sentence identifiers are the same as in the original RSDO5 corpus. Each sentence in the dataset carries one of three labels: 0: Non-definition; 1: Weak definition; 2: Definition. The dataset consists of two parts: (1) RSDO-def-random, built by random sampling, with 14 definitions, 98 weak definitions and 849 non-definitions; (2) RSDO-def-larger, which extends the random sample with sentences found by the pattern-based definition extraction method presented in Pollak (2014), and contains 169 definitions, 214 weak definitions and 872 non-definitions. Both parts were manually annotated by five terminographers. Where annotators disagreed, a consensus was reached and the final label was confirmed by all five annotators. Duplicates were removed in both parts. The annotation criteria are based on the standard ISO 1087-1:2000 (E/F) Terminology Work - Vocabulary, Part 1, Theory and Application, which defines a definition as follows: "Representation of a concept by a descriptive statement which serves to differentiate it from related concepts". The Weak definition label was assigned if the extracted sentence contained a term and at least one delimiting feature without a superordinate concept, or if it contained a superordinate concept without delimiting features but with some typical examples. The Non-definition label was assigned if the sentence with the extracted concept did not contain any information about the concept or its delimiting features. The dataset is described in more detail in Tran et al. (2023), where it was used for evaluating definition extraction approaches. 
If you use this resource, please cite: Tran, T.H.H., Podpečan, V., Jemec Tomazin, M., Pollak, S. (2023). Definition Extraction for Slovene: Patterns, Transformer Classifiers and ChatGPT. Proceedings of the ELEX 2023: Electronic lexicography in the 21st century. Invisible lexicography: everywhere lexical data is used without users realizing they make use of a “dictionary” (accepted). Reference for the pattern-based definition extraction method used to create RSDO-def-larger: Pollak, S. (2014). Extracting definition candidates from specialized corpora. Slovenščina 2.0: empirical, applied and interdisciplinary research, 2(1), pp. 1–40. https://doi.org/10.4312/slo2.0.2014.1.1-40. Related resources: - Jemec Tomazin, M. et al. (2021). Corpus of term-annotated texts RSDO5 1.1, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1470. - Podpečan et al. (2023). DF_NDF_wiki_slo: Definition extraction training sets from Wikipedia, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1840. |
dc.language.iso | slv |
dc.publisher | ZRC SAZU |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://rsdo.slovenscina.eu/ |
dc.subject | definitions |
dc.subject | definition extraction |
dc.title | Slovenian Definition Extraction evaluation datasets RSDO-def 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Mateja Jemec Tomazin; mateja.jemec-tomazin@zrc-sazu.si; Znanstvenoraziskovalni center SAZU |
sponsor | Ministry of Culture; C3340-20-278001; Development of Slovene in a Digital Environment; Other |
sponsor | ARRS (Slovenian Research Agency); P2-103; Knowledge Technologies; nationalFunds |
size.info | 2216 sentences |
files.count | 2 |
files.size | 502996 |
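
The label counts given in dc.description can be cross-checked against size.info. The following is a minimal sketch in Python: the numbers are copied from the description above, while the names `REPORTED_COUNTS` and `part_totals` are illustrative and not part of the dataset.

```python
# Label counts as reported in dc.description.
# Label ids: 0 = Non-definition, 1 = Weak definition, 2 = Definition.
REPORTED_COUNTS = {
    "RSDO-def-random": {2: 14, 1: 98, 0: 849},
    "RSDO-def-larger": {2: 169, 1: 214, 0: 872},
}


def part_totals(counts):
    """Sum the per-label counts of each dataset part."""
    return {part: sum(by_label.values()) for part, by_label in counts.items()}


totals = part_totals(REPORTED_COUNTS)
grand_total = sum(totals.values())

for part, total in totals.items():
    print(f"{part}: {total} sentences")
print(f"Both parts together: {grand_total} sentences")
```

The two parts sum to 961 and 1255 sentences, which together give the 2216 sentences stated in size.info.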
Files in this item

- Name: rsdo_def_random.csv
- Size: 218.96 KB
- Format: CSV file
- Description: Random sampling, CSV format
- MD5: f9d686371e5fe6a1f6a12082adff29fa

- Name: rsdo_def_larger.csv
- Size: 272.25 KB
- Format: CSV file
- Description: Pattern-based sample, CSV format
- MD5: 05f3590dd8211919081450cc856a0d10
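
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library. It assumes the two CSV files have been saved under their listed names in a local directory, and the helper names `md5_of_file` and `verify_downloads` are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed for the two files in this record.
EXPECTED_MD5 = {
    "rsdo_def_random.csv": "f9d686371e5fe6a1f6a12082adff29fa",
    "rsdo_def_larger.csv": "05f3590dd8211919081450cc856a0d10",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the given directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.exists():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, status in verify_downloads().items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

A mismatch usually points to an incomplete or altered download, in which case the file should be fetched again from the repository.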