dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Bren, Urban |
dc.contributor.author | Robnik-Šikonja, Marko |
dc.contributor.author | Udovič, Boštjan |
dc.date.accessioned | 2018-08-18T12:09:21Z |
dc.date.available | 2018-08-18T12:09:21Z |
dc.date.issued | 2018-08-18 |
dc.identifier.uri | http://hdl.handle.net/11356/1198 |
dc.description | The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses in chemistry, computer science and political science. The candidates are of length 1 to 4 and were extracted using morphosyntactic patterns and a frequency threshold of 3. Four annotators labelled each candidate as (1) an in-domain term, (2) an out-of-domain term, (3) a general academic term or (4) not a term. Each candidate is also annotated with its frequency in the PhD thesis and 7 statistical measures. The resource can serve as a training set for supervised learning of term extraction and as a benchmark for terminology extraction tools. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://nl.ijs.si/kas/ |
dc.subject | terminology |
dc.subject | manual annotation |
dc.title | Terminology identification dataset KAS-term 1.0 |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | terminologicalResource |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://github.com/clarinsi/kas-term |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-7094 Slovene scientific texts: resources and description nationalFunds |
size.info | 22950 entries |
files.count | 4 |
files.size | 18098504 |
Files in this item

- Name
- kas.term.json
- Size
- 13.47 MB
- Format
- Unknown
- Description
- Lexicon in JSON format
- MD5
- d162802ca09cd12d3b624d353ec22ed9

- Name
- kas.term.csv
- Size
- 3.47 MB
- Format
- CSV file
- Description
- Lexicon in CSV format
- MD5
- 434795ea3191e24c1627f7a28726cd20

- Name
- kas.term.txt
- Size
- 1.42 KB
- Format
- Text file
- Description
- Attribute descriptions
- MD5
- 3b4ea1dfab0b7bd725f254b3c02cd9de
Attribute descriptions
document_id - ID of the document (PhD thesis) the term candidate is extracted from
area - One of the three scientific areas the PhD thesis covers (Kemija: Chemistry, Politologija: Political Science, Računalništvo: Computer Science)
annotation_round - The annotation round the term candidate was annotated
lemma_sequence - Sequence of lemmas of the term candidate
most_frequent_sequence - Sequence of most frequent tokens of the term candidate (does not have to be the canonical form)
pattern - Morphosyntactic pattern the term candidate satisfies
length - Length of the term candidate
annotator_1 - Response of annotator 1 (annotator number is a pseudoidentifier of a human annotator throughout one area, different annotators were used for each area)
annotator_2 - Response of annotator 2 (t_termin: term, x_izvenpodročni: out-of-domain term, z_znanstveno: scientific term, n_nerelevantno: no term)
annotator_3 - Response of annotator 3
annotator_4 - Response of annotator 4
f . . .
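The four annotator columns described above can be combined into a single label per candidate. The following is a minimal sketch, not the authors' procedure: it assumes that kas.term.csv has a header row using the attribute names listed above and is comma-delimited (neither is confirmed by this record), and the helper names `majority_label` and `label_rows` are hypothetical. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Column names as given in the attribute descriptions.
ANNOTATOR_COLUMNS = ["annotator_1", "annotator_2", "annotator_3", "annotator_4"]


def majority_label(row):
    """Return the label chosen by more than half of the annotators, else None."""
    votes = Counter(row[col] for col in ANNOTATOR_COLUMNS)
    label, count = votes.most_common(1)[0]
    return label if count > len(ANNOTATOR_COLUMNS) / 2 else None


def label_rows(handle, delimiter=","):
    """Read candidates from an open CSV handle and pair each with its majority label."""
    reader = csv.DictReader(handle, delimiter=delimiter)  # delimiter is an assumption
    return [(row["lemma_sequence"], majority_label(row)) for row in reader]


# Invented rows for illustration only; the real file has more columns.
SAMPLE = (
    "lemma_sequence,annotator_1,annotator_2,annotator_3,annotator_4\n"
    "example candidate a,t_termin,t_termin,t_termin,n_nerelevantno\n"
    "example candidate b,t_termin,t_termin,n_nerelevantno,n_nerelevantno\n"
)

if __name__ == "__main__":
    for lemma, label in label_rows(io.StringIO(SAMPLE)):
        print(lemma, label)
```

A two-against-two split yields `None` here, so ties are left for the caller to resolve. To run against the real file, pass `open("kas.term.csv", encoding="utf-8", newline="")` instead of the in-memory sample and adjust the delimiter if needed.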

- Name
- Navodila_za_ocenjevanje_terminoloskih_kandidatov_KAS.pdf
- Size
- 331.51 KB
- Format
- PDF file
- Description
- Guidelines for annotation of term candidates (in Slovenian)
- MD5
- a3ee1395fc0557872d5e33bd94af25d9
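
The MD5 values listed for each file can be used to check a download. A minimal sketch using Python's standard `hashlib`, assuming the four files have been saved under their listed names in one directory; the helper names `md5_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "kas.term.json": "d162802ca09cd12d3b624d353ec22ed9",
    "kas.term.csv": "434795ea3191e24c1627f7a28726cd20",
    "kas.term.txt": "3b4ea1dfab0b7bd725f254b3c02cd9de",
    "Navodila_za_ocenjevanje_terminoloskih_kandidatov_KAS.pdf": "a3ee1395fc0557872d5e33bd94af25d9",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    for name, status in verify().items():
        print(name, {True: "OK", False: "MISMATCH", None: "missing"}[status])
```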