Show simple item record
dc.contributor.author |
Žitko, Branko |
dc.contributor.author |
Gašpar, Angelina |
dc.contributor.author |
Bročić, Lucija |
dc.contributor.author |
Vasić, Daniel |
dc.date.accessioned |
2023-04-24T12:38:31Z |
dc.date.available |
2023-04-24T12:38:31Z |
dc.date.issued |
2023-04-24 |
dc.identifier.uri |
http://hdl.handle.net/11356/1822 |
dc.description |
The NL2SH (Natural Language to Semantic Hypergraph) dataset can be used to build and evaluate methods for knowledge extraction and representation based on a semantic hypergraph. Each sentence has natural language annotations and a dedicated semantic hyperedge. The majority of the sentences in this dataset are taken from the following sources:
* John Eastwood, Oxford Guide to English Grammar, Oxford University Press, 2002.
* Andrew Radford, An Introduction to English Sentence Structure, Cambridge University Press, 2009.
* Philip Gucker, Essential English Grammar, Dover Publications, Inc., New York, 1966.
Natural language annotations are:
* sent_i - id of the sentence
* tok_i - id of the token in the sentence
* word - token text
* space - whether a space follows the token
* lemma - lemma of the token
* pos - Universal POS tags (https://universaldependencies.org/u/pos/)
* tag - Penn Treebank tags (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
* dep - ClearNLP dependency labels (https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md)
* head - id of the token which is a dependency head of the current token
* ner - named entities (https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)
* roleset - roleset of a verb frame (https://propbank.github.io/v3.4.0/frames/)
* srl - semantic role labels with IOB annotation (https://verbs.colorado.edu/propbank/EPB-Annotation-Guidelines.pdf)
* coref - coreference labels with IOB annotation
* synset - WordNet synsets (https://wordnet.princeton.edu)
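As a rough illustration of how the annotation fields listed above might be held in memory, the following Python sketch defines a hypothetical `Token` record and a helper `iob_to_spans` that groups IOB tags into spans. The field types, the class and function names, and the assumption that each token carries a single tag such as `B-ARG0`, `I-ARG0` or `O` are not taken from the dataset documentation; the actual layout of nl2sh_dataset.txt should be checked before relying on this.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """One annotated token; field names mirror the annotation list,
    field types are assumptions."""
    sent_i: int
    tok_i: int
    word: str
    space: bool
    lemma: str
    pos: str
    tag: str
    dep: str
    head: int
    ner: str
    roleset: str
    srl: str
    coref: str
    synset: str


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB tags into (label, start, end) spans, end exclusive.

    Assumes one tag per token, e.g. "B-ARG0", "I-ARG0" or "O".
    """
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    return spans
```

For example, `iob_to_spans(["B-ARG0", "I-ARG0", "O", "B-V"])` yields one span covering tokens 0 and 1 labelled `ARG0` and one span covering token 3 labelled `V`.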
The annotations for semantic hypergraph elements primarily adhere to the annotation guidelines of the Graphbrain project (https://graphbrain.net/manual/notation.html). However, atom annotations are modified and contain the following at the end:
* label,
* type and optional subtype,
* type specific atom roles,
* type specific additional information,
* named entity |
dc.language.iso |
eng |
dc.publisher |
Faculty of Science University of Split |
dc.rights |
CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0 |
dc.rights.uri |
https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0 |
dc.rights.label |
ACA |
dc.source.uri |
https://www.acnltutor.net |
dc.subject |
semantic hypergraph |
dc.subject |
natural language processing |
dc.title |
Natural Language 2 Semantic Hypergraph Dataset NL2SH 1.0 |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
yes |
branding |
CLARIN.SI data & tools |
demo.uri |
https://github.com/bzitko/nl2sh_repo |
contact.person |
Branko Žitko bzitko@pmfst.hr Faculty of Science University of Split |
sponsor |
Office of Naval Research N00014-20-1-2066 Enhancing Adaptive Courseware based on Natural Language Processing nationalFunds |
size.info |
664 sentences |
size.info |
6851 tokens |
size.info |
5968 words |
files.count |
1 |
files.size |
1096051 |
Files in this item
- Name
- nl2sh_dataset.txt
- Size
- 1.05 MB
- Format
- Text file
- Description
- Dataset in TXT format
- MD5
- 8ea669eef7103a307496997db3ae4600