CorefUD conversion of Slovene corpus for aspect-based sentiment analysis SentiCoref

License: https://creativecommons.org/licenses/by-sa/4.0/

Klemen, Matej; Žitnik, Slavko


dc.contributor.author	Klemen, Matej
dc.contributor.author	Žitnik, Slavko
dc.date.accessioned	2024-11-21T08:11:50Z
dc.date.available	2024-11-21T08:11:50Z
dc.date.issued	2024-11-17
dc.identifier.uri	http://hdl.handle.net/11356/1990
dc.description	This corpus is the CorefUD conversion of the SentiCoref corpus for coreference resolution in Slovene, which is contained within the SUK 1.1 collection of corpora (http://hdl.handle.net/11356/1959). SentiCoref contains 756 documents annotated with coreference information. Coreference in Universal Dependencies (CorefUD) is an initiative to collect coreference corpora in various languages and harmonize them to the same scheme and data format (CoNLL-U). The coreference information is stored in the MISC column. More concretely, the start and end of each coreference mention are marked with the "Entity=" attribute. For example, "Entity=(e0" marks the start of a mention of the entity e0 at the current token, while "Entity=e0)" marks the end of a mention of the entity e0 at the current token. For full details on the format, please see http://hdl.handle.net/11234/1-5478. To ensure compliance with the CoNLL-U format, the corpus was automatically annotated with trankit v1.1.2 to obtain universal part-of-speech tags (UPOS) and dependencies (head, dependency relation), while the remaining annotations (lemmas, XPOS - MULTEXT-East V6, features) were copied from the SUK 1.1 resource. To enable integration into the SloBENCH evaluation framework (https://slobench.cjvt.si/), we release the labeled SentiCoref corpus (training set) and an unlabeled test set. To prevent accidental data leaks, the test set labels are not publicly released and are only indirectly accessible via the SloBENCH evaluation framework. In comparison to the original SentiCoref corpus, this release contains the same texts and coreference information in a different (more universal) format. Additionally, it contains 81 unlabeled private test set texts.
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.subject	coreference resolution
dc.subject	corefud
dc.subject	conllu
dc.title	CorefUD conversion of Slovene corpus for aspect-based sentiment analysis SentiCoref
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Matej Klemen matej.klemen@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	837 texts
files.count	3
files.size	37578000

Files in this item


This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: senticoref_private_corefud_unlabeled.conllu
Size: 2.77 MB
Format: Unknown
Description: Unlabeled SentiCoref test set in CoNLL-U format.
MD5: 3edae210479e8e79b1fef9f968cb91a4


Name: senticoref_corefud.conllu
Size: 33.06 MB
Format: Unknown
Description: Labeled SentiCoref training set in CoNLL-U format.
MD5: 0e9d5f3fbe4698cc96c3f92a2dd4f7fb


Name: README.txt
Size: 2.18 KB
Format: Text file
Description: Description of the resource.
MD5: e9d91eea5c52934731d0cb1d1f955832
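
The MD5 checksums listed above can be used to check that the downloaded files are intact. The following is a minimal sketch in Python using only the standard library. The function names md5_of_file and verify_downloads, and the assumption that the files sit in one local directory under their original names, are illustrative and not part of the resource.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "senticoref_private_corefud_unlabeled.conllu": "3edae210479e8e79b1fef9f968cb91a4",
    "senticoref_corefud.conllu": "0e9d5f3fbe4698cc96c3f92a2dd4f7fb",
    "README.txt": "e9d91eea5c52934731d0cb1d1f955832",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Return {file name: True/False}; False means missing or checksum mismatch.

    Assumes (hypothetically) that the files were saved under their original names.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Reading in chunks keeps memory use low for the 33 MB training file; a mismatch usually indicates an incomplete download.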


File Preview

CorefUD conversion of Slovene corpus for aspect-based sentiment analysis SentiCoref
v1.0
http://hdl.handle.net/11356/1990
CC-BY-SA 4.0

This corpus is the CorefUD conversion of the SentiCoref corpus for coreference resolution in Slovene contained within the SUK 1.1 collection of corpora (http://hdl.handle.net/11356/1959).
The item contains 756 labeled training (senticoref_corefud.conllu) and 81 unlabeled test documents (senticoref_private_corefud_unlabeled.conllu) annotated with coreference information.

Coreference in Universal Dependencies (CorefUD) is an initiative to collect coreference corpora in various languages and harmonize them to the same scheme and data format (CoNLL-U).
The coreference information is stored in the MISC column. More concretely, the start and end of each coreference mention are marked with the "Entity=" attribute. For example, "Entity=(e0" marks the start of the entity e0 at the current token while "Entity=e0)" marks the end of the entity e0 at the current tok . . .
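
To illustrate how the "Entity=" attribute can be read, here is a minimal Python sketch that collects mention spans from CoNLL-U text. It is not the official CorefUD tooling and makes several assumptions: the function names entity_events and collect_mentions are hypothetical, the entity id is taken to be the text before the first "-" after an opening bracket, a bare ")" is taken to close the mention opened on the same token, and discontinuous mentions are not handled. The sample sentence is invented for the example and is not taken from the corpus. For the authoritative format description, see http://hdl.handle.net/11234/1-5478.

```python
import re
from collections import defaultdict

# "(e0..." opens a mention, "e0)" closes one, and a bare ")" closes the
# mention that was just opened on the same token (single-token mention).
_BRACKET = re.compile(r"\((?P<open>[^()]+)|(?P<close>[^()]+)\)|(?P<bare>\))")


def entity_events(misc):
    """Return a list of ("open" | "close", entity_id) events for one MISC cell."""
    events = []
    for attr in misc.split("|"):
        if not attr.startswith("Entity="):
            continue
        last_opened = None
        for m in _BRACKET.finditer(attr[len("Entity="):]):
            if m.group("open"):
                # Assumption: the id is the first "-"-separated field.
                last_opened = m.group("open").split("-")[0]
                events.append(("open", last_opened))
            elif m.group("close"):
                events.append(("close", m.group("close")))
            elif last_opened is not None:
                events.append(("close", last_opened))
    return events


def collect_mentions(conllu_text):
    """Map each entity id to its mentions, each mention being a list of word forms."""
    mentions = defaultdict(list)
    open_spans = defaultdict(list)  # entity id -> stack of mentions still open
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
            continue  # skip comments, blank lines, multiword ranges, empty nodes
        events = entity_events(cols[9])
        for kind, eid in events:
            if kind == "open":
                open_spans[eid].append([])
        for spans in open_spans.values():
            for span in spans:
                span.append(cols[1])  # the current token belongs to every open mention
        for kind, eid in events:
            if kind == "close" and open_spans[eid]:
                mentions[eid].append(open_spans[eid].pop())
    return dict(mentions)


# Invented sample: only ID, FORM and MISC are filled in, other columns are "_".
rows = [
    ("1", "Janez", "Entity=(e0"),
    ("2", "Novak", "Entity=e0)"),
    ("3", "pravi", "SpaceAfter=No"),
    ("4", ",", "_"),
    ("5", "da", "_"),
    ("6", "on", "Entity=(e0)"),
    ("7", "pride", "SpaceAfter=No"),
    ("8", ".", "_"),
]
sample = "\n".join("\t".join([i, form] + ["_"] * 7 + [misc]) for i, form, misc in rows)

print(collect_mentions(sample))  # {'e0': [['Janez', 'Novak'], ['on']]}
```

Opening brackets are processed before the token is appended and closing brackets afterwards, so a token that both closes one mention and opens another is counted in both.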
