Slovene corpus for general relation extraction SloREL 1.0

Name: Slovene corpus for general relation extraction SloREL 1.0
License: https://creativecommons.org/licenses/by/4.0/

Štravs, Miha; Knez, Timotej; Žitnik, Slavko


dc.contributor.author	Štravs, Miha
dc.contributor.author	Knez, Timotej
dc.contributor.author	Žitnik, Slavko
dc.date.accessioned	2022-09-21T09:11:35Z
dc.date.available	2022-09-21T09:11:35Z
dc.date.issued	2022-09-15
dc.identifier.uri	http://hdl.handle.net/11356/1685
dc.description	The SloREL corpus contains annotations for training relation extraction models on Slovene documents. It consists of documents from Slovene Wikipedia with annotated entities and relations. The annotations were constructed with a semi-supervised process based on linking the documents to the WikiData knowledge graph. The Wikipedia part contains 244,437 sentences. An additional 896 sentences collected from the 24ur.com news website have annotated and linked entities but no annotated relations; they are meant for additional testing of models. The entities in the corpus are linked to entities in the WikiData knowledge graph, which is useful for models that draw on additional knowledge from a knowledge graph. Altogether, the corpus comprises 245,333 sentences with 813,952 relations and 1,616,193 entities. The corpus consists of the following files:
- schema-definition.xsd: defines the structure of the XML documents containing relation annotations
- wikipedia-train.xml: training portion of the Wikipedia corpus
- wikipedia-test.xml: testing portion of the Wikipedia corpus
- wikipedia-validation.xml: validation portion of the Wikipedia corpus
- 24ur.xml: additional sentences from the 24ur.com news articles
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.relation.isreferencedby	http://hdl.handle.net/20.500.12556/RUL-138295
dc.relation.isreplacedby	http://hdl.handle.net/11356/1730
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.subject	Wikipedia
dc.subject	semi-supervised
dc.subject	semantic relations
dc.title	Slovene corpus for general relation extraction SloREL 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
hidden	hidden
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Timotej Knez timotej.knez@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
size.info	245333 sentences
size.info	41673426 bytes
files.count	1
files.size	41673426
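
The description above lists several XML files but does not spell out their element names, which are defined in schema-definition.xsd. As a first look at a file such as wikipedia-train.xml, one option is to stream through it and count how often each element tag occurs. The following is a minimal sketch using only the Python standard library; the function name tag_histogram and the tiny in-memory sample (including its tag names) are hypothetical and are not taken from the corpus.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tag_histogram(source):
    """Count element tags in an XML file or file-like object, streaming."""
    counts = Counter()
    # iterparse yields each element once its closing tag has been read,
    # so large files do not have to be loaded into memory at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop a "{namespace}" prefix if the document uses namespaces.
        tag = elem.tag.rsplit("}", 1)[-1]
        counts[tag] += 1
        # Free the element's text and children after it has been counted.
        elem.clear()
    return counts


# Hypothetical stand-in document; the real tag names come from
# schema-definition.xsd and may look nothing like this.
SAMPLE = b"<corpus><doc><s>a</s><s>b</s></doc></corpus>"
sample_counts = tag_histogram(io.BytesIO(SAMPLE))
print(sample_counts.most_common())
```

To inspect the actual data, the same function could be called with a path to one of the extracted files, for example tag_histogram("wikipedia-train.xml"), assuming the archive has been unpacked into the working directory.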