Slovenian keyword extraction dataset from SentiNews 1.0

Name: Slovenian keyword extraction dataset from SentiNews 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Koloski, Boshko; Martinc, Matej; Tavchioski, Ilija; Škrlj, Blaž; Pollak, Senja

dc.contributor.author	Koloski, Boshko
dc.contributor.author	Martinc, Matej
dc.contributor.author	Tavchioski, Ilija
dc.contributor.author	Škrlj, Blaž
dc.contributor.author	Pollak, Senja
dc.date.accessioned	2022-06-30T12:51:05Z
dc.date.available	2022-06-30T12:51:05Z
dc.date.issued	2022-03-28
dc.identifier.uri	http://hdl.handle.net/11356/1495
dc.description	The dataset consists of 7514 Slovenian news articles from the SentiNews 1.0 corpus by Bučar et al. 2017 (http://hdl.handle.net/11356/1110) for which article keywords were available. We provide train and test splits (5995 articles for training and 1519 for testing) that can be used for keyword extraction experiments. The data is provided in JSON format, with the following fields: title, keywords, lang (always Slovene) and body (the content of the article). Our paper addresses keyword extraction in a cross-lingual setting: Koloski, Boshko, et al. "Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?" arXiv preprint arXiv:2202.06650 (2022). [https://arxiv.org/pdf/2202.06650.pdf] To reproduce the results, you can use the keyword datasets from http://hdl.handle.net/11356/1403, described in: Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.relation	info:eu-repo/grantAgreement/EC/H2020/825153
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	http://candas.ijs.si/
dc.subject	keyword extraction
dc.subject	news corpus
dc.subject	Slovenian news articles
dc.title	Slovenian keyword extraction dataset from SentiNews 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://github.com/bkolosk1/CrossLingualKeywords
contact.person	Boshko Koloski boshko.koloski@ijs.si
sponsor	European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
sponsor	ARRS (Slovenian Research Agency) J6-2581 Računalniško podprta večjezična analiza novičarskega diskurza s kontekstualnimi besednimi vložitvami nationalFunds
sponsor	ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
size.info	7514 articles
files.count	2
files.size	6346801
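
The description above lists four fields per article (title, keywords, lang, body). The following is a minimal Python sketch of how a split might be loaded and checked against those fields. It rests on assumptions not stated in the record: the on-disk layout (a single JSON array or one JSON object per line) is not specified, so the loader tries both, and the helper names and the file name in the usage comment are hypothetical.

```python
import json
from pathlib import Path

# Field names as listed in the dataset description.
FIELDS = ("title", "keywords", "lang", "body")


def load_articles(path):
    """Load article records from a JSON file.

    Assumption: the file is either one JSON array of objects or
    JSON Lines (one object per line); the record does not say which.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    try:
        data = json.loads(text)
        # A single top-level object is treated as a one-record file.
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse each non-empty line separately.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


def missing_fields(records):
    """Return the indices of records that lack any of the listed fields."""
    return [i for i, rec in enumerate(records)
            if not all(field in rec for field in FIELDS)]


# Hypothetical usage (the file name is a placeholder, not taken from the record):
#   train = load_articles("train.json")
#   print(len(train), missing_fields(train)[:5])
```

If the splits match the description, the training file should yield 5995 records and the test file 1519, for 7514 in total.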