dc.contributor.author Koloski, Boshko
dc.contributor.author Martinc, Matej
dc.contributor.author Tavchioski, Ilija
dc.contributor.author Škrlj, Blaž
dc.contributor.author Pollak, Senja
dc.date.accessioned 2022-06-30T12:51:05Z
dc.date.available 2022-06-30T12:51:05Z
dc.date.issued 2022-03-28
dc.identifier.uri http://hdl.handle.net/11356/1495
dc.description The dataset consists of 7514 Slovenian news articles from the SentiNews 1.0 corpus by Bučar et al. 2017 (http://hdl.handle.net/11356/1110) for which article keywords were available. We provide train and test splits (5995 articles for training and 1519 for testing) that can be used for keyword extraction experiments. Each split is a JSON file with the following fields: title, keywords, lang (always Slovene), and body (the content of the article). We addressed keyword extraction in a cross-lingual setting in our paper: Koloski, Boshko, et al. "Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?" arXiv preprint arXiv:2202.06650 (2022) [https://arxiv.org/pdf/2202.06650.pdf]. To reproduce the results, you can use the keyword datasets from http://hdl.handle.net/11356/1403, described in: Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://candas.ijs.si/
dc.subject keyword extraction
dc.subject news corpus
dc.subject Slovenian news articles
dc.title Slovenian keyword extraction dataset from SentiNews 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://github.com/bkolosk1/CrossLingualKeywords
contact.person Boshko Koloski boshko.koloski@ijs.si Boshko Koloski
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
sponsor ARRS (Slovenian Research Agency) J6-2581 Računalniško podprta večjezična analiza novičarskega diskurza s kontekstualnimi besednimi vložitvami (Computer-assisted multilingual news discourse analysis with contextual embeddings) nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
size.info 7514 articles
files.count 2
files.size 6346801
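
The description names the fields of each record (title, keywords, lang, body) but does not say whether a split is stored as one JSON array or as one JSON object per line. The following minimal sketch, with the hypothetical helper name `load_records`, tries both layouts; the fallback to line-by-line parsing is an assumption, not something this record confirms.

```python
import gzip
import json


def load_records(path):
    """Load article records from a gzipped JSON file.

    Assumption: the file is either a single JSON document (typically a
    list of records) or JSON Lines (one record per line). The record
    does not state which layout is used, so both are attempted.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse each non-empty line on its own.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # Wrap a single top-level object so callers always receive a list.
    return data if isinstance(data, list) else [data]
```

A call such as `load_records("slovenian_train.json.gz")` should then return the training records. The description states 5995 training and 1519 test articles, so comparing the list lengths against those numbers is a quick sanity check.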


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name slovenian_test.json.gz
Size 1.24 MB
Format application/gzip
Description Test split of the keyword extraction subcorpus.
MD5 09c67c29f560f45233b7db8978d6f879

Name slovenian_train.json.gz
Size 4.81 MB
Format application/gzip
Description Train split of the keyword extraction subcorpus.
MD5 03019fde5fde0219e6e6249f58d358f6
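
The MD5 values listed for the two files can be used to check a download for corruption. Below is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are hypothetical, and the checksums are copied from the listing above.

```python
import hashlib

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "slovenian_test.json.gz": "09c67c29f560f45233b7db8978d6f879",
    "slovenian_train.json.gz": "03019fde5fde0219e6e6249f58d358f6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Compare the digest of the file at `path` with the listed value for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("slovenian_train.json.gz", "slovenian_train.json.gz")` should return `True` for an intact copy of the training split, assuming the file was saved under its original name in the working directory.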
