dc.contributor.author | Koloski, Boshko |
dc.contributor.author | Martinc, Matej |
dc.contributor.author | Tavchioski, Ilija |
dc.contributor.author | Škrlj, Blaž |
dc.contributor.author | Pollak, Senja |
dc.date.accessioned | 2022-06-30T12:51:05Z |
dc.date.available | 2022-06-30T12:51:05Z |
dc.date.issued | 2022-03-28 |
dc.identifier.uri | http://hdl.handle.net/11356/1495 |
dc.description | The dataset consists of 7514 Slovenian news articles from the SentiNews 1.0 corpus by Bučar et al. 2017 (http://hdl.handle.net/11356/1110) for which article keywords were available. We provide train and test splits (5995 articles for training and 1519 for testing) that can be used for keyword extraction experiments. Each split is distributed as a JSON file containing the following fields: title, keywords, lang (always Slovene) and body (the content of the article). We addressed keyword extraction in a cross-lingual setting in our paper: Koloski, Boshko, et al. "Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?" arXiv preprint arXiv:2202.06650 (2022). [https://arxiv.org/pdf/2202.06650.pdf] To reproduce the results, you can use the keyword datasets available at http://hdl.handle.net/11356/1403, described in: Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://candas.ijs.si/ |
dc.subject | keyword extraction |
dc.subject | news corpus |
dc.subject | Slovenian news articles |
dc.title | Slovenian keyword extraction dataset from SentiNews 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://github.com/bkolosk1/CrossLingualKeywords |
contact.person | Boshko Koloski boshko.koloski@ijs.si Boshko Koloski |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
sponsor | ARRS (Slovenian Research Agency) J6-2581 Računalniško podprta večjezična analiza novičarskega diskurza s kontekstualnimi besednimi vložitvami nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
size.info | 7514 articles |
files.count | 2 |
files.size | 6346801 |
Files in this item
Name | slovenian_test.json.gz |
Size | 1.24 MB |
Format | application/gzip |
Description | Test split of the keyword extraction subcorpus. |
MD5 | 09c67c29f560f45233b7db8978d6f879 |

Name | slovenian_train.json.gz |
Size | 4.81 MB |
Format | application/gzip |
Description | Train split of the keyword extraction subcorpus. |
MD5 | 03019fde5fde0219e6e6249f58d358f6 |
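
The checksums and field names listed in this record can be used to sanity-check a download before running experiments. The following is a minimal sketch, not part of the original record. The helper names (`md5_of_file`, `load_records`, `verify_and_load`) are hypothetical, and the record does not say whether each file holds a single JSON array or one JSON object per line, so the loader tries both as an assumption.

```python
import gzip
import hashlib
import json
import os

# MD5 checksums as listed in this record for the two files.
EXPECTED_MD5 = {
    "slovenian_test.json.gz": "09c67c29f560f45233b7db8978d6f879",
    "slovenian_train.json.gz": "03019fde5fde0219e6e6249f58d358f6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_records(path):
    """Load article records from a gzipped JSON file.

    Assumption: the file is either one JSON array of objects or one JSON
    object per line (JSON Lines); the record does not specify which.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse every non-empty line separately.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def verify_and_load(path):
    """Check the file against the listed MD5 (if known), then load it."""
    expected = EXPECTED_MD5.get(os.path.basename(path))
    if expected is not None and md5_of_file(path) != expected:
        raise ValueError(f"MD5 mismatch for {path}")
    return load_records(path)
```

With both files in the working directory, `verify_and_load("slovenian_train.json.gz")` would be expected to return a list of records with the fields title, keywords, lang and body. How the keywords field is encoded (list or delimited string) is not stated in this record and should be inspected after loading.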