<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href='static/style.xsl' type='text/xsl'?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-05-22T00:25:17Z</responseDate><request verb="GetRecord" identifier="oai:www.clarin.si:11356/1403" metadataPrefix="oai_dc">http://www.clarin.si/repository/oai/request</request><GetRecord><record><header><identifier>oai:www.clarin.si:11356/1403</identifier><datestamp>2024-11-06T17:16:17Z</datestamp><setSpec>hdl_11356_1023</setSpec><setSpec>hdl_11356_1024</setSpec></header><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Keyword extraction datasets for Croatian, Estonian, Latvian and Russian 1.0</dc:title>
<dc:creator>Koloski, Boshko</dc:creator>
<dc:creator>Pollak, Senja</dc:creator>
<dc:creator>Škrlj, Blaž</dc:creator>
<dc:creator>Martinc, Matej</dc:creator>
<dc:subject>keyword extraction</dc:subject>
<dc:subject>news corpus</dc:subject>
<dc:description>EACL Hackashop Keyword Challenge Datasets&#xd;
&#xd;
This repository contains the ids of the articles used for the keyword extraction challenge at the&#xd;
EACL Hackashop on News Media Content Analysis and Automated Report Generation (http://embeddia.eu/hackashop2021/). The article ids can be used to reproduce the train-test split used in the paper:&#xd;
&#xd;
Koloski, B., Pollak, S., Škrlj, B., &amp; Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.&#xd;
&#xd;
Train and test splits are provided for Latvian, Estonian, Russian and Croatian.&#xd;
&#xd;
The articles with the corresponding ids can be extracted from the following datasets:&#xd;
- Estonian and Russian (use the eearticles2015-2019 dataset): https://www.clarin.si/repository/xmlui/handle/11356/1408&#xd;
- Latvian: https://www.clarin.si/repository/xmlui/handle/11356/1409&#xd;
- Croatian: https://www.clarin.si/repository/xmlui/handle/11356/1410&#xd;
&#xd;
&#xd;
The dataset_ids folder is organized as follows:&#xd;
&#xd;
- latvian – contains latvian_train.json, a json file with the ids of the train articles, and latvian_test.json, a json file with the ids of the test articles, to replicate the data used in Koloski et al. (2021)&#xd;
&#xd;
- estonian – contains estonian_train.json, a json file with the ids of the train articles, and estonian_test.json, a json file with the ids of the test articles, to replicate the data used in Koloski et al. (2021)&#xd;
&#xd;
- russian – contains russian_train.json, a json file with the ids of the train articles, and russian_test.json, a json file with the ids of the test articles, to replicate the data used in Koloski et al. (2021)&#xd;
&#xd;
- croatian – contains croatian_id_train.tsv, a file with the sites and ids of the articles in the train set, and croatian_id_test.tsv, a file with the sites and ids of the articles in the test set. Note that ids alone are not unique across the dataset, so the site must be combined with the id to obtain a unique article identifier.&#xd;
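
The need for a combined site and id key in the Croatian files can be illustrated with a minimal sketch. It is not part of the provided scripts: the function name read_site_id_pairs, the column names "site" and "id", and the sample rows are all assumptions made for illustration, so the real header of croatian_id_train.tsv should be checked before reusing it.

```python
import csv
import io


def read_site_id_pairs(tsv_text, site_col="site", id_col="id"):
    """Return the set of (site, id) pairs found in tab-separated text.

    The column names are assumptions; adjust them to the actual file header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row[site_col], row[id_col]) for row in reader}


# Made-up rows: the same id occurs under two different sites.
sample = "site\tid\nsite_a\t101\nsite_b\t101\nsite_a\t102\n"

pairs = read_site_id_pairs(sample)
ids_only = {article_id for _, article_id in pairs}

# Three distinct articles, but only two distinct bare ids.
print(len(pairs), len(ids_only))
```

Matching articles on the bare id would merge the two articles that share id 101 in this made-up sample, which is why the pair is used as the key.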
&#xd;
In addition, scripts for extracting the articles are provided in the parse folder (parse.py and build_croatian_dataset.py). The scripts require the pandas and bs4 Python libraries.&#xd;
&#xd;
parse.py is used for extraction of Estonian, Russian and Latvian train and test datasets:&#xd;
&#xd;
Instructions:&#xd;
&#xd;
ESTONIAN-RUSSIAN:&#xd;
1) Retrieve the data (file ee_articles_2015_2019.zip)&#xd;
2) Create a folder 'data' and a subfolder 'ee'&#xd;
3) Unzip the data into the 'data/ee' folder&#xd;
&#xd;
To extract the train/test Estonian articles:&#xd;
run the function 'build_dataset(lang="ee", opt="nat")' in the parse.py script&#xd;
To extract the train/test Russian articles:&#xd;
run the function 'build_dataset(lang="ee", opt="rus")' in the parse.py script&#xd;
&#xd;
LATVIAN:&#xd;
1) Retrieve the Latvian data&#xd;
2) Unzip it into the 'data/lv' folder&#xd;
3) To extract the train/test Latvian articles:&#xd;
run the function 'build_dataset(lang="lv", opt="nat")' in the parse.py script&#xd;
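
The folder preparation steps above (create the target folder, unzip the archive into it) can be sketched as follows. This helper is not part of parse.py; the name unpack_archive is hypothetical, and whether the extracted layout matches what build_dataset expects depends on the archive contents, which are not described here.

```python
import zipfile
from pathlib import Path


def unpack_archive(archive_path, target_dir):
    """Create target_dir (including parents) and extract the archive into it.

    Returns the sorted names of the entries directly inside target_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(entry.name for entry in target.iterdir())


# Example call, assuming the archive was downloaded to the working directory:
# unpack_archive("ee_articles_2015_2019.zip", "data/ee")
```

The same call with 'data/lv' as the target would cover the Latvian step once the Latvian archive has been retrieved.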
&#xd;
build_croatian_dataset.py is used for extraction of Croatian train and test datasets:&#xd;
&#xd;
Instructions:&#xd;
&#xd;
CROATIAN:&#xd;
1) Retrieve the Croatian data (file 'STY_24sata_articles_hr_PUB-01.csv')&#xd;
2) Put the script 'build_croatian_dataset.py' in the same folder as the extracted data and run it (e.g., python build_croatian_dataset.py).&#xd;
&#xd;
&#xd;
For additional questions: {Boshko.Koloski,Matej.Martinc,Senja.Pollak}@ijs.si</dc:description>
<dc:date>2021-06-04</dc:date>
<dc:type>corpus</dc:type>
<dc:identifier>http://hdl.handle.net/11356/1403</dc:identifier>
<dc:language>hrv</dc:language>
<dc:language>est</dc:language>
<dc:language>lav</dc:language>
<dc:language>rus</dc:language>
<dc:relation>info:eu-repo/grantAgreement/EC/H2020/825153</dc:relation>
<dc:relation>https://www.aclweb.org/anthology/2021.hackashop-1.4.pdf</dc:relation>
<dc:rights>Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)</dc:rights>
<dc:rights>https://creativecommons.org/licenses/by-nc-nd/4.0/</dc:rights>
<dc:rights>PUB</dc:rights>
<dc:format>text/plain; charset=utf-8</dc:format>
<dc:format>application/zip</dc:format>
<dc:format>downloadable_files_count: 1</dc:format>
<dc:publisher>Ekspress Meedia Group</dc:publisher>
<dc:publisher>Styria Media Group</dc:publisher>
<dc:source>http://embeddia.eu/</dc:source>
</oai_dc:dc>
</metadata></record></GetRecord></OAI-PMH>