<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href='static/style.xsl' type='text/xsl'?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-05-21T20:37:17Z</responseDate><request verb="GetRecord" identifier="oai:www.clarin.si:11356/1485" metadataPrefix="oai_dc">http://www.clarin.si/repository/oai/request</request><GetRecord><record><header><identifier>oai:www.clarin.si:11356/1485</identifier><datestamp>2022-02-24T07:56:04Z</datestamp><setSpec>hdl_11356_1023</setSpec><setSpec>hdl_11356_1024</setSpec></header><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>EMBEDDIA tools output example corpus of Estonian, Croatian and Latvian news articles 1.0</dc:title>
<dc:creator>Freienthal, Linda</dc:creator>
<dc:creator>Pelicon, Andraž</dc:creator>
<dc:creator>Martinc, Matej</dc:creator>
<dc:creator>Škrlj, Blaž</dc:creator>
<dc:creator>Krustok, Ivar</dc:creator>
<dc:creator>Pranjić, Marko</dc:creator>
<dc:creator>Cabrera-Diego, Luis Adrián</dc:creator>
<dc:creator>Purver, Matthew</dc:creator>
<dc:creator>Pollak, Senja</dc:creator>
<dc:creator>Kuulmets, Hele-Andra</dc:creator>
<dc:creator>Shekhar, Ravi</dc:creator>
<dc:creator>Koloski, Boshko</dc:creator>
<dc:subject>keyword extraction</dc:subject>
<dc:subject>named entity recognition</dc:subject>
<dc:subject>sentiment classification</dc:subject>
<dc:description>This dataset contains articles from EMBEDDIA Media partners with various information added by the tools developed within the EMBEDDIA project:&#xd;
- 12,390 Estonian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1408&#xd;
- 5,000 Croatian articles from autumn of 2010 with tags given by 24sata. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1410&#xd;
- 15,264 Latvian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1409&#xd;
&#xd;
All the articles in the dataset have been analysed with the texta-mlp Python package (https://pypi.org/project/texta-mlp/) via the EMBEDDIA Media assistant's Texta Toolkit (https://docs.texta.ee/). The following tools were used to analyse the articles:&#xd;
&#xd;
- Latin1 and Latin2 Named Entity Recognition Tool modules (Cabrera-Diego et al., 2021), both described in https://aclanthology.org/2021.bsnlp-1.12/. The Latin1 results can be found in the folder annotated_articles_ner_latin1/ or annotated_articles_all_tools/, while the Latin2 results are in annotated_articles_nerlatin2/ or annotated_articles_all_tools/.&#xd;
&#xd;
- RAKUN keyword extractor. RAKUN (Škrlj et al., 2019) is an unsupervised system for keyword extraction, so it can be used for any language. It detects keywords by turning the text into a graph; the most important nodes in the graph mostly turn out to be the keywords. It is described in https://link.springer.com/chapter/10.1007/978-3-030-31372-2_26. The keyword annotation results can be found in the folder annotated_articles_rakun/ or annotated_articles_all_tools/.&#xd;
&#xd;
- TNT-KID keyword extractor. TNT-KID (Martinc et al., 2021) is a supervised system for automatic keyword extraction. It was trained on a corpus of articles with human-assigned keywords. For Croatian, the annotators were 24sata editors, for Estonian the Ekspress Meedia staff, and for Latvian the Latvian Delfi staff. The system is further documented at https://doi.org/10.1017/S1351324921000127. For Croatian, only TNT-KID was applied, while for Estonian and Latvian, TNT-KID with TF-IDF, an extension by Koloski et al. (https://aclanthology.org/2021.hackashop-1.4.pdf), was used. The results of applying this tool can be found in the folder annotated_articles_tnt_kid/ or annotated_articles_all_tools/.&#xd;
&#xd;
- Sentiment analysis. Our news sentiment analyser (Pelicon et al., 2020) labels a news article as being of positive, negative, or neutral sentiment, using a fine-tuned multilingual BERT model that was trained on Slovene sentiment-annotated news articles. The system is further documented in https://doi.org/10.3390/app10175993. The results of this tool can be found in the folder annotated_articles_sentiment/ or annotated_articles_all_tools/.&#xd;
&#xd;
All the data is encoded in "JSON Lines" format. Each folder has its own README file which explains the structure of the files.</dc:description>
<dc:date>2022-02-10</dc:date>
<dc:type>corpus</dc:type>
<dc:identifier>http://hdl.handle.net/11356/1485</dc:identifier>
<dc:language>est</dc:language>
<dc:language>lav</dc:language>
<dc:language>hrv</dc:language>
<dc:relation>info:eu-repo/grantAgreement/EC/H2020/825153</dc:relation>
<dc:rights>Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)</dc:rights>
<dc:rights>https://creativecommons.org/licenses/by-nc-nd/4.0/</dc:rights>
<dc:rights>PUB</dc:rights>
<dc:format>text/plain; charset=utf-8</dc:format>
<dc:format>application/zip</dc:format>
<dc:format>downloadable_files_count: 1</dc:format>
<dc:publisher>Ekspress Meedia Group</dc:publisher>
<dc:publisher>Styria Media Group</dc:publisher>
<dc:source>http://embeddia.eu/</dc:source>
</oai_dc:dc>
</metadata></record></GetRecord></OAI-PMH>