
 
dc.contributor.author Freienthal, Linda
dc.contributor.author Pelicon, Andraž
dc.contributor.author Martinc, Matej
dc.contributor.author Škrlj, Blaž
dc.contributor.author Krustok, Ivar
dc.contributor.author Pranjić, Marko
dc.contributor.author Cabrera-Diego, Luis Adrián
dc.contributor.author Purver, Matthew
dc.contributor.author Pollak, Senja
dc.contributor.author Kuulmets, Hele-Andra
dc.contributor.author Shekhar, Ravi
dc.contributor.author Koloski, Boshko
dc.date.accessioned 2022-02-24T07:56:04Z
dc.date.available 2022-02-24T07:56:04Z
dc.date.issued 2022-02-10
dc.identifier.uri http://hdl.handle.net/11356/1485
dc.description This dataset contains articles from the EMBEDDIA media partners, enriched with the output of tools developed within the EMBEDDIA project:
- 12,390 Estonian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1408
- 5,000 Croatian articles from autumn of 2010 with tags given by 24sata. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1410
- 15,264 Latvian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1409

All the articles in the dataset have been analysed with the texta-mlp Python package (https://pypi.org/project/texta-mlp/) via the EMBEDDIA Media assistant's Texta Toolkit (https://docs.texta.ee/). The following tools were used to analyse the articles:
- Latin1 and Latin2 Named Entity Recognition Tool modules (Cabrera-Diego et al., 2021, both described in https://aclanthology.org/2021.bsnlp-1.12/). The Latin1 results can be found in the folders annotated_articles_ner_latin1/ and annotated_articles_all_tools/, while the Latin2 results are in annotated_articles_ner_latin2/ or annotated_articles_all_tools/.
- RAKUN keyword extractor. RAKUN (Škrlj et al. 2019) is an unsupervised system for keyword extraction, so it can be used for any language. It turns the text into a graph and ranks the nodes by importance; the most important nodes mostly turn out to be the keywords. It is described in https://link.springer.com/chapter/10.1007/978-3-030-31372-2_26. The keyword annotation results can be found in the folder annotated_articles_rakun/ or annotated_articles_all_tools/.
- TNT-KID keyword extractor. TNT-KID (Martinc et al. 2021) is a supervised system for automatic keyword extraction. It was trained on a corpus of articles with human-assigned keywords. For Croatian, the annotators were 24sata editors, for Estonian the Ekspress Meedia staff and for Latvian the Latvian Delfi staff. The system is further documented at https://doi.org/10.1017/S1351324921000127. For Croatian, only TNT-KID was applied, while for Estonian and Latvian, TNT-KID with TF-IDF, an extension by Koloski et al. (https://aclanthology.org/2021.hackashop-1.4.pdf), was used. The results of applying this tool are found in the folder annotated_articles_tnt_kid/ or annotated_articles_all_tools/.
- Sentiment analysis. Our news sentiment analyser (Pelicon et al. 2020) labels a news article as being of positive, negative, or neutral sentiment, using a fine-tuned multilingual BERT model, which was trained on Slovene sentiment-annotated news articles. The system is further documented in https://doi.org/10.3390/app10175993. The results of this tool are found in the folder annotated_articles_sentiment/ or annotated_articles_all_tools/.

All the data is encoded in "JSON Lines" format. Each folder has its own README file, which explains the structure of the files.
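Since the description states that all data is in "JSON Lines" format (one JSON object per line), a file from the archive could be read roughly as sketched below. This is a minimal sketch, not part of the dataset's tooling: the helper name `read_jsonl` is hypothetical, and no field names are assumed because the record structure is documented only in each folder's README.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a JSON Lines file.

    Hypothetical helper; the dataset itself does not ship this function.
    """
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report which line failed so a damaged file is easy to inspect.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
```

For example, after extracting the archive, `next(read_jsonl(Path("EMBEDDIA_tools_output/annotated_articles_all_tools/ee_all_tools_output.jsonl")))` would return the first article record, whose keys can then be compared with the folder's README. The extraction path is an assumption based on the file listing in this record.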
dc.language.iso est
dc.language.iso lav
dc.language.iso hrv
dc.publisher Ekspress Meedia Group
dc.publisher Styria Media Group
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.label PUB
dc.source.uri http://embeddia.eu/
dc.subject keyword extraction
dc.subject named entity recognition
dc.subject sentiment classification
dc.title EMBEDDIA tools output example corpus of Estonian, Croatian and Latvian news articles 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://embeddia.texta.ee/
contact.person Linda Freienthal linda@texta.ee TEXTA
contact.person Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group
contact.person Marko Pranjić marko@entropia.hr Styria Media Group
contact.person Senja Pollak senja.pollak@ijs.si Jožef Stefan Institute
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info 32654 articles
files.count 1
files.size 455374545


 Files in this item

Name EMBEDDIA_tools_output.zip
Size 434.28 MB
Format application/zip
Description The dataset in JSON Lines format, includes README files
MD5 2991cf4e8f309d2ad3e1e7c2d926952d
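The MD5 value listed above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and it assumes that `files.size` in this record is the size in bytes (455374545 bytes is about 434.28 MiB, which matches the listed size).

```python
import hashlib
from pathlib import Path

# Values copied from this record; treating files.size as a byte count is an assumption.
EXPECTED_MD5 = "2991cf4e8f309d2ad3e1e7c2d926952d"
EXPECTED_SIZE = 455374545


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(
    path: Path,
    expected_md5: str = EXPECTED_MD5,
    expected_size: int = EXPECTED_SIZE,
) -> bool:
    """Check the cheap size test first, then compare the MD5 digest."""
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

Calling `verify_download(Path("EMBEDDIA_tools_output.zip"))` returns `True` only if both the size and the checksum match the values in this record. MD5 is suitable here for detecting a corrupted download, not as a security guarantee.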
 File Preview  
  • EMBEDDIA_tools_output
    • annotated_articles_ner_latin1
      • README.md
      • lv_2019_articles_lemmas_ner_Latin1.jsonl
      • ee_2019_articles_lemmas_ner_Latin1.jsonl
      • hr_styria_articles_lemmatized_ner_Latin1.jsonl
    • README.md
    • annotated_articles_rakun
      • lv_2019_articles_lemmas.jsonl
      • README.md
      • hr_styria_articles_lemmatized.jsonl
      • ee_2019_articles_lemmas.jsonl
    • annotated_articles_tnt_kid
      • lv_2019_articles_lemmas.jsonl
      • README.md
      • hr_styria_articles_lemmatized.jsonl
      • ee_2019_articles_lemmas.jsonl
    • annotated_articles_all_tools
      • ee_all_tools_output.jsonl
      • README.md
      • hr_all_tools_output.jsonl
      • lv_all_tools_output.jsonl
    • annotated_articles_sentiment
      • lv_2019_articles_lemmas.jsonl
      • README.md
      • hr_styria_articles_lemmatized.jsonl
      • ee_2019_articles_lemmas.jsonl
    • annotated_articles_ner_latin2
      • hr_styria_articles_lemmatized_ner_Latin2.jsonl
      • README.md
      • lv_2019_articles_lemmas_ner_Latin2.jsonl
      • ee_2019_articles_lemmas_ner_Latin2.jsonl
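
The folder layout shown in the preview can also be inspected without extracting the archive. The sketch below uses Python's standard `zipfile` module to group the `.jsonl` members by their containing folder; the function name is hypothetical, and the exact member paths inside the zip are assumed to follow the preview above.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def jsonl_members_by_folder(archive_path: str) -> Dict[str, List[str]]:
    """Map each folder name in the archive to the sorted .jsonl files it contains."""
    grouped = defaultdict(list)
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            member = PurePosixPath(name)  # zip member names always use "/" separators
            if member.suffix == ".jsonl":
                grouped[member.parent.name].append(member.name)
    return {folder: sorted(files) for folder, files in grouped.items()}
```

Under the assumed layout, `jsonl_members_by_folder("EMBEDDIA_tools_output.zip")` would return one entry per `annotated_articles_*` folder, each listing its Estonian, Croatian and Latvian files.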
