EMBEDDIA tools output example corpus of Estonian, Croatian and Latvian news articles 1.0

Name: EMBEDDIA tools output example corpus of Estonian, Croatian and Latvian news articles 1.0
License: https://creativecommons.org/licenses/by-nc-nd/4.0/

Freienthal, Linda; Pelicon, Andraž; Martinc, Matej; Škrlj, Blaž; Krustok, Ivar; Pranjić, Marko; Cabrera-Diego, Luis Adrián; Purver, Matthew; Pollak, Senja; Kuulmets, Hele-Andra; Shekhar, Ravi; Koloski, Boshko

dc.contributor.author	Freienthal, Linda
dc.contributor.author	Pelicon, Andraž
dc.contributor.author	Martinc, Matej
dc.contributor.author	Škrlj, Blaž
dc.contributor.author	Krustok, Ivar
dc.contributor.author	Pranjić, Marko
dc.contributor.author	Cabrera-Diego, Luis Adrián
dc.contributor.author	Purver, Matthew
dc.contributor.author	Pollak, Senja
dc.contributor.author	Kuulmets, Hele-Andra
dc.contributor.author	Shekhar, Ravi
dc.contributor.author	Koloski, Boshko
dc.date.accessioned	2022-02-24T07:56:04Z
dc.date.available	2022-02-24T07:56:04Z
dc.date.issued	2022-02-10
dc.identifier.uri	http://hdl.handle.net/11356/1485
dc.description	This dataset contains articles from EMBEDDIA Media partners, enriched with information added by the tools developed within the EMBEDDIA project:
- 12,390 Estonian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1408
- 5,000 Croatian articles from autumn of 2010 with tags given by 24sata. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1410
- 15,264 Latvian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1409
All the articles in the dataset have been analysed with the texta-mlp Python package (https://pypi.org/project/texta-mlp/) via the EMBEDDIA Media assistant's Texta Toolkit (https://docs.texta.ee/). The following tools were used to analyse the articles:
- Latin1 and Latin2 Named Entity Recognition modules (Cabrera-Diego et al., 2021; both described in https://aclanthology.org/2021.bsnlp-1.12/). The Latin1 results can be found in the folders annotated_articles_ner_latin1/ and annotated_articles_all_tools/, and the Latin2 results in annotated_articles_ner_latin2/ and annotated_articles_all_tools/.
- RAKUN keyword extractor. RAKUN (Škrlj et al., 2019) is an unsupervised system for keyword extraction, so it can be used for any language. It detects keywords by turning the text into a graph; the most important nodes in the graph mostly turn out to be the keywords. It is described in https://link.springer.com/chapter/10.1007/978-3-030-31372-2_26. The keyword annotation results can be found in the folders annotated_articles_rakun/ and annotated_articles_all_tools/.
- TNT-KID keyword extractor. TNT-KID (Martinc et al., 2021) is a supervised system for automatic keyword extraction. It was trained on a corpus of articles with human-assigned keywords. For Croatian, the annotators were 24sata editors, for Estonian the Ekspress Meedia staff, and for Latvian the Latvian Delfi staff. The system is further documented at https://doi.org/10.1017/S1351324921000127. For Croatian, only TNT-KID was applied, while for Estonian and Latvian, TNT-KID with TF-IDF, an extension by Koloski et al. (https://aclanthology.org/2021.hackashop-1.4.pdf), was used. The results of applying this tool can be found in the folders annotated_articles_tnt_kid/ and annotated_articles_all_tools/.
- Sentiment analysis. Our news sentiment analyser (Pelicon et al., 2020) labels a news article as being of positive, negative, or neutral sentiment, using a fine-tuned multilingual BERT model that was trained on Slovene sentiment-annotated news articles. The system is further documented at https://doi.org/10.3390/app10175993. The results of this tool can be found in the folders annotated_articles_sentiment/ and annotated_articles_all_tools/.
All the data is encoded in "JSON Lines" format. Each folder has its own README file which explains the structure of the files.
dc.language.iso	est
dc.language.iso	lav
dc.language.iso	hrv
dc.publisher	Ekspress Meedia Group
dc.publisher	Styria Media Group
dc.relation	info:eu-repo/grantAgreement/EC/H2020/825153
dc.rights	Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.label	PUB
dc.source.uri	http://embeddia.eu/
dc.subject	keyword extraction
dc.subject	named entity recognition
dc.subject	sentiment classification
dc.title	EMBEDDIA tools output example corpus of Estonian, Croatian and Latvian news articles 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://embeddia.texta.ee/
contact.person	Linda Freienthal linda@texta.ee TEXTA
contact.person	Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group
contact.person	Marko Pranjić marko@entropia.hr Styria Media Group
contact.person	Senja Pollak senja.pollak@ijs.si Jožef Stefan Institute
sponsor	European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info	32654 articles
files.count	1
files.size	455374545

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Name: EMBEDDIA_tools_output.zip
Size: 434.28 MB
Format: application/zip
Description: The dataset in JSON Lines format, includes README files
MD5: 2991cf4e8f309d2ad3e1e7c2d926952d
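
The MD5 value above can be used to check that the archive was downloaded intact. The following is a minimal sketch in Python, assuming the archive was saved under its listed name in the current directory; the function name md5_of_file and the chunk size are illustrative choices, not part of the repository.

```python
import hashlib
from pathlib import Path

# Checksum as published in the item record above.
EXPECTED_MD5 = "2991cf4e8f309d2ad3e1e7c2d926952d"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for the ~434 MB archive.
    """
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Compare the computed digest with the published one (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling is_intact(Path("EMBEDDIA_tools_output.zip")) should return True for an undamaged download; the local path is an assumption about where the file was saved.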

File Preview

EMBEDDIA_tools_output
- annotated_articles_ner_latin1
  - README.md
  - lv_2019_articles_lemmas_ner_Latin1.jsonl
  - ee_2019_articles_lemmas_ner_Latin1.jsonl
  - hr_styria_articles_lemmatized_ner_Latin1.jsonl
- README.md
- annotated_articles_rakun
  - lv_2019_articles_lemmas.jsonl
  - README.md
  - hr_styria_articles_lemmatized.jsonl
  - ee_2019_articles_lemmas.jsonl
- annotated_articles_tnt_kid
  - lv_2019_articles_lemmas.jsonl
  - README.md
  - hr_styria_articles_lemmatized.jsonl
  - ee_2019_articles_lemmas.jsonl
- annotated_articles_all_tools
  - ee_all_tools_output.jsonl
  - README.md
  - hr_all_tools_output.jsonl
  - lv_all_tools_output.jsonl
- annotated_articles_sentiment
  - lv_2019_articles_lemmas.jsonl
  - README.md
  - hr_styria_articles_lemmatized.jsonl
  - ee_2019_articles_lemmas.jsonl
- annotated_articles_ner_latin2
  - hr_styria_articles_lemmatized_ner_Latin2.jsonl
  - README.md
  - lv_2019_articles_lemmas_ner_Latin2.jsonl
  - ee_2019_articles_lemmas_ner_Latin2.jsonl
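
Since every data file in the archive is in JSON Lines format (one JSON object per line), the files can be streamed record by record rather than loaded whole. The sketch below shows one way to do this in Python. The helper name read_jsonl is hypothetical, and no field names are assumed here because the record structure is documented only in each folder's README file.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file.

    UTF-8 is assumed as the encoding; blank lines are skipped.
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, after unpacking the archive, iterating over read_jsonl(Path("EMBEDDIA_tools_output/annotated_articles_all_tools/ee_all_tools_output.jsonl")) would yield the Estonian articles one at a time; the unpacked location is an assumption, and the keys of each record should be looked up in the README of the corresponding folder.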
