dc.contributor.author | Freienthal, Linda |
dc.contributor.author | Pelicon, Andraž |
dc.contributor.author | Martinc, Matej |
dc.contributor.author | Škrlj, Blaž |
dc.contributor.author | Krustok, Ivar |
dc.contributor.author | Pranjić, Marko |
dc.contributor.author | Cabrera-Diego, Luis Adrián |
dc.contributor.author | Purver, Matthew |
dc.contributor.author | Pollak, Senja |
dc.contributor.author | Kuulmets, Hele-Andra |
dc.contributor.author | Shekhar, Ravi |
dc.contributor.author | Koloski, Boshko |
dc.date.accessioned | 2022-02-24T07:56:04Z |
dc.date.available | 2022-02-24T07:56:04Z |
dc.date.issued | 2022-02-10 |
dc.identifier.uri | http://hdl.handle.net/11356/1485 |
dc.description | This dataset contains articles from EMBEDDIA media partners with various information added by the tools developed within the EMBEDDIA project: - 12,390 Estonian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1408 - 5,000 Croatian articles from autumn of 2010 with tags given by 24sata. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1410 - 15,264 Latvian articles from 2019 with tags given by Ekspress Meedia. The complete dataset without the output of EMBEDDIA tools is available at http://hdl.handle.net/11356/1409 All the articles in the dataset have been analysed with the texta-mlp Python package (https://pypi.org/project/texta-mlp/) via the EMBEDDIA Media assistant's Texta Toolkit (https://docs.texta.ee/). The tools used to analyse the articles were the following: - Latin1 and Latin2 Named Entity Recognition Tool modules (Cabrera-Diego et al., 2021; both described in https://aclanthology.org/2021.bsnlp-1.12/). The Latin1 results can be found in the folders annotated_articles_ner_latin1/ or annotated_articles_all_tools/, while the Latin2 results are in annotated_articles_ner_latin2/ or annotated_articles_all_tools/. - RAKUN keyword extractor. RAKUN (Škrlj et al. 2019) is an unsupervised system for keyword extraction, so it can be used for any language. It detects keywords by turning the text into a graph; the most important nodes in the graph mostly turn out to be the keywords. It is described in https://link.springer.com/chapter/10.1007/978-3-030-31372-2_26. The keyword annotation results can be found in the folder annotated_articles_rakun/ or annotated_articles_all_tools/. - TNT-KID keyword extractor. TNT-KID (Martinc et al. 2021) is a supervised system for automatic keyword extraction. It was trained on a corpus of articles with human-assigned keywords.
For Croatian, the annotators were 24sata editors, for Estonian the Ekspress Meedia staff, and for Latvian the Latvian Delfi staff. The system is further documented at https://doi.org/10.1017/S1351324921000127. For Croatian only TNT-KID was applied, while for Estonian and Latvian, TNT-KID with TF-IDF, an extension by Koloski et al. (https://aclanthology.org/2021.hackashop-1.4.pdf), was used. The results of applying this tool are found in the folder annotated_articles_tnt_kid/ or annotated_articles_all_tools/. - Sentiment analysis. Our news sentiment analyser (Pelicon et al. 2020) labels a news article as being of positive, negative, or neutral sentiment, using a fine-tuned multilingual BERT model, which was trained on Slovene sentiment-annotated news articles. The system is further documented in https://doi.org/10.3390/app10175993. The results of this tool are found in the folder annotated_articles_sentiment/ or annotated_articles_all_tools/. All the data is encoded in "JSON Lines" format. Each folder has its own README file which explains the structure of the files. |
dc.language.iso | est |
dc.language.iso | lav |
dc.language.iso | hrv |
dc.publisher | Ekspress Meedia Group |
dc.publisher | Styria Media Group |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.rights | Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://embeddia.eu/ |
dc.subject | keyword extraction |
dc.subject | named entity recognition |
dc.subject | sentiment classification |
dc.title | EMBEDDIA tools output example corpus of Estonian, Croatian and Latvian news articles 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://embeddia.texta.ee/ |
contact.person | Linda Freienthal linda@texta.ee TEXTA |
contact.person | Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group |
contact.person | Marko Pranjić marko@entropia.hr Styria Media Group |
contact.person | Senja Pollak senja.pollak@ijs.si Jožef Stefan Institute |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
size.info | 32654 articles |
files.count | 1 |
files.size | 455374545 |
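The description states that all data is encoded in JSON Lines format, meaning one JSON object per line. A minimal Python sketch for inspecting such a file follows. The helper names `read_jsonl` and `peek_keys` are hypothetical, and the sketch assumes nothing about the record fields, which are documented in each folder's README.

```python
import json
from pathlib import Path
from typing import Iterator, List


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def peek_keys(path: Path) -> List[str]:
    """Return the sorted top-level keys of the first record, or [] if the file is empty."""
    for record in read_jsonl(path):
        return sorted(record)
    return []
```

Calling `peek_keys` on one of the extracted `.jsonl` files (the exact path depends on where the archive is unpacked) lists the fields present in the first article record without loading the whole file into memory.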
Files in this item
- Name: EMBEDDIA_tools_output.zip
- Size: 434.28 MB
- Format: application/zip
- Description: The dataset in JSON Lines format, includes README files
- MD5: 2991cf4e8f309d2ad3e1e7c2d926952d
- EMBEDDIA_tools_output
  - annotated_articles_ner_latin1
    - README.md
    - lv_2019_articles_lemmas_ner_Latin1.jsonl
    - ee_2019_articles_lemmas_ner_Latin1.jsonl
    - hr_styria_articles_lemmatized_ner_Latin1.jsonl
  - README.md
  - annotated_articles_rakun
    - lv_2019_articles_lemmas.jsonl
    - README.md
    - hr_styria_articles_lemmatized.jsonl
    - ee_2019_articles_lemmas.jsonl
  - annotated_articles_tnt_kid
    - lv_2019_articles_lemmas.jsonl
    - README.md
    - hr_styria_articles_lemmatized.jsonl
    - ee_2019_articles_lemmas.jsonl
  - annotated_articles_all_tools
    - ee_all_tools_output.jsonl
    - README.md
    - hr_all_tools_output.jsonl
    - lv_all_tools_output.jsonl
  - annotated_articles_sentiment
    - lv_2019_articles_lemmas.jsonl
    - README.md
    - hr_styria_articles_lemmatized.jsonl
    - ee_2019_articles_lemmas.jsonl
  - annotated_articles_ner_latin2
    - hr_styria_articles_lemmatized_ner_Latin2.jsonl
    - README.md
    - lv_2019_articles_lemmas_ner_Latin2.jsonl
    - ee_2019_articles_lemmas_ner_Latin2.jsonl
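The record lists an MD5 checksum for EMBEDDIA_tools_output.zip, so a download can be checked before extraction. The following is a minimal sketch using Python's standard `hashlib`. The function names and the chunk size are illustrative assumptions, not part of the dataset's tooling.

```python
import hashlib
from pathlib import Path

# MD5 value as listed in this record for EMBEDDIA_tools_output.zip
EXPECTED_MD5 = "2991cf4e8f309d2ad3e1e7c2d926952d"


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use flat for an archive of this size (434.28 MB). MD5 is adequate here for detecting a corrupted download, but it is not a guarantee against deliberate tampering.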