
 
dc.contributor.author Purver, Matthew
dc.contributor.author Pollak, Senja
dc.contributor.author Freienthal, Linda
dc.contributor.author Kuulmets, Hele-Andra
dc.contributor.author Krustok, Ivar
dc.contributor.author Shekhar, Ravi
dc.date.accessioned 2021-05-24T07:51:50Z
dc.date.available 2021-05-24T07:51:50Z
dc.date.issued 2021-04-19
dc.identifier.uri http://hdl.handle.net/11356/1408
dc.description The dataset is an archive of articles from the Ekspress Meedia news site from 2009-2019, containing over 1.4M articles, mostly in Estonian (1,115,120 articles) with some in Russian (325,952 articles). Keywords are included for articles after 2015. These datasets are intended for use in EMBEDDIA, an H2020 project.

The main archive is in file ee_articles_2009_2019. It contains JSON files of all the Estonian articles from 2009 to May 2019. Other files contain derived versions and subsets (please see the README files inside those zip files). In short:
- ee_articles_2015-2019: Estonian and Russian articles covering 5 years, with the tags that were missing in previous versions.
- ee_articles_2015_2019_lemmatized and ee_articles_2009_2014_lemmatized: files preprocessed by TEXTA (contact linda@texta.ee).
- eeandsttarticlelemmasembeddingsand_usage: w2v embeddings trained by TEXTA (contact linda@texta.ee).

Description of the Main Dataset (ee_articles_2009_2019)

There are 12 JSON files:
- articles_2009_ver2.json contains 161394 articles from the year 2009
- articles_2010_ver2.json contains 151033 articles from the year 2010
- articles_2011_ver2.json contains 168273 articles from the year 2011
- articles_2012_ver2.json contains 152772 articles from the year 2012
- articles_2013_ver2.json contains 141012 articles from the year 2013
- articles_2014_ver2.json contains 128388 articles from the year 2014
- articles_2015_ver2.json contains 127425 articles from the year 2015
- articles_2016_ver2.json contains 130704 articles from the year 2016
- articles_2017_ver2.json contains 119318 articles from the year 2017
- articles_2018_ver2.json contains 117388 articles from the year 2018
- articles_2019_Jan-Apr_ver2.json contains 35076 articles from January to April 2019
- articles_2019_May_ver2.json contains 8329 articles from May 2019

In sum: 1 441 112 articles

Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:
- id (integer) - the ID of the article
- title (string) - the title of the article
- lead (string) - the lead of the article (can contain HTML, e.g. <a> tag)
- url (string) - the URL of the article
- tags (list of dictionaries or None) [1] - each dictionary represents one tag. The tag dictionary contains the following:
  - domain_id (string) [2] - the ID of the domain
  - id (string) - the ID of the tag
  - lang (string) - the language of the tag
  - tag (string) - the tag itself, e.g. Kert Kingo (a name)
  - translitted_name (string) - a modified version of the tag, e.g. kert-kingo
- rawBody (string) - the raw text of the article (contains HTML)
- bodyText (string) - clean article text (stripped of HTML)
- publishDate (string) - published date & time of the article
- categoryPrimary (dictionary or empty list) - the dictionary takes one of two forms. Either:
  - categoryId (integer) - the ID of the category
  - categoryName (string) - the name of the category (e.g. World)
  - channelId (integer) - the ID of the channel
  OR:
  - articleId (integer) - the ID of the article
  - categoryId (integer) - the ID of the category
  - categoryName (string) - the name of the category (e.g. World)
  - categoryPrimary (boolean) - unknown
  - categorySort (integer) - unknown
  - categoryUrl (string) - the URL of the category
  - categoryVisible (boolean) - unknown
  - channelId (integer) - the ID of the channel
  - channelUrl (string) - the URL of the channel (e.g. 'https://sport.delfi.ee')
  - directoryName (string) - unknown
  - parentId (integer) - unknown
- channelLanguage (string or None) [3] - the language of the channel
- categoryLanguage (int or None) [4] - unknown
- commentCount (int) [5] - the number of comments
- relatedArticles (list of integers) - a list of related articles' IDs
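The article structure described above can be read with the standard library alone. The following is a minimal sketch, not code shipped with the dataset: the helper names (`load_articles`, `tag_names`, `primary_category_name`, `count_by_channel_language`) are hypothetical, UTF-8 encoding is an assumption, and the sample records, including the `"et"` language value, are synthetic. It mainly shows how one might handle `tags` being None and `categoryPrimary` being an empty list.

```python
import json
from collections import Counter


def load_articles(path):
    """Load one yearly file, e.g. articles_2019_May_ver2.json (a JSON list)."""
    # Assumption: the files are UTF-8 encoded. json.load reads the whole
    # file into memory, which may be heavy for the larger yearly files.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def tag_names(article):
    """Return the tag strings; 'tags' is a list of dictionaries or None."""
    return [t["tag"] for t in (article.get("tags") or [])]


def primary_category_name(article):
    """Return categoryName, or None when categoryPrimary is an empty list."""
    category = article.get("categoryPrimary")
    if isinstance(category, dict):
        return category.get("categoryName")
    return None


def count_by_channel_language(articles):
    """Count articles per channelLanguage value (string or None)."""
    return Counter(a.get("channelLanguage") for a in articles)


# Synthetic records for illustration only; not taken from the dataset.
sample = [
    {"id": 1, "title": "A", "tags": None, "categoryPrimary": [],
     "channelLanguage": None},
    {"id": 2, "title": "B",
     "tags": [{"id": "10", "lang": "et", "tag": "Kert Kingo"}],
     "categoryPrimary": {"categoryId": 5, "categoryName": "World"},
     "channelLanguage": "et"},
]

for article in sample:
    print(article["id"], tag_names(article), primary_category_name(article))
```

With a real file, `load_articles("articles_2019_May_ver2.json")` would replace the synthetic `sample` list, assuming the archive has been unzipped into the working directory.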
dc.language.iso est
dc.language.iso rus
dc.publisher Ekspress Meedia Group
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf
dc.rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.label PUB
dc.source.uri http://embeddia.eu/
dc.subject news corpus
dc.subject lemmatisation
dc.subject word embeddings
dc.title Ekspress news article archive (in Estonian and Russian) 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Matthew Purver m.purver@qmul.ac.uk Queen Mary University
contact.person Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group
contact.person Linda Freienthal linda@texta.ee Ekspress Meedia Group
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info 1441112 texts
files.count 6
files.size 2488315926


 Files in this item

Name Readme.md
Size 2.96 KB
Format Unknown
Description ReadMe
MD5 d815e327a7a8c8035cbc4c6d78f236bb

Name ee_lemmas_embedding.zip
Size 132.77 MB
Format application/zip
Description Lemmas Embedding
MD5 d9ea518deee27784555a74a85ca26f29
File Preview
  • ee_lemmas_embedding
    • ee_lemmas_embedding
    • model.json
    • readme.md
    • test_ee_lemmas_embedding.ipynb

Name ee_articles_2009_2014_lemmatized.zip
Size 126.61 MB
Format application/zip
Description Lemmatized Articles 2009-2014
MD5 5395ce0901b670e637829f0425c3c3aa
File Preview
  • ee_articles_2009_2014_lemmatized
    • readme.md
    • ee_2009_articles_lemmas.jl
    • ee_2010_articles_lemmas.jl
    • ee_2011_articles_lemmas.jl
    • ee_2012_articles_lemmas.jl
    • ee_2013_articles_lemmas.jl
    • ee_2014_articles_lemmas.jl

Name ee_articles_2015_2019_lemmatized.zip
Size 170.59 MB
Format application/zip
Description Lemmatized Articles 2015-2019
MD5 f5b61fdef1483365e867a67bb3c7d424
File Preview
  • ee_articles_2015_2019_lemmatized
    • readme.md
    • ee_2015_articles_lemmas.jl
    • ee_2016_articles_lemmas.jl
    • ee_2017_articles_lemmas.jl
    • ee_2018_articles_lemmas.jl
    • ee_2019_articles_lemmas.jl

Name ee_articles_2015-2019.zip
Size 269.47 MB
Format application/zip
Description Articles 2015-2019
MD5 696f356229acca12cd83d82670af3abd
File Preview
    • readme.md (2 kB)
    • ee_2015.json (247 MB)
    • ee_2016.json (281 MB)
    • ee_2017.json (293 MB)
    • ee_2018.json (271 MB)
    • ee_2019.json (193 MB)

Name ee_articles_2009-2019.zip
Size 1.63 GB
Format application/zip
Description Articles 2009-2019
MD5 c83276775fa0eb918dad1b1d55784b98
File Preview
    • readme.md (5 kB)
    • changelog.md (883 B)
    • articles_2009_ver2.json (686 MB)
    • articles_2010_ver2.json (641 MB)
    • articles_2011_ver2.json (728 MB)
    • articles_2012_ver2.json (715 MB)
    • articles_2013_ver2.json (664 MB)
    • articles_2014_ver2.json (656 MB)
    • articles_2015_ver2.json (657 MB)
    • articles_2016_ver2.json (762 MB)
    • articles_2017_ver2.json (772 MB)
    • articles_2018_ver2.json (794 MB)
    • articles_2019_Jan-Apr_ver2.json (246 MB)
    • articles_2019_May_ver2.json (57 MB)
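
The MD5 values listed for the files above can be used to check a download for corruption. A minimal sketch, assuming the files were saved locally under the names shown in this record; the helper names `md5_of_file` and `verify_download` are hypothetical. Files are read in chunks so that the 1.63 GB archive does not need to fit in memory.

```python
import hashlib

# MD5 checksums as listed in this record, keyed by file name.
EXPECTED_MD5 = {
    "Readme.md": "d815e327a7a8c8035cbc4c6d78f236bb",
    "ee_lemmas_embedding.zip": "d9ea518deee27784555a74a85ca26f29",
    "ee_articles_2009_2014_lemmatized.zip": "5395ce0901b670e637829f0425c3c3aa",
    "ee_articles_2015_2019_lemmatized.zip": "f5b61fdef1483365e867a67bb3c7d424",
    "ee_articles_2015-2019.zip": "696f356229acca12cd83d82670af3abd",
    "ee_articles_2009-2019.zip": "c83276775fa0eb918dad1b1d55784b98",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at 'path' matches the listed MD5 for 'name'."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("ee_articles_2015-2019.zip", "ee_articles_2015-2019.zip")` would return True for an intact copy saved in the working directory.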
