dc.contributor.author | Pollak, Senja |
dc.contributor.author | Purver, Matthew |
dc.contributor.author | Shekhar, Ravi |
dc.contributor.author | Freienthal, Linda |
dc.contributor.author | Kuulmets, Hele-Andra |
dc.contributor.author | Krustok, Ivar |
dc.date.accessioned | 2021-05-24T09:18:45Z |
dc.date.available | 2021-05-24T09:18:45Z |
dc.date.issued | 2021-04-19 |
dc.identifier.uri | http://hdl.handle.net/11356/1409 |
dc.description | This dataset is an archive of articles from the Delfi news site from 2015-2019, containing over 180,000 articles (c. 50% in Latvian and 50% in Russian). Keywords for articles are included.

There are 5 JSON files:
- lv_2015.json contains 42 001 articles from the year 2015
- lv_2016_.json contains 40 342 articles from the year 2016
- lv_2017_.json contains 37 256 articles from the year 2017
- lv_2018_.json contains 31 732 articles from the year 2018
- lv_2019_.json contains 29 070 articles from the year 2019

In sum: 180 401 articles

Description of the dataset

Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:
- id (integer) - the ID of the article
- title (string) - the title of the article
- lead (string) - the lead of the article
- tags [1] (list of dictionaries or None) - each dictionary represents one tag. The tag dictionary contains the following:
  - domain_id (string) - the ID of the domain
  - id (string) - the ID of the tag
  - lang (string) - the language of the tag
  - tag (string) - the tag itself, e.g. Šokolāde
  - translitted_name (string) - a transliterated version of the tag, e.g. sokolade
- rawBody (string) - the raw text of the article (contains HTML)
- bodyText (string) - clean article text (stripped of HTML)
- publishDate (string) - publication date and time of the article
- categoryPrimary (dictionary or empty list) - the dictionary contains the following information:
  - categoryId (integer) - the ID of the category
  - categoryName (string) - the name of the category (e.g. Futbols)
  - channelId (integer) - the ID of the channel
  - groupId - None
  - channelLanguage (string) - the language of the channel (nat - Latvian, rus - Russian)
  - categoryLanguage (integer) - ID of the channel language
- relatedArticles (list of integers or None) - a list of the IDs of related articles
- relatedTags (string or None) - related tags, comma-separated |
dc.language.iso | lav |
dc.language.iso | rus |
dc.publisher | Ekspress Meedia Group |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.relation.isreferencedby | https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf |
dc.rights | Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://embeddia.eu/ |
dc.subject | news corpus |
dc.subject | latvian news article |
dc.title | Latvian Delfi article archive (in Latvian and Russian) 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Matthew Purver m.purver@qmul.ac.uk Queen Mary University |
contact.person | Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
size.info | 180401 texts |
files.count | 3 |
files.size | 414591386 |
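The record structure given in dc.description can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the helper names (`load_articles`, `tag_names`, `channel_language`, `language_counts`) are hypothetical, and it assumes that `channelLanguage` sits inside the `categoryPrimary` dictionary, which is one reading of the description. It handles the two optional cases the description mentions: `tags` may be None, and `categoryPrimary` may be an empty list instead of a dictionary.

```python
import json
from collections import Counter


def load_articles(path):
    """Load one yearly file; per the description it is a JSON list of article dictionaries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def tag_names(article):
    """Return the tag strings of an article; `tags` may be None, so fall back to an empty list."""
    return [t["tag"] for t in (article.get("tags") or [])]


def channel_language(article):
    """Return 'nat' (Latvian) or 'rus' (Russian), or None when categoryPrimary is an empty list.

    Assumption: channelLanguage is a key of the categoryPrimary dictionary.
    """
    category = article.get("categoryPrimary")
    if isinstance(category, dict):
        return category.get("channelLanguage")
    return None


def language_counts(articles):
    """Count articles per channel language (None collects articles without a category)."""
    return Counter(channel_language(a) for a in articles)


if __name__ == "__main__":
    # File name taken from the description; adjust the path to wherever the archive was unpacked.
    articles = load_articles("lv_2015.json")
    print(len(articles), language_counts(articles))
```

Checking with `isinstance` rather than truthiness keeps the empty-list case and a genuinely missing key on the same code path, so the counts never raise on records without a category.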
Files in this item
Download all files in item (395.39 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

- Name: Readme.md
- Size: 3.84 KB
- Format: Unknown
- Description: ReadMe
- MD5: c7808363266a29d051a1c37446e9eb27

- Name: lv_articles_entire_collection.zip
- Size: 241.19 MB
- Format: application/zip
- Description: Articles
- MD5: 79ef9464190b2bab106a7846020e42da

- Name: lv_articles_2015_2019_lemmas.zip
- Size: 154.2 MB
- Format: application/zip
- Description: Articles 2015-2019 Lemmas
- MD5: 7b5c7501ccab5ca7bb27a0c84339a272
- __MACOSX
  - ._lv_articles_2015_2019_lemmas (1 B)
  - lv_articles_2015_2019_lemmas
    - ._lv_2018_articles_lemmas.jl (1 B)
    - ._lv_2015_articles_lemmas.jl (1 B)
    - ._readme.md (1 B)
    - ._lv_2017_articles_lemmas.jl (1 B)
    - ._lv_2019_articles_lemmas.jl (1 B)
    - ._lv_2016_articles_lemmas.jl (1 B)
- lv_articles_2015_2019_lemmas
  - lv_2019_articles_lemmas.jl (1 B)
  - lv_2016_articles_lemmas.jl (1 B)
  - readme.md (1 B)
  - lv_2018_articles_lemmas.jl (1 B)
  - lv_2015_articles_lemmas.jl (1 B)
  - lv_2017_articles_lemmas.jl (1 B)
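
The MD5 values listed for the three files can be used to check a download before unpacking it. The sketch below is one way to do that with Python's standard `hashlib`; the function names (`md5_of_file`, `verify`) and the chunk size are illustrative choices, and the file is assumed to have been saved locally under its listed name.

```python
import hashlib

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "Readme.md": "c7808363266a29d051a1c37446e9eb27",
    "lv_articles_entire_collection.zip": "79ef9464190b2bab106a7846020e42da",
    "lv_articles_2015_2019_lemmas.zip": "7b5c7501ccab5ca7bb27a0c84339a272",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]


if __name__ == "__main__":
    # Assumes the archive sits in the current directory under its listed name.
    name = "lv_articles_entire_collection.zip"
    print(name, "OK" if verify(name, name) else "checksum mismatch")
```

MD5 is suitable here only as an integrity check against transfer errors, which is the purpose the listing serves; it is not a protection against deliberate tampering.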