| dc.contributor.author | Pollak, Senja | 
| dc.contributor.author | Purver, Matthew | 
| dc.contributor.author | Shekhar, Ravi | 
| dc.contributor.author | Freienthal, Linda | 
| dc.contributor.author | Kuulmets, Hele-Andra | 
| dc.contributor.author | Krustok, Ivar | 
| dc.date.accessioned | 2021-05-24T09:18:45Z | 
| dc.date.available | 2021-05-24T09:18:45Z | 
| dc.date.issued | 2021-04-19 | 
| dc.identifier.uri | http://hdl.handle.net/11356/1409 | 
| dc.description | This dataset is an archive of articles published on the Delfi news site from 2015 to 2019. It contains over 180,000 articles (c. 50% in Latvian and 50% in Russian), and keywords for the articles are included. There are 5 JSON files: lv_2015.json contains 42 001 articles from 2015; lv_2016_.json contains 40 342 articles from 2016; lv_2017_.json contains 37 256 articles from 2017; lv_2018_.json contains 31 732 articles from 2018; lv_2019_.json contains 29 070 articles from 2019. In sum: 180 401 articles. Description of the dataset: each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following: id (integer) - the ID of the article; title (string) - the title of the article; lead (string) - the lead of the article; tags [1] (list of dictionaries or None) - each dictionary represents one tag and contains domain_id (string) - the ID of the domain, id (string) - the ID of the tag, lang (string) - the language of the tag, tag (string) - the tag itself, e.g. Šokolāde, and translitted_name (string) - a modified version of the tag, e.g. sokolade; rawBody (string) - the raw text of the article (contains HTML); bodyText (string) - the clean article text (stripped of HTML); publishDate (string) - the publication date and time of the article; categoryPrimary (dictionary or empty list) - the dictionary contains categoryId (integer) - the ID of the category, categoryName (string) - the name of the category (e.g. Futbols), channelId (integer) - the ID of the channel, groupId - None, channelLanguage (string) - the language of the channel (nat - Latvian, rus - Russian), and categoryLanguage (integer) - the ID of the channel language; relatedArticles (list of integers or None) - a list of the IDs of related articles; relatedTags (string or None) - related tags, comma-separated. | 
| dc.language.iso | lav | 
| dc.language.iso | rus | 
| dc.publisher | Ekspress Meedia Group | 
| dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 | 
| dc.relation.isreferencedby | https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf | 
| dc.rights | Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) | 
| dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ | 
| dc.rights.label | PUB | 
| dc.source.uri | http://embeddia.eu/ | 
| dc.subject | news corpus | 
| dc.subject | latvian news article | 
| dc.title | Latvian Delfi article archive (in Latvian and Russian) 1.0 | 
| dc.type | corpus | 
| metashare.ResourceInfo#ContentInfo.mediaType | text | 
| has.files | yes | 
| branding | CLARIN.SI data & tools | 
| contact.person | Matthew Purver m.purver@qmul.ac.uk Queen Mary University | 
| contact.person | Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group | 
| sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 | 
| size.info | 180401 texts | 
| files.count | 3 | 
| files.size | 414591386 | 
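
The record layout given in dc.description has several fields that can take two shapes (tags may be None, categoryPrimary may be an empty list, relatedTags may be None). The following is a minimal Python sketch of how such records might be read defensively. The helper names (`load_articles`, `tag_names`, `channel_language`, `related_tags`) and every value in `SAMPLE` are hypothetical and invented for illustration; in particular, the publishDate format and the tag `lang` value shown here are assumptions, not taken from the corpus.

```python
import json
from collections import Counter

# Hypothetical records shaped like the schema in dc.description.
# All values are invented for illustration; the real publishDate
# format and tag language codes may differ.
SAMPLE = json.loads("""
[
  {
    "id": 1,
    "title": "Example title",
    "lead": "Example lead",
    "tags": [{"domain_id": "1", "id": "10", "lang": "lv",
              "tag": "Šokolāde", "translitted_name": "sokolade"}],
    "rawBody": "<p>Example body</p>",
    "bodyText": "Example body",
    "publishDate": "2015-01-01 12:00:00",
    "categoryPrimary": {"categoryId": 5, "categoryName": "Futbols",
                        "channelId": 2, "groupId": null,
                        "channelLanguage": "nat", "categoryLanguage": 1},
    "relatedArticles": null,
    "relatedTags": "a,b"
  },
  {
    "id": 2, "title": "t", "lead": "l", "tags": null,
    "rawBody": "", "bodyText": "", "publishDate": "2015-01-02 08:00:00",
    "categoryPrimary": [], "relatedArticles": [1], "relatedTags": null
  }
]
""")


def load_articles(path):
    """Load one yearly JSON file, which is described as a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def tag_names(article):
    """Return the tag strings of an article; `tags` may be None."""
    return [entry["tag"] for entry in (article.get("tags") or [])]


def channel_language(article):
    """Return 'nat' or 'rus', or None when categoryPrimary is an empty list."""
    category = article.get("categoryPrimary")
    if isinstance(category, dict):
        return category.get("channelLanguage")
    return None


def related_tags(article):
    """Split the comma-separated relatedTags string; None yields an empty list."""
    raw = article.get("relatedTags")
    if not raw:
        return []
    return [part.strip() for part in raw.split(",") if part.strip()]


# Count articles per channel language (None = no primary category).
language_counts = Counter(channel_language(a) for a in SAMPLE)
```

With a real file, `load_articles("lv_2015.json")` would take the place of `SAMPLE`, assuming the file has been extracted from the archive under that name.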
Files in this item
Download all files in this item (395.39 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

- Name: Readme.md
- Size: 3.84 KB
- Format: Unknown
- Description: ReadMe
- MD5: c7808363266a29d051a1c37446e9eb27

- Name: lv_articles_entire_collection.zip
- Size: 241.19 MB
- Format: application/zip
- Description: Articles
- MD5: 79ef9464190b2bab106a7846020e42da

- Name: lv_articles_2015_2019_lemmas.zip
- Size: 154.2 MB
- Format: application/zip
- Description: Articles 2015-2019 Lemmas
- MD5: 7b5c7501ccab5ca7bb27a0c84339a272
- lv_articles_2015_2019_lemmas/
  - lv_2019_articles_lemmas.jl
  - lv_2016_articles_lemmas.jl
  - readme.md
  - lv_2018_articles_lemmas.jl
  - lv_2015_articles_lemmas.jl
  - lv_2017_articles_lemmas.jl
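
The MD5 values listed for each file can be used to check that a download is intact. The following is a minimal Python sketch using the standard `hashlib` module. The helper names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the files were saved locally under the names shown in the listing.

```python
import hashlib

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "Readme.md": "c7808363266a29d051a1c37446e9eb27",
    "lv_articles_entire_collection.zip": "79ef9464190b2bab106a7846020e42da",
    "lv_articles_2015_2019_lemmas.zip": "7b5c7501ccab5ca7bb27a0c84339a272",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, listed_name):
    """Return True if the file at `path` matches the MD5 listed for `listed_name`."""
    return md5_of_file(path) == EXPECTED_MD5[listed_name]
```

For example, `verify_download("lv_articles_entire_collection.zip", "lv_articles_entire_collection.zip")` would return True for an uncorrupted copy. Reading in chunks keeps memory use low for the larger archives.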
 
