dc.contributor.author Pollak, Senja
dc.contributor.author Purver, Matthew
dc.contributor.author Shekhar, Ravi
dc.contributor.author Freienthal, Linda
dc.contributor.author Kuulmets, Hele-Andra
dc.contributor.author Krustok, Ivar
dc.date.accessioned 2021-05-24T09:18:45Z
dc.date.available 2021-05-24T09:18:45Z
dc.date.issued 2021-04-19
dc.identifier.uri http://hdl.handle.net/11356/1409
dc.description This dataset is an archive of articles from the Delfi news site from 2015-2019, containing over 180,000 articles (c. 50% in Latvian and 50% in Russian). Keywords for articles are included.
  There are 5 JSON files:
    • lv_2015.json contains 42 001 articles from the year 2015
    • lv_2016.json contains 40 342 articles from the year 2016
    • lv_2017.json contains 37 256 articles from the year 2017
    • lv_2018.json contains 31 732 articles from the year 2018
    • lv_2019.json contains 29 070 articles from the year 2019
  In sum: 180 401 articles
  Description of the dataset
  Each JSON file is a list of dictionaries, i.e. each article is represented as a dictionary. Each dictionary contains the following:
    • id (integer) - the ID of the article
    • title (string) - the title of the article
    • lead (string) - the lead of the article
    • tags (list of dictionaries or None) - each dictionary represents one tag. The tag dictionary contains the following:
      • domain_id (string) - the ID of the domain
      • id (string) - the ID of the tag
      • lang (string) - the language of the tag
      • tag (string) - the tag itself, e.g. Šokolāde
      • translitted_name (string) - a modified version of the tag, e.g. sokolade
    • rawBody (string) - the raw text of the article (contains HTML)
    • bodyText (string) - clean article text (stripped of HTML)
    • publishDate (string) - publication date and time of the article
    • categoryPrimary (dictionary or empty list) - the dictionary contains the following information:
      • categoryId (integer) - the ID of the category
      • categoryName (string) - the name of the category (e.g. Futbols)
      • channelId (integer) - the ID of the channel
      • groupId - None
      • channelLanguage (string) - the language of the channel (nat - Latvian, rus - Russian)
      • categoryLanguage (integer) - ID of the channel language
    • relatedArticles (list of integers or None) - a list of the IDs of related articles
    • relatedTags (string or None) - related tags, comma-separated
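The record layout described above can be explored with a short script. The following is a minimal sketch, not part of the dataset's own tooling: it assumes the JSON files are UTF-8 encoded and small enough to load into memory at once, and the helper names `load_articles` and `summarize` are hypothetical. It counts articles per channel language and counts how many articles carry at least one tag, handling the documented cases where `categoryPrimary` is an empty list and `tags` is None.

```python
import json
from collections import Counter


def load_articles(path):
    """Load one yearly file (a JSON list of article dictionaries).

    Assumption: the file is UTF-8 encoded and fits in memory.
    """
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(articles):
    """Return (articles per channelLanguage, number of tagged articles)."""
    languages = Counter()
    tagged = 0
    for article in articles:
        category = article.get("categoryPrimary")
        # categoryPrimary is documented as a dictionary or an empty list.
        if isinstance(category, dict):
            languages[category.get("channelLanguage")] += 1
        else:
            languages["unknown"] += 1
        # tags is documented as a list of dictionaries or None.
        if article.get("tags"):
            tagged += 1
    return languages, tagged


if __name__ == "__main__":
    # Tiny in-memory sample that mimics the documented fields;
    # the values are invented for illustration only.
    sample = [
        {"id": 1, "tags": None,
         "categoryPrimary": {"channelLanguage": "nat"}},
        {"id": 2, "tags": [{"tag": "Šokolāde"}],
         "categoryPrimary": []},
    ]
    languages, tagged = summarize(sample)
    print(dict(languages), tagged)
```

To run it on the real data, something like `summarize(load_articles("lv_2019.json"))` should work once the archive is unpacked; the uncompressed yearly files are over 200 MB each, so a streaming parser may be preferable on machines with little memory.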
dc.language.iso lav
dc.language.iso rus
dc.publisher Ekspress Meedia Group
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf
dc.rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.label PUB
dc.source.uri http://embeddia.eu/
dc.subject news corpus
dc.subject latvian news article
dc.title Latvian Delfi article archive (in Latvian and Russian) 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Matthew Purver m.purver@qmul.ac.uk Queen Mary University
contact.person Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info 180401 texts
files.count 3
files.size 414591386


Files in this item

Download all files in the item (395.39 MB)
Name: Readme.md
Size: 3.84 KB
Format: Unknown
Description: ReadMe
MD5: c7808363266a29d051a1c37446e9eb27

Name: lv_articles_entire_collection.zip
Size: 241.19 MB
Format: application/zip
Description: Articles
MD5: 79ef9464190b2bab106a7846020e42da
File preview:
    • readme.md (2 kB)
    • lv_2019.json (212 MB)
    • NDA_Latvia.pdf (751 kB)
    • lv_2018.json (216 MB)
    • lv_2017.json (231 MB)
    • lv_2016.json (239 MB)
    • lv_2015.json (247 MB)

Name: lv_articles_2015_2019_lemmas.zip
Size: 154.2 MB
Format: application/zip
Description: Articles 2015-2019 Lemmas
MD5: 7b5c7501ccab5ca7bb27a0c84339a272
File preview (the preview reports the size of every entry as -1 B):
  • __MACOSX
    • ._lv_articles_2015_2019_lemmas
    • lv_articles_2015_2019_lemmas
      • ._lv_2018_articles_lemmas.jl
      • ._lv_2015_articles_lemmas.jl
      • ._readme.md
      • ._lv_2017_articles_lemmas.jl
      • ._lv_2019_articles_lemmas.jl
      • ._lv_2016_articles_lemmas.jl
  • lv_articles_2015_2019_lemmas
    • lv_2019_articles_lemmas.jl
    • lv_2016_articles_lemmas.jl
    • readme.md
    • lv_2018_articles_lemmas.jl
    • lv_2015_articles_lemmas.jl
    • lv_2017_articles_lemmas.jl
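
The MD5 values listed for the three files can be used to check a download for corruption. The sketch below is one possible way to do that with Python's standard `hashlib` module; the function names `md5_of_file` and `verify` are hypothetical, and the local file paths are assumed to be wherever the files were saved.

```python
import hashlib

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "Readme.md": "c7808363266a29d051a1c37446e9eb27",
    "lv_articles_entire_collection.zip": "79ef9464190b2bab106a7846020e42da",
    "lv_articles_2015_2019_lemmas.zip": "7b5c7501ccab5ca7bb27a0c84339a272",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("lv_articles_entire_collection.zip", "lv_articles_entire_collection.zip")` should return True for an intact download in the current directory. Reading in chunks keeps memory use low for the larger archives.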