dc.contributor.author | Shekhar, Ravi |
dc.contributor.author | Purver, Matthew |
dc.contributor.author | Pollak, Senja |
dc.contributor.author | Pelicon, Andraž |
dc.contributor.author | Krustok, Ivar |
dc.date.accessioned | 2021-05-24T09:19:51Z |
dc.date.available | 2021-05-24T09:19:51Z |
dc.date.issued | 2021-04-19 |
dc.identifier.uri | http://hdl.handle.net/11356/1407 |
dc.description | The dataset is an archive of reader comments from the Delfi news site from 2014-2019, containing approximately 12M comments, mostly in the Latvian language, with some in Russian.

Description of the Datasets

There are 6 CSV files:
* ``lv-comments-2014.csv`` contains **2 753 655** comments from year 2014
* ``lv-comments-2015.csv`` contains **2 221 122** comments from year 2015
* ``lv-comments-2016.csv`` contains **1 897 669** comments from year 2016
* ``lv-comments-2017.csv`` contains **1 896 083** comments from year 2017
* ``lv-comments-2018.csv`` contains **2 222 051** comments from year 2018
* ``lv-comments-2019.csv`` contains **1 421 883** comments from year 2019

**Total: 12 412 463 comments**

Columns:
* ``comment_id`` (string) - the ID of the written comment
* ``article_id`` (string) - the ID of the article for which the comment was written
* ``created_time`` (string) - the time and date of the comment
* ``subject`` (string) - the title of the comment
* ``reply_to_comment_id`` (string) - the ID of the parent comment
* ``content`` (string) - the comment itself
* ``is_anonymous`` (string) -
    * 1 if the comment was published anonymously
    * 0 if the comment was published by a registered user
* ``is_enabled`` (string) -
    * 1 if the comment was published (online)
    * 0 if it wasn’t published
    * Questionable field: not all comments have been manually moderated
    * No additional information from the moderators
* ``channel_language`` (string) - the language of the channel
    * 'nat' for Latvian
    * 'rus' for Russian
* ``create_user_id`` (string) - the user ID of the commentator
* ``modereted_by`` (string) - the ID of the moderator |
dc.language.iso | lav |
dc.language.iso | rus |
dc.publisher | Ekspress Meedia Group |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.relation.isreferencedby | https://www.aclweb.org/anthology/2021.hackashop-1.14.pdf |
dc.rights | Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-nd/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://embeddia.eu/ |
dc.subject | user comment |
dc.subject | offensive language |
dc.subject | comment moderation |
dc.title | Latvian user comment dataset 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Ravi Shekhar r.shekhar@qmul.ac.uk Queen Mary University |
contact.person | Matthew Purver m.purver@qmul.ac.uk Queen Mary University |
contact.person | Ivar Krustok ivar.krustok@ekspressmeedia.ee Ekspress Meedia Group |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
size.info | 12412463 texts |
files.count | 7 |
files.size | 3938642619 |
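The column list in the description can be turned into a small reader. The following is a minimal sketch, not the authors' tooling: it assumes each CSV is comma-delimited with a header row that uses the column names exactly as listed (including the spelling ``modereted_by``). The function names and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter

# Column names copied from the dataset description; "modereted_by" is
# spelled as it appears there.
COLUMNS = [
    "comment_id", "article_id", "created_time", "subject",
    "reply_to_comment_id", "content", "is_anonymous", "is_enabled",
    "channel_language", "create_user_id", "modereted_by",
]


def read_comments(handle):
    """Yield one dict per comment from an open text handle.

    Assumption: comma-delimited data whose first row is the header.
    All values stay strings, matching the "(string)" types in the description.
    """
    reader = csv.DictReader(handle)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    yield from reader


def publication_counts(rows):
    """Count comments per (channel_language, is_enabled) pair."""
    return Counter((row["channel_language"], row["is_enabled"]) for row in rows)


# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = (
    ",".join(COLUMNS) + "\n"
    "c1,a1,2019-01-01 10:00:00,Title,,First comment,1,1,nat,,\n"
    "c2,a1,2019-01-01 10:05:00,Re: Title,c1,A reply,0,0,rus,u7,m3\n"
    "c3,a2,2019-01-02 08:30:00,,,Another comment,1,1,nat,,\n"
)

counts = publication_counts(read_comments(io.StringIO(SAMPLE)))
print(counts[("nat", "1")], counts[("rus", "0")])  # prints: 2 1
```

For a downloaded file, the same functions could be applied to a handle opened with `open("lv-comments-2019.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption, since the record does not state it. Reading row by row keeps memory use low, which matters because the yearly files range from roughly 390 MB to 960 MB.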
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

- Name: Readme.md
- Size: 2.96 KB
- Format: Unknown
- Description: ReadMe
- MD5: d815e327a7a8c8035cbc4c6d78f236bb

- Name: lv-comments-2018.csv
- Size: 606.23 MB
- Format: CSV file
- Description: 2018 Data
- MD5: cd51401e84206a1b6194601c161acfb8

- Name: lv-comments-2016.csv
- Size: 562.37 MB
- Format: CSV file
- Description: 2016 Data
- MD5: ae6b5e8a79fc4169b725ef2464db6c82

- Name: lv-comments-2015.csv
- Size: 689.77 MB
- Format: CSV file
- Description: 2015 Data
- MD5: afc38cbc13816add34c2968f210836b9

- Name: lv-comments-2014.csv
- Size: 964.35 MB
- Format: CSV file
- Description: 2014 Data
- MD5: d950a1b6555cca0db2e44f2ab7aa585f

- Name: lv-comments-2019.csv
- Size: 392.62 MB
- Format: CSV file
- Description: 2019 Data
- MD5: 51b30d1c2742432f2d8a9c9ea33dbb27

- Name: lv-comments-2017.csv
- Size: 540.84 MB
- Format: CSV file
- Description: 2017 Data
- MD5: 15a4aeef28e92d340ea29b0904db75f8
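
The MD5 values in the file listing can be used to check a download for corruption. The sketch below is one possible way to do that with Python's standard `hashlib`; the helper names `md5_of_stream` and `verify_download` are hypothetical, and the sketch assumes the files keep their listed names after download.

```python
import hashlib
import os

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "Readme.md": "d815e327a7a8c8035cbc4c6d78f236bb",
    "lv-comments-2014.csv": "d950a1b6555cca0db2e44f2ab7aa585f",
    "lv-comments-2015.csv": "afc38cbc13816add34c2968f210836b9",
    "lv-comments-2016.csv": "ae6b5e8a79fc4169b725ef2464db6c82",
    "lv-comments-2017.csv": "15a4aeef28e92d340ea29b0904db75f8",
    "lv-comments-2018.csv": "cd51401e84206a1b6194601c161acfb8",
    "lv-comments-2019.csv": "51b30d1c2742432f2d8a9c9ea33dbb27",
}


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file at `path` matches the MD5 listed for its name.

    Assumption: the local file name is one of the names in the listing;
    an unknown name raises KeyError.
    """
    name = os.path.basename(path)
    with open(path, "rb") as stream:
        return md5_of_stream(stream) == EXPECTED_MD5[name]
```

Reading in 1 MiB chunks avoids loading a file of several hundred megabytes into memory at once. MD5 is adequate here for detecting transfer errors, but it is not a safeguard against deliberate tampering.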