dc.contributor.author Novak, Erik
dc.contributor.author Calcina, Erik
dc.contributor.author Mladenić, Dunja
dc.contributor.author Grobelnik, Marko
dc.date.accessioned 2024-02-27T12:37:58Z
dc.date.available 2024-02-27T12:37:58Z
dc.date.issued 2024-02-27
dc.identifier.uri http://hdl.handle.net/11356/1921
dc.description The OG2021 corpus contains multilingual news articles reporting on events during the 2021 Tokyo Olympics. The data set was created to evaluate a clustering algorithm. The articles were acquired via the EventRegistry service, clustered using an online news clustering algorithm, and then manually inspected and annotated by a single evaluator, who used translation services to understand the articles' content.

The corpus consists of a single file, og2021.csv, which contains 10,940 news articles grouped into 1,350 clusters. Each article has the following attributes:
- id: The ID of the news article.
- title: The title of the article.
- body: The body of the article.
- lang: The language in which the article is written. One of nine values.
- source: The news publisher's name.
- published_at: The date and time when the article was published. Publication dates range from 2021-07-01 to 2021-08-14.
- URL: The URL of the news article.
- cluster_id: The ID of the cluster the article belongs to.
dc.language.iso eng
dc.language.iso por
dc.language.iso spa
dc.language.iso fra
dc.language.iso rus
dc.language.iso deu
dc.language.iso slv
dc.language.iso ara
dc.language.iso zho
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/H2020/952026
dc.rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
dc.rights.uri https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
dc.rights.label ACA
dc.source.uri https://www.humane-ai.eu/
dc.subject news corpus
dc.subject clustering
dc.subject evaluation
dc.title The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (research)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://github.com/E3-JSI/dataset-OG2021
contact.person Erik Novak erik.novak@ijs.si Jožef Stefan Institute
sponsor European Union EC/H2020/952026 HumanE-AI-Net - HumanE AI Network euFunds info:eu-repo/grantAgreement/EC/H2020/952026
size.info 10940 articles
files.count 1
files.size 18331312
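
The attribute list in dc.description suggests a straightforward way to inspect the corpus. The following is a minimal sketch, not the authors' tooling: it assumes og2021.csv is comma-delimited, UTF-8 encoded, and has a header row whose column names match the attribute names listed above (in particular `lang` and `cluster_id`). The function names `load_articles` and `summarize_clusters` are hypothetical.

```python
import csv
from collections import Counter, defaultdict


def load_articles(path):
    """Read the corpus file into a list of dicts, one per article."""
    # Assumption: comma-delimited, UTF-8, header row naming the attributes.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize_clusters(rows):
    """Return (article count per cluster, set of languages per cluster)."""
    sizes = Counter()
    languages = defaultdict(set)
    for row in rows:
        cluster = row["cluster_id"]
        sizes[cluster] += 1
        languages[cluster].add(row["lang"])
    return sizes, dict(languages)


# Usage (assumes og2021.csv has been extracted into the working directory):
#   articles = load_articles("og2021.csv")
#   sizes, languages = summarize_clusters(articles)
#   multilingual = [c for c, langs in languages.items() if len(langs) > 1]
```

If the description is accurate, `len(articles)` should match the size.info value and `len(sizes)` the stated number of clusters; a mismatch would most likely point to a different delimiter or quoting convention than assumed here.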


 Files in this item

This item is for Academic Use and licensed under: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
Conditions: Inform Before Use, Attribution Required, Noncommercial
Name OG2021.zip
Size 17.48 MB
Format application/zip
Description OG2021 corpus
MD5 d1c1330e3e0b6e13b61ed195c6a35fda
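
The MD5 value listed for OG2021.zip can be used to check that a download is complete. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the local file path is whatever the archive was saved as.

```python
import hashlib

# Checksum as published in the file listing for OG2021.zip.
EXPECTED_MD5 = "d1c1330e3e0b6e13b61ed195c6a35fda"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


# Usage (assumes the archive was saved as OG2021.zip):
#   archive_is_intact("OG2021.zip")
```

MD5 is suitable here only as an integrity check against transfer errors, not as protection against deliberate tampering.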