Show simple item record

 
dc.contributor.author Novak, Erik
dc.contributor.author Calcina, Erik
dc.contributor.author Mladenić, Dunja
dc.contributor.author Grobelnik, Marko
dc.date.accessioned 2024-02-27T12:38:45Z
dc.date.available 2024-02-27T12:38:45Z
dc.date.issued 2024-02-27
dc.identifier.uri http://hdl.handle.net/11356/1922
dc.description The OG2021 corpus contains multilingual news articles reporting on events that took place during the 2021 Tokyo Olympics. The data set was created to evaluate the clustering algorithm. The articles were first acquired via the EventRegistry service, then clustered using an online news clustering algorithm, and finally manually inspected and annotated by a single evaluator, who used translation services to understand the articles' content.
The corpus consists of a single file, og2021.csv, which contains data on 10,940 news articles grouped into 1,350 clusters. Each article has the following attributes:
- id: The ID of the news article.
- title: The title of the article.
- lang: The language in which the article is written. One of nine values.
- source: The news publisher's name.
- published_at: The date and time when the article was published. Publication dates range from 2021-07-01 to 2021-08-14.
- URL: The URL of the news article.
- cluster_id: The ID of the cluster the article belongs to.
The data set is also published with the body attribute, but under a more restrictive licence, at http://hdl.handle.net/11356/1921.
dc.language.iso eng
dc.language.iso por
dc.language.iso spa
dc.language.iso fra
dc.language.iso rus
dc.language.iso deu
dc.language.iso slv
dc.language.iso ara
dc.language.iso zho
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/H2020/952026
dc.rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights.label PUB
dc.source.uri https://www.humane-ai.eu/
dc.subject news corpus
dc.subject clustering
dc.subject evaluation
dc.title The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (public)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
demo.uri https://github.com/E3-JSI/dataset-OG2021
contact.person Erik Novak erik.novak@ijs.si Jožef Stefan Institute
sponsor European Union EC/H2020/952026 HumanE-AI-Net - HumanE AI Network euFunds info:eu-repo/grantAgreement/EC/H2020/952026
size.info 10940 articles
files.count 1
files.size 856636
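The attribute list in dc.description suggests a straightforward way to read the corpus and group articles by cluster. The following is a minimal sketch, not part of the published record. It assumes that og2021.csv is a comma-delimited, UTF-8 file with a header row whose column names match the attribute names listed above (in particular `cluster_id` and `lang`). The function names `load_clusters` and `language_counts` are hypothetical.

```python
import csv
from collections import Counter, defaultdict


def load_clusters(path):
    """Group article rows by their cluster_id value.

    Assumes a header row containing a 'cluster_id' column,
    as listed in the record's description (not verified here).
    """
    clusters = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            clusters[row["cluster_id"]].append(row)
    return dict(clusters)


def language_counts(clusters):
    """Count articles per value of the assumed 'lang' column."""
    return Counter(
        row["lang"] for rows in clusters.values() for row in rows
    )
```

If those assumptions hold, `len(load_clusters("og2021.csv"))` should agree with the 1,350 clusters stated in the description, and `language_counts(...)` should have nine keys.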


 Files in this item

Name OG2021.zip
Size 836.56 KB
Format application/zip
Description OG2021 corpus
MD5 48d25e1fd99f73620347382b86ce24eb
 File preview
    • README.txt (1 kB)
    • og2021.csv (2 MB)
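The MD5 value listed for OG2021.zip can be used to check a downloaded copy. The sketch below is one possible way to do this with Python's standard `hashlib` module. The helper name `md5_of_file` and the local file name are assumptions, while the expected digest is copied from the record.

```python
import hashlib

# Digest as listed in this record for OG2021.zip.
EXPECTED_MD5 = "48d25e1fd99f73620347382b86ce24eb"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For a download saved as OG2021.zip in the working directory, `md5_of_file("OG2021.zip") == EXPECTED_MD5` should evaluate to `True` if the file is intact.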
