
 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Rupnik, Peter
dc.date.accessioned 2022-01-27T18:55:32Z
dc.date.available 2022-01-27T18:55:32Z
dc.date.issued 2022-01-26
dc.identifier.uri http://hdl.handle.net/11356/1461
dc.description The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the now-inactive setimes.com website, which published news in the languages of South-Eastern Europe. While the writing process of the documents is not known, they are quite likely independent translations from English. The main intended use of this dataset is discrimination between closely related languages. This is not a traditional parallel dataset, as there are no explicit links between parallel documents. Special care was taken to ensure that the training, development and testing bins each contain the same documents in all three languages, because data leakage between the bins, given the similarity of the three languages, could be problematic for benchmarking.
dc.language.iso bos
dc.language.iso hrv
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://aclanthology.org/C12-1160/
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.si/info/k-centre/
dc.subject news corpus
dc.subject language identification
dc.subject closely related languages
dc.title The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info 9258 texts
files.count 1
files.size 21132170


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name SETimes.HBS.zip
Size 20.15 MB
Format application/zip
Description Dataset archive
MD5 f0ef513a161d6120793e9271a7340f6f
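
The size and MD5 values listed in this record can be used to check a downloaded copy of the archive. The following is a minimal sketch, not part of the original record: it assumes the archive was saved locally as SETimes.HBS.zip in the working directory, and the helper names md5_of_file and verify_download are hypothetical, introduced only for illustration.

```python
import hashlib
from pathlib import Path

# Values copied from this record (files.size and MD5).
EXPECTED_SIZE = 21132170
EXPECTED_MD5 = "f0ef513a161d6120793e9271a7340f6f"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Check the byte size first (cheap), then the MD5 digest."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    archive = Path("SETimes.HBS.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it from the record first")
```

MD5 is adequate here for detecting a truncated or corrupted download, but it is not a defence against deliberate tampering.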
 File Preview  
    • SETimes.HBS.json (57 MB)
    • SETimes.HBS.txt (585 B)
