The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0

Name: The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Rupnik, Peter

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Rupnik, Peter
dc.date.accessioned	2022-01-27T18:55:32Z
dc.date.available	2022-01-27T18:55:32Z
dc.date.issued	2022-01-26
dc.identifier.uri	http://hdl.handle.net/11356/1461
dc.description	The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the now-inactive setimes.com website, which published news in the languages of South-Eastern Europe. How the documents were written is not known, but they are quite likely independent translations from English. The dataset is primarily intended for discriminating between closely related languages. It is not a traditional parallel dataset, as there are no explicit links between parallel documents. Special care was taken to ensure that the training, development and testing bins each contain the same documents in all three languages, because, given the similarity of the three languages, leakage of a document between bins could be problematic for benchmarking.
dc.language.iso	bos
dc.language.iso	hrv
dc.language.iso	srp
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://aclanthology.org/C12-1160/
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.clarin.si/info/k-centre/
dc.subject	news corpus
dc.subject	language identification
dc.subject	closely related languages
dc.title	The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor	ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info	9258 texts
files.count	1
files.size	21132170
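
The description above states that the training, development and testing bins contain the same documents in all three languages. The sketch below illustrates one way such a leakage-free split could be produced; it is not the procedure the authors used. It assumes each record carries a hypothetical doc_id field shared by the Bosnian, Croatian and Serbian versions of a document (the dataset itself has no explicit links between parallel documents), and the function names, field names and bin shares are all illustrative.

```python
import hashlib


def assign_bin(doc_id, dev_share=0.1, test_share=0.1):
    """Map a document id to 'train', 'dev' or 'test' deterministically.

    The bin depends only on doc_id (a hypothetical shared identifier),
    so every language version of a document lands in the same bin.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    if fraction < test_share:
        return "test"
    if fraction < test_share + dev_share:
        return "dev"
    return "train"


def split_records(records):
    """Group records into bins by their (assumed) doc_id field."""
    bins = {"train": [], "dev": [], "test": []}
    for record in records:
        bins[assign_bin(record["doc_id"])].append(record)
    return bins


# Illustrative records: three language versions per hypothetical document.
records = [
    {"doc_id": f"doc-{i}", "lang": lang, "text": "..."}
    for i in range(50)
    for lang in ("bos", "hrv", "srp")
]
bins = split_records(records)
```

Because the bin is a function of the document identifier alone, no document can appear in one bin in Croatian and in another in Serbian, which is the property the description identifies as important for benchmarking.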