dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Rupnik, Peter |
dc.date.accessioned | 2022-01-27T18:55:32Z |
dc.date.available | 2022-01-27T18:55:32Z |
dc.date.issued | 2022-01-26 |
dc.identifier.uri | http://hdl.handle.net/11356/1461 |
dc.description | The SETimes.HBS dataset consists of parallel documents written in Bosnian, Croatian and Serbian, harvested from the now-inactive setimes.com website, which published news in the languages of South-Eastern Europe. How the documents were written is not known, but they are quite likely independent translations from English. The main intended use of this dataset is discrimination between closely related languages. It is not a traditional parallel dataset, as there are no explicit links between parallel documents. Special care was taken that the training, development and testing bins of the dataset each contain the same documents in all three languages, since data leakage between the three bins, given the similarity of the three languages, could be problematic for benchmarking. |
dc.language.iso | bos |
dc.language.iso | hrv |
dc.language.iso | srp |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://aclanthology.org/C12-1160/ |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.si/info/k-centre/ |
dc.subject | news corpus |
dc.subject | language identification |
dc.subject | closely related languages |
dc.title | The news dataset for discriminating between Bosnian, Croatian and Serbian SETimes.HBS 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other |
sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
size.info | 9258 texts |
files.count | 1 |
files.size | 21132170 |
Files in this item
Name | SETimes.HBS.zip |
Size | 20.15 MB |
Format | application/zip |
Description | Dataset archive |
MD5 | f0ef513a161d6120793e9271a7340f6f |
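The MD5 checksum and byte size listed for SETimes.HBS.zip can be used to check that a downloaded copy is intact. The following is a minimal sketch, assuming the archive has been saved under its listed name in the current directory; the function names `md5_of_file` and `verify_archive` are hypothetical helpers introduced for illustration, not part of any tooling shipped with the dataset.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 and files.size fields).
EXPECTED_MD5 = "f0ef513a161d6120793e9271a7340f6f"
EXPECTED_SIZE = 21132170  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file matches both the expected size and MD5."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    archive = Path("SETimes.HBS.zip")
    if archive.exists():
        print("OK" if verify_archive(archive) else "MISMATCH")
    else:
        print("archive not found")
```

The size check runs first because it is cheap and catches truncated downloads before the full file is hashed.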