dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Rupnik, Peter |
dc.date.accessioned | 2022-01-27T18:55:44Z |
dc.date.available | 2022-01-27T18:55:44Z |
dc.date.issued | 2022-01-26 |
dc.identifier.uri | http://hdl.handle.net/11356/1482 |
dc.description | The Twitter-HBS dataset consists of Twitter users, their tweets, and a label for each user's predominantly used language: Bosnian, Croatian, Montenegrin, or Serbian. Because the label encodes only the language a user predominantly uses, the dataset also contains tweets in other languages (mainly English). The dataset is primarily intended for discriminating between closely related languages at the level of a Twitter user, not of a single tweet. The only pre-processing applied to the tweet texts is transliteration from the Cyrillic into the Latin script, so that user classification quality can be measured regardless of the script used. |
dc.language.iso | bos |
dc.language.iso | hrv |
dc.language.iso | cnr |
dc.language.iso | srp |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://www.informatica.si/index.php/informatica/article/view/746 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.si/info/k-centre/ |
dc.subject | language identification |
dc.subject | closely related languages |
dc.title | The Twitter user dataset for discriminating between Bosnian, Croatian, Montenegrin and Serbian Twitter-HBS 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić; nikola.ljubesic@ijs.si; Jožef Stefan Institute |
sponsor | Connecting Europe Facility (CEF) Telecom; INEA/CEF/ICT/A2020/2278341; MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages; Other |
sponsor | ARRS (Slovenian Research Agency); N6-0099; LiLaH: Linguistic Landscape of Hate Speech; nationalFunds |
size.info | 614 items |
size.info | 390268 texts |
files.count | 1 |
files.size | 13605382 bytes |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name: Twitter-HBS.zip
- Size: 12.98 MB
- Format: application/zip
- Description: Dataset archive
- MD5: 03fe6fb00bd5b7d98c575b0858e9e832
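
The record lists an MD5 checksum and a file size for the archive, so a downloaded copy can be checked before use. The following is a minimal sketch of such a check, not part of the dataset itself. The local path `Twitter-HBS.zip` and the helper names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes that the `files.size` value is given in bytes.

```python
import hashlib
from pathlib import Path

# Values copied from the record above.
EXPECTED_MD5 = "03fe6fb00bd5b7d98c575b0858e9e832"
EXPECTED_SIZE = 13605382  # assumed to be bytes (files.size field)


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file's size and MD5 both match the expected values."""
    path = Path(path)
    return path.stat().st_size == expected_size and md5_of_file(path) == expected_md5


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    archive = Path("Twitter-HBS.zip")
    if archive.exists():
        print("OK" if verify_download(archive) else "MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually points to an incomplete or corrupted download; MD5 is suitable here as an integrity check only, not as protection against deliberate tampering.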