dc.contributor.author Ljubešić, Nikola
dc.contributor.author Rupnik, Peter
dc.date.accessioned 2022-01-27T18:55:44Z
dc.date.available 2022-01-27T18:55:44Z
dc.date.issued 2022-01-26
dc.identifier.uri http://hdl.handle.net/11356/1482
dc.description The Twitter-HBS dataset consists of Twitter users, their tweets, and a label for each user's predominantly used language: Bosnian, Croatian, Montenegrin, or Serbian. Because the label encodes only the predominant language of a user, the dataset also contains tweets in other languages (mainly English). The main intended use of this dataset is discrimination between closely related languages at the level of a Twitter user, not a single tweet. The only pre-processing performed on the tweet texts is transliteration from Cyrillic into Latin script, so that the dataset measures the quality of user classification regardless of the script used.
dc.language.iso bos
dc.language.iso hrv
dc.language.iso cnr
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://www.informatica.si/index.php/informatica/article/view/746
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.si/info/k-centre/
dc.subject Twitter
dc.subject language identification
dc.subject closely related languages
dc.title The Twitter user dataset for discriminating between Bosnian, Croatian, Montenegrin and Serbian Twitter-HBS 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info 614 items
size.info 390268 texts
files.count 1
files.size 13605382
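
The description notes that the tweet texts were transliterated from Cyrillic into Latin script. A minimal sketch of such a transliteration is given below, assuming the standard one-to-one Serbian Cyrillic to Latin letter mapping. The function name `transliterate` and the table `CYR_TO_LAT` are hypothetical, and this is not necessarily the procedure the dataset authors used.

```python
# Illustrative sketch only: a hypothetical helper, not the dataset authors' tool.
# Assumes the standard 30-letter Serbian Cyrillic alphabet.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def transliterate(text: str) -> str:
    """Replace Serbian Cyrillic letters with Latin ones; leave other characters as-is."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            lat = CYR_TO_LAT[lower]
            # Uppercase Cyrillic letters map to a capitalised Latin form,
            # so a digraph such as "Љ" becomes "Lj".
            out.append(lat.capitalize() if ch != lower else lat)
        else:
            out.append(ch)
    return "".join(out)
```

Text that is already in Latin script, including the English tweets mentioned in the description, passes through unchanged.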


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name Twitter-HBS.zip
Size 12.98 MB
Format application/zip
Description Dataset archive
MD5 03fe6fb00bd5b7d98c575b0858e9e832
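
The MD5 value listed for the archive can be used to check that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the default path assumes the archive was saved under its listed name in the current directory.

```python
import hashlib

# MD5 checksum as listed in this record for Twitter-HBS.zip.
EXPECTED_MD5 = "03fe6fb00bd5b7d98c575b0858e9e832"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str = "Twitter-HBS.zip") -> bool:
    """Compare the file's digest with the checksum listed in the record."""
    return md5_of_file(path) == EXPECTED_MD5
```

If `archive_is_intact()` returns `False`, the file differs from the one described here and should be downloaded again.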
 File Preview
    • Twitter-HBS.txt (659 B)
    • Twitter-HBS.json (34 MB)
