The Twitter user dataset for discriminating between Bosnian, Croatian, Montenegrin and Serbian Twitter-HBS 1.0

Name: The Twitter user dataset for discriminating between Bosnian, Croatian, Montenegrin and Serbian Twitter-HBS 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Rupnik, Peter

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Rupnik, Peter
dc.date.accessioned	2022-01-27T18:55:44Z
dc.date.available	2022-01-27T18:55:44Z
dc.date.issued	2022-01-26
dc.identifier.uri	http://hdl.handle.net/11356/1482
dc.description	The Twitter-HBS dataset consists of Twitter users, their tweets, and a label for each user's predominantly used language: Bosnian, Croatian, Montenegrin, or Serbian. Because the label encodes only a user's predominant language, the tweets also include some in other languages (mainly English). The dataset is intended primarily for discriminating between closely related languages at the level of a Twitter user, not a single tweet. The only pre-processing applied to the tweet texts is transliteration from the Cyrillic into the Latin script, so that the dataset measures the quality of user classification regardless of the script used.
dc.language.iso	bos
dc.language.iso	hrv
dc.language.iso	cnr
dc.language.iso	srp
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://www.informatica.si/index.php/informatica/article/view/746
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.clarin.si/info/k-centre/
dc.subject	Twitter
dc.subject	language identification
dc.subject	closely related languages
dc.title	The Twitter user dataset for discriminating between Bosnian, Croatian, Montenegrin and Serbian Twitter-HBS 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor	ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info	614 items
size.info	390268 texts
files.count	1
files.size	13605382