dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Zupan, Katja |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Erjavec, Tomaž |
dc.date.accessioned | 2016-07-27T12:15:05Z |
dc.date.available | 2016-09-01T23:00:07Z |
dc.date.issued | 2016-09-19 |
dc.identifier.uri | http://hdl.handle.net/11356/1068 |
dc.description | Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/)

The data are split into the "token" folder (experiment on normalising individual tokens) and the "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample: the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines.

There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850–1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)

The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (https://nl.ijs.si/janes/english/).

The text in the files has been split by inserting spaces between characters, with the underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark ('¿') character. |
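The character-level encoding described above can be undone with a few lines of Python. This is a sketch only, not part of the distribution; the `decode` function name and the example line are ours:

```python
# Minimal sketch (not part of the dataset) for decoding KonvNormSl lines.
# Per the record's description: characters are separated by spaces, '_'
# stands in for the original space character, and '¿' marks tokens
# excluded from normalisation (URLs, hashtags, ...).

def decode(line: str) -> str:
    """Rejoin the space-separated characters and restore spaces from '_'."""
    return "".join(line.split()).replace("_", " ")

# The *.orig.txt and *.norm.txt files are aligned line by line, so a pair
# can be read in parallel, e.g.:
#
#   with open("x.orig.txt") as o, open("x.norm.txt") as n:
#       for orig, norm in zip(o, n):
#           print(decode(orig.strip()), "->", decode(norm.strip()))

print(decode("k a j _ d o g a j a"))  # hypothetical encoded line -> "kaj dogaja"
```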
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | word normalisation |
dc.subject | historical language |
dc.subject | computer-mediated communication |
dc.subject | experimental data |
dc.subject | manual annotation |
dc.title | Dataset of normalised Slovene text KonvNormSl 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds |
size.info | 427000 tokens |
files.count | 1 |
files.size | 4787953 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: konvNormSl.zip
- Size: 4.57 MB
- Format: application/zip
- Description: Dataset
- MD5: 98a809350431cce453224e842a413212
- konvNormSl
  - README.txt (1 kB)
  - token
    - dev
      - goo300k-gaj.token.dev.norm.txt (302 kB)
      - tweet-L3.token.dev.norm.txt (57 kB)
      - tweet-L1.token.dev.orig.txt (57 kB)
      - goo300k-gaj.token.dev.orig.txt (303 kB)
      - tweet-L3.token.dev.orig.txt (56 kB)
      - goo300k-bohoric.token.dev.norm.txt (82 kB)
      - tweet-L1.token.dev.norm.txt (57 kB)
      - goo300k-bohoric.token.dev.orig.txt (85 kB)
    - train
      - goo300k-bohoric.token.train.orig.txt (733 kB)
      - tweet-L1.token.train.orig.txt (452 kB)
      - goo300k-gaj.token.train.norm.txt (2 MB)
      - tweet-L3.token.train.norm.txt (484 kB)
      - goo300k-gaj.token.train.orig.txt (2 MB)
      - tweet-L3.token.train.orig.txt (471 kB)
      - goo300k-bohoric.token.train.norm.txt (705 kB)
      - tweet-L1.token.train.norm.txt (454 kB)
    - test
      - tweet-L3.token.test.orig.txt (58 kB)
      - goo300k-gaj.token.test.norm.txt (314 kB)
      - goo300k-gaj.token.test.orig.txt (314 kB)
      - tweet-L1.token.test.norm.txt (58 kB)
      - goo300k-bohoric.token.test.norm.txt (85 kB)
      - tweet-L3.token.test.norm.txt (60 kB)
      - tweet-L1.token.test.orig.txt (58 kB)
      - goo300k-bohoric.token.test.orig.txt (88 kB)
  - segment
    - dev
      - goo300k-gaj.segment.dev.norm.txt (255 kB)
      - tweet-L3.segment.dev.norm.txt (48 kB)
      - goo300k-bohoric.segment.dev.norm.txt (69 kB)
      - goo300k-gaj.segment.dev.orig.txt (256 kB)
      - tweet-L3.segment.dev.orig.txt (47 kB)
      - goo300k-bohoric.segment.dev.orig.txt (72 kB)
      - tweet-L1.segment.dev.norm.txt (48 kB)
      - tweet-L1.segment.dev.orig.txt (48 kB)
    - train
      - goo300k-gaj.segment.train.norm.txt (1 MB)
      - goo300k-bohoric.segment.train.norm.txt (593 kB)
      - tweet-L3.segment.train.orig.txt (394 kB)
      - tweet-L1.segment.train.orig.txt (385 kB)
      - goo300k-gaj.segment.train.orig.txt (1 MB)
      - goo300k-bohoric.segment.train.orig.txt (621 kB)
      - tweet-L3.segment.train.norm.txt (407 kB)
      - tweet-L1.segment.train.norm.txt (386 kB)
    - test
      - tweet-L3.segment.test.orig.txt (48 kB)
      - goo300k-bohoric.segment.test.orig.txt (74 kB)
      - tweet-L1.segment.test.norm.txt (49 kB)
      - tweet-L1.segment.test.orig.txt (49 kB)
      - goo300k-gaj.segment.test.norm.txt (264 kB)
      - tweet-L3.segment.test.norm.txt (49 kB)
      - goo300k-bohoric.segment.test.norm.txt (71 kB)
      - goo300k-gaj.segment.test.orig.txt (264 kB)