Ljubešić, Nikola; Zupan, Katja; Fišer, Darja; Erjavec, Tomaž 2016-07-27T12:15:05Z 2016-09-01T23:00:07Z 2016-09-19
dc.description Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany.
Data are split into the "token" folder (experiment on normalising individual tokens) and the "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample: the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines.
There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850-1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)
The goo300k data come from, while the tweet data originate from the JANES project.
The text in the files has been split by inserting spaces between characters, with the underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark character '¿'.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.label PUB
dc.subject word normalisation
dc.subject historical language
dc.subject computer-mediated communication
dc.subject experimental data
dc.subject manual annotation
dc.title Dataset of normalised Slovene text KonvNormSl 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds 427000 tokens
files.count 1
files.size 4787953
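The character-level encoding described in dc.description (spaces between characters, underscore for the original space, '¿' for masked tokens) can be undone with a few lines of Python. This is only a minimal sketch: the function names `decode_line` and `encode_line` are hypothetical and not part of the dataset, and it assumes that characters are separated by exactly one space and that a literal underscore never occurs in the source text.

```python
# Hypothetical helpers, not shipped with KonvNormSl.
PLACEHOLDER = "¿"  # masks tokens not relevant for normalisation (e.g. URLs, hashtags)


def decode_line(line: str) -> str:
    """Join space-separated characters and turn '_' back into a space.

    Assumes single-space separators and no literal underscores in the text.
    """
    chars = line.rstrip("\n").split(" ")
    return "".join(" " if c == "_" else c for c in chars)


def encode_line(text: str) -> str:
    """Inverse of decode_line: one character per field, '_' for each space."""
    return " ".join("_" if c == " " else c for c in text)


if __name__ == "__main__":
    sample = "d a n e s _ j e _ ¿"
    print(decode_line(sample))                 # danes je ¿
    print(PLACEHOLDER in decode_line(sample))  # True
```

Masked tokens survive decoding unchanged, so they can be filtered out or kept as a single symbol, depending on the experiment.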

 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
4.57 MB
  • konvNormSl
    • README.txt (1 kB)
    • token
      • dev
        • kB
        • kB
        • kB
        • kB
        • kB
        • kB
        • kB
        • kB
      • train
        • goo300k-bohoric.token.train.orig.txt (733 kB)
        • tweet-L1.token.train.orig.txt (452 kB)
        • goo300k-gaj.token.train.norm.txt (2 MB)
        • tweet-L3.token.train.norm.txt (484 kB)
        • goo300k-gaj.token.train.orig.txt (2 MB)
        • tweet-L3.token.train.orig.txt (471 kB)
        • goo300k-bohoric.token.train.norm.txt (705 kB)
        • tweet-L1.token.train.norm.txt (454 kB)
      • test
        • tweet-L3.token.test.orig.txt (58 kB)
        • goo300k-gaj.token.test.norm.txt (314 kB)
        • goo300k-gaj.token.test.orig.txt (314 kB)
        • tweet-L1.token.test.norm.txt (58 kB)
        • goo300k-bohoric.token.test.norm.txt (85 kB)
        • tweet-L3.token.test.norm.txt (60 kB)
        • tweet-L1.token.test.orig.txt (58 kB)
        • goo300k-bohoric.token.test.orig.txt (88 kB)
    • segment
      • dev
        • kB
        • kB
        • kB
        • kB
        • kB
        • kB
        • kB
        • kB
      • train
        • goo300k-gaj.segment.train.norm.txt (1 MB)
        • goo300k-bohoric.segment.train.norm.txt (593 kB)
        • tweet-L3.segment.train.orig.txt (394 kB)
        • tweet-L1.segment.train.orig.txt (385 kB)
        • goo300k-gaj.segment.train.orig.txt (1 MB)
        • goo300k-bohoric.segment.train.orig.txt (621 kB)
        • tweet-L3.segment.train.norm.txt (407 kB)
        • tweet-L1.segment.train.norm.txt (386 kB)
      • test
        • tweet-L3.segment.test.orig.txt (48 kB)
        • goo300k-bohoric.segment.test.orig.txt (74 kB)
        • tweet-L1.segment.test.norm.txt (49 kB)
        • tweet-L1.segment.test.orig.txt (49 kB)
        • goo300k-gaj.segment.test.norm.txt (264 kB)
        • tweet-L3.segment.test.norm.txt (49 kB)
        • goo300k-bohoric.segment.test.norm.txt (71 kB)
        • goo300k-gaj.segment.test.orig.txt (264 kB)
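
The file names in the listing follow the pattern `<dataset>.<experiment>.<split>.{orig,norm}.txt` inside `<experiment>/<split>/`, and the description states that each pair is aligned by lines. A possible way to load one pair is sketched below. The helper names `pair_paths` and `read_aligned` are hypothetical, UTF-8 is an assumed encoding (the record does not state it), and `root` is assumed to point at the unpacked `konvNormSl` folder.

```python
from pathlib import Path


def pair_paths(root, dataset, experiment, split):
    """Build the .orig/.norm paths using the naming seen in the file listing."""
    folder = Path(root) / experiment / split
    stem = f"{dataset}.{experiment}.{split}"
    return folder / f"{stem}.orig.txt", folder / f"{stem}.norm.txt"


def _lines(path):
    # UTF-8 is an assumption; only the trailing newline is stripped.
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


def read_aligned(root, dataset, experiment, split):
    """Return (original, normalised) pairs, one per aligned line."""
    orig_path, norm_path = pair_paths(root, dataset, experiment, split)
    orig, norm = _lines(orig_path), _lines(norm_path)
    if len(orig) != len(norm):
        raise ValueError(
            f"line counts differ: {len(orig)} in {orig_path.name}, "
            f"{len(norm)} in {norm_path.name}"
        )
    return list(zip(orig, norm))
```

For example, `read_aligned("konvNormSl", "tweet-L3", "token", "test")` would pair each line of `tweet-L3.token.test.orig.txt` with the same line of `tweet-L3.token.test.norm.txt`. The explicit length check is there to catch a truncated or mismatched download rather than silently dropping lines.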
