Show simple item record

 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Zupan, Katja
dc.contributor.author Fišer, Darja
dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2016-07-27T12:15:05Z
dc.date.available 2016-09-01T23:00:07Z
dc.date.issued 2016-09-19
dc.identifier.uri http://hdl.handle.net/11356/1068
dc.description Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/)

Data are split into the "token" folder (experiment on normalising individual tokens) and the "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains "train", "dev" and "test" subfolders. Each subfolder contains two files per sample: the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines.

There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (< 1850)
- goo300k-gaj: historical Slovene, easy case (1850–1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)

The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (https://nl.ijs.si/janes/english/).

The text in the files has been split by inserting spaces between characters, with an underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been replaced by the inverted question mark character ('¿').
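The character-level encoding described above is straightforward to undo when loading the data; a minimal sketch in Python (the helper names are my own, not part of the dataset):

```python
def join_chars(line: str) -> str:
    """Undo the character-splitting used in the dataset files:
    drop the spaces inserted between characters and map the
    underscore placeholder back to a real space.
    The '¿' placeholder for non-normalised tokens is left as-is."""
    return "".join(" " if tok == "_" else tok for tok in line.split())

def read_pairs(orig_path: str, norm_path: str):
    """Yield (original, normalised) text pairs; the *.orig.txt and
    *.norm.txt files are aligned line by line."""
    with open(orig_path, encoding="utf-8") as fo, \
         open(norm_path, encoding="utf-8") as fn:
        for orig, norm in zip(fo, fn):
            yield join_chars(orig), join_chars(norm)

# Example: the encoded line "k a j _ d e l a š" decodes to "kaj delaš".
```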
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.subject word normalisation
dc.subject historical language
dc.subject computer-mediated communication
dc.subject experimental data
dc.subject manual annotation
dc.title Dataset of normalised Slovene text KonvNormSl 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
size.info 427000 tokens
files.count 1
files.size 4787953


Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name: konvNormSl.zip
Size: 4.57 MB
Format: application/zip
Description: Dataset
MD5: 98a809350431cce453224e842a413212
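The published MD5 checksum can be used to verify the integrity of the download; a minimal sketch in Python (the local file name is an assumption):

```python
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so large downloads do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

# Assumed local file name; compare against the checksum in the record:
# md5_of("konvNormSl.zip") == "98a809350431cce453224e842a413212"
```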
File preview:
  • konvNormSl
    • README.txt (1 kB)
    • token
      • dev
        • goo300k-bohoric.token.dev.norm.txt (82 kB)
        • goo300k-bohoric.token.dev.orig.txt (85 kB)
        • goo300k-gaj.token.dev.norm.txt (302 kB)
        • goo300k-gaj.token.dev.orig.txt (303 kB)
        • tweet-L1.token.dev.norm.txt (57 kB)
        • tweet-L1.token.dev.orig.txt (57 kB)
        • tweet-L3.token.dev.norm.txt (57 kB)
        • tweet-L3.token.dev.orig.txt (56 kB)
      • train
        • goo300k-bohoric.token.train.norm.txt (705 kB)
        • goo300k-bohoric.token.train.orig.txt (733 kB)
        • goo300k-gaj.token.train.norm.txt (2 MB)
        • goo300k-gaj.token.train.orig.txt (2 MB)
        • tweet-L1.token.train.norm.txt (454 kB)
        • tweet-L1.token.train.orig.txt (452 kB)
        • tweet-L3.token.train.norm.txt (484 kB)
        • tweet-L3.token.train.orig.txt (471 kB)
      • test
        • goo300k-bohoric.token.test.norm.txt (85 kB)
        • goo300k-bohoric.token.test.orig.txt (88 kB)
        • goo300k-gaj.token.test.norm.txt (314 kB)
        • goo300k-gaj.token.test.orig.txt (314 kB)
        • tweet-L1.token.test.norm.txt (58 kB)
        • tweet-L1.token.test.orig.txt (58 kB)
        • tweet-L3.token.test.norm.txt (60 kB)
        • tweet-L3.token.test.orig.txt (58 kB)
    • segment
      • dev
        • goo300k-bohoric.segment.dev.norm.txt (69 kB)
        • goo300k-bohoric.segment.dev.orig.txt (72 kB)
        • goo300k-gaj.segment.dev.norm.txt (255 kB)
        • goo300k-gaj.segment.dev.orig.txt (256 kB)
        • tweet-L1.segment.dev.norm.txt (48 kB)
        • tweet-L1.segment.dev.orig.txt (48 kB)
        • tweet-L3.segment.dev.norm.txt (48 kB)
        • tweet-L3.segment.dev.orig.txt (47 kB)
      • train
        • goo300k-bohoric.segment.train.norm.txt (593 kB)
        • goo300k-bohoric.segment.train.orig.txt (621 kB)
        • goo300k-gaj.segment.train.norm.txt (1 MB)
        • goo300k-gaj.segment.train.orig.txt (1 MB)
        • tweet-L1.segment.train.norm.txt (386 kB)
        • tweet-L1.segment.train.orig.txt (385 kB)
        • tweet-L3.segment.train.norm.txt (407 kB)
        • tweet-L3.segment.train.orig.txt (394 kB)
      • test
        • goo300k-bohoric.segment.test.norm.txt (71 kB)
        • goo300k-bohoric.segment.test.orig.txt (74 kB)
        • goo300k-gaj.segment.test.norm.txt (264 kB)
        • goo300k-gaj.segment.test.orig.txt (264 kB)
        • tweet-L1.segment.test.norm.txt (49 kB)
        • tweet-L1.segment.test.orig.txt (49 kB)
        • tweet-L3.segment.test.norm.txt (49 kB)
        • tweet-L3.segment.test.orig.txt (48 kB)
