dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Zupan, Katja |
dc.contributor.author | Fišer, Darja |
dc.contributor.author | Erjavec, Tomaž |
dc.date.accessioned | 2016-07-27T12:15:05Z |
dc.date.available | 2016-09-01T23:00:07Z |
dc.date.issued | 2016-09-19 |
dc.identifier.uri | http://hdl.handle.net/11356/1068 |
dc.description | Data used in the experiments described in: Nikola Ljubešić, Katja Zupan, Darja Fišer and Tomaž Erjavec: Normalising Slovene data: historical texts vs. user-generated content. Proceedings of KONVENS 2016, September 19–21, 2016, Bochum, Germany. https://www.linguistics.rub.de/konvens16/pub/19_konvensproc.pdf (https://www.linguistics.rub.de/konvens16/)

The data are split into the "token" folder (experiment on normalising individual tokens) and the "segment" folder (experiment on normalising whole segments of text, i.e. sentences or tweets). Each experiment folder contains the "train", "dev" and "test" subfolders. Each subfolder contains two files for each sample: the original data (*.orig.txt) and the data with hand-normalised words (*.norm.txt). The files are aligned by lines.

There are four datasets:
- goo300k-bohoric: historical Slovene, hard case (<1850)
- goo300k-gaj: historical Slovene, easy case (1850–1900)
- tweet-L3: Slovene tweets, hard case (non-standard language)
- tweet-L1: Slovene tweets, easy case (mostly standard language)

The goo300k data come from http://hdl.handle.net/11356/1025, while the tweet data originate from the JANES project (https://nl.ijs.si/janes/english/).

The text in the files has been split by inserting spaces between characters, with the underscore (_) substituting the space character. Tokens not relevant for normalisation (e.g. URLs, hashtags) have been substituted by the inverted question mark ('¿') character. |
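The character-level encoding described above can be undone with a few lines of Python. This is a sketch only, not part of the distribution; the `decode` function name and the example line are ours:

```python
# Minimal sketch (not part of the dataset) for decoding KonvNormSl lines.
# Per the record's description: characters are separated by spaces, '_'
# stands in for the original space character, and '¿' marks tokens
# excluded from normalisation (URLs, hashtags, ...).

def decode(line: str) -> str:
    """Rejoin the space-separated characters and restore spaces from '_'."""
    return "".join(line.split()).replace("_", " ")

# The *.orig.txt and *.norm.txt files are aligned line by line, so a pair
# can be read in parallel, e.g.:
#
#   with open("x.orig.txt") as o, open("x.norm.txt") as n:
#       for orig, norm in zip(o, n):
#           print(decode(orig.strip()), "->", decode(norm.strip()))

print(decode("k a j _ d o g a j a"))  # hypothetical encoded line -> "kaj dogaja"
```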
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | word normalisation |
dc.subject | historical language |
dc.subject | computer-mediated communication |
dc.subject | experimental data |
dc.subject | manual annotation |
dc.title | Dataset of normalised Slovene text KonvNormSl 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds |
size.info | 427000 tokens |
files.count | 1 |
files.size | 4787953 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: konvNormSl.zip
- Size: 4.57 MB
- Format: application/zip
- Description: Dataset
- MD5: 98a809350431cce453224e842a413212
- konvNormSl
  - README.txt (1 kB)
  - token
    - dev
      - goo300k-gaj.token.dev.norm.txt (302 kB)
      - tweet-L3.token.dev.norm.txt (57 kB)
      - tweet-L1.token.dev.orig.txt (57 kB)
      - goo300k-gaj.token.dev.orig.txt (303 kB)
      - tweet-L3.token.dev.orig.txt (56 kB)
      - goo300k-bohoric.token.dev.norm.txt (82 kB)
      - tweet-L1.token.dev.norm.txt (57 kB)
      - goo300k-bohoric.token.dev.orig.txt (85 kB)
    - train
      - goo300k-bohoric.token.train.orig.txt (733 kB)
      - tweet-L1.token.train.orig.txt (452 kB)
      - goo300k-gaj.token.train.norm.txt (2 MB)
      - tweet-L3.token.train.norm.txt (484 kB)
      - goo300k-gaj.token.train.orig.txt (2 MB)
      - tweet-L3.token.train.orig.txt (471 kB)
      - goo300k-bohoric.token.train.norm.txt (705 kB)
      - tweet-L1.token.train.norm.txt (454 kB)
    - test
      - tweet-L3.token.test.orig.txt (58 kB)
      - goo300k-gaj.token.test.norm.txt (314 kB)
      - goo300k-gaj.token.test.orig.txt (314 kB)
      - tweet-L1.token.test.norm.txt (58 kB)
      - goo300k-bohoric.token.test.norm.txt (85 kB)
      - tweet-L3.token.test.norm.txt (60 kB)
      - tweet-L1.token.test.orig.txt (58 kB)
      - goo300k-bohoric.token.test.orig.txt (88 kB)
  - segment
    - dev
      - goo300k-gaj.segment.dev.norm.txt (255 kB)
      - tweet-L3.segment.dev.norm.txt (48 kB)
      - goo300k-bohoric.segment.dev.norm.txt (69 kB)
      - goo300k-gaj.segment.dev.orig.txt (256 kB)
      - tweet-L3.segment.dev.orig.txt (47 kB)
      - goo300k-bohoric.segment.dev.orig.txt (72 kB)
      - tweet-L1.segment.dev.norm.txt (48 kB)
      - tweet-L1.segment.dev.orig.txt (48 kB)
    - train
      - goo300k-gaj.segment.train.norm.txt (1 MB)
      - goo300k-bohoric.segment.train.norm.txt (593 kB)
      - tweet-L3.segment.train.orig.txt (394 kB)
      - tweet-L1.segment.train.orig.txt (385 kB)
      - goo300k-gaj.segment.train.orig.txt (1 MB)
      - goo300k-bohoric.segment.train.orig.txt (621 kB)
      - tweet-L3.segment.train.norm.txt (407 kB)
      - tweet-L1.segment.train.norm.txt (386 kB)
    - test
      - tweet-L3.segment.test.orig.txt (48 kB)
      - goo300k-bohoric.segment.test.orig.txt (74 kB)
      - tweet-L1.segment.test.norm.txt (49 kB)
      - tweet-L1.segment.test.orig.txt (49 kB)
      - goo300k-gaj.segment.test.norm.txt (264 kB)
      - tweet-L3.segment.test.norm.txt (49 kB)
      - goo300k-bohoric.segment.test.norm.txt (71 kB)
      - goo300k-gaj.segment.test.orig.txt (264 kB)