Slovenian translation corpus Spook 1.1

Name: Slovenian translation corpus Spook 1.1
License: https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0

Vintar, Špela; Gorjanc, Vojko; Erjavec, Tomaž; Fišer, Darja; Mezeg, Adriana

dc.contributor.author	Vintar, Špela
dc.contributor.author	Gorjanc, Vojko
dc.contributor.author	Erjavec, Tomaž
dc.contributor.author	Fišer, Darja
dc.contributor.author	Mezeg, Adriana
dc.date.accessioned	2026-03-11T11:45:45Z
dc.date.available	2026-03-11T11:45:45Z
dc.date.issued	2026-03-10
dc.identifier.uri	http://hdl.handle.net/11356/2077
dc.description	The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The first type comprises foreign-language texts in French, English, German, and Italian. The second type comprises the corresponding texts in Slovenian. These two types of texts are aligned on the sentence level and are comparable in terms of genre and time of publication. The third type consists of original Slovenian texts and is comparable to the Slovenian part of the parallel corpora. The transcription of the texts and the paragraph-level alignment of the originals and translations were performed manually. The texts were automatically tokenised, sentence-segmented, PoS-tagged and lemmatised in 2012. Linguistic processing of the Slovenian texts was performed by ToTaLe (which used TnT for PoS tagging and CLOG for lemmatisation), while the German, English, French and Italian texts were analysed by TreeTagger. The PoS tags in the corpus are given in two variants. The first variant is the tag as output by the tagger: the MULTEXT-East tag for Slovenian (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html) and the TreeTagger tag of the respective language for the other languages. The second variant is a mapping of the original tags to the Spook tagset (https://nl.ijs.si/spook/msd/html-en/). Version 1.0 was released in the scope of the project in 2021 but was available only to project participants. This version updates the TEI encoding of the corpus and changes the vertical files so that they also include the Spook tags as attribute-value pairs. It also removes the parallel fiction part of the corpus (2 x 35 texts) due to copyright considerations. Note, however, that these texts are included in the concordancer-mounted corpus.
dc.language.iso	slv
dc.language.iso	fra
dc.language.iso	eng
dc.language.iso	deu
dc.language.iso	ita
dc.publisher	Faculty of Arts, University of Ljubljana
dc.relation.isreferencedby	https://doi.org/10.4312/9789612375652
dc.rights	CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
dc.rights.uri	https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
dc.rights.label	ACA
dc.source.uri	https://nl.ijs.si/spook/
dc.subject	parallel corpus
dc.subject	TEI
dc.subject	manual translation
dc.title	Slovenian translation corpus Spook 1.1
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Špela Vintar spela.vintar@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) J6-2009-0581 Slovenian translation studies - resources and research nationalFunds
size.info	713 texts
size.info	10567 translationUnits
size.info	375494 words
files.count	2
files.size	62434366
featuredService.noske	search foreign language parallel texts|https://www.clarin.si/ske/#dashboard?corpname=spook_xx
featuredService.noske	search Slovenian parallel texts|https://www.clarin.si/ske/#dashboard?corpname=spook_xx_sl
featuredService.noske	search Slovenian reference texts|https://www.clarin.si/ske/#dashboard?corpname=spook_sl