dc.contributor.author Vintar, Špela
dc.contributor.author Gorjanc, Vojko
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Mezeg, Adriana
dc.date.accessioned 2026-03-11T11:45:45Z
dc.date.available 2026-03-11T11:45:45Z
dc.date.issued 2026-03-10
dc.identifier.uri http://hdl.handle.net/11356/2077
dc.description The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The first comprises foreign language texts in French, English, German, and Italian. The second comprises the corresponding texts in Slovenian. These two types of texts are aligned on the sentence level and comparable in terms of genre and time of publication. The third type consists of original Slovenian texts and is comparable to the Slovenian part of the parallel corpora. The transcription of the texts and the paragraph-level alignment of the originals/translations were performed manually. The texts were automatically tokenised, sentence-segmented, PoS-tagged and lemmatised in 2012. Linguistic processing of Slovenian texts was performed by ToTaLe (which used TnT for PoS tagging and CLOG for lemmatisation), while German, English, French and Italian texts were analysed by TreeTagger. The PoS tags in the corpus are given in two variants. The first variant is the tag as output by the tagger: the MULTEXT-East tag for Slovenian (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), and the language-specific TreeTagger tags for the other languages. The second variant is a mapping of the original tags to the Spook tagset (https://nl.ijs.si/spook/msd/html-en/). Version 1.0 was released in the scope of the project in 2021 but was available only to project participants. This version updates the TEI encoding of the corpus and changes the vertical files so that they also include the SPOOK tags as attribute-value pairs. It also removes the parallel fiction part of the corpus (2 x 35 texts) due to copyright considerations. Note, however, that these texts are included in the concordancer-mounted corpus.
dc.language.iso slv
dc.language.iso fra
dc.language.iso eng
dc.language.iso deu
dc.language.iso ita
dc.publisher Faculty of Arts, University of Ljubljana
dc.relation.isreferencedby https://doi.org/10.4312/9789612375652
dc.rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
dc.rights.uri https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
dc.rights.label ACA
dc.source.uri https://nl.ijs.si/spook/
dc.subject parallel corpus
dc.subject TEI
dc.subject manual translation
dc.title Slovenian translation corpus Spook 1.1
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Špela Vintar spela.vintar@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-2009-0581 Slovenian translation studies - resources and research nationalFunds
size.info 713 texts
size.info 10567 translationUnits
size.info 375494 words
files.count 2
files.size 62434366
featuredService.noske search foreign language parallel texts|https://www.clarin.si/ske/#dashboard?corpname=spook_xx
featuredService.noske search Slovenian parallel texts|https://www.clarin.si/ske/#dashboard?corpname=spook_xx_sl
featuredService.noske search Slovenian reference texts|https://www.clarin.si/ske/#dashboard?corpname=spook_sl
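
The description mentions that the corpus is also distributed as vertical files. The following is a minimal sketch of reading such a file, assuming the usual one-token-per-line convention in which structural markup sits on its own line in angle brackets and token attributes are tab-separated. The column order and the sample rows below are hypothetical and were not taken from the Spook files; check the actual files before relying on any column position.

```python
import io

def read_vertical(lines):
    """Yield ("tag", text) for structural lines and ("token", columns) for token lines.

    Assumption: structural markup occupies a whole line and is wrapped in
    angle brackets; every other non-empty line is a tab-separated token row.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("<") and line.endswith(">"):
            yield ("tag", line)
        else:
            yield ("token", line.split("\t"))

# Hypothetical sample: word form, placeholder tag, lemma (invented for illustration).
SAMPLE = "<s>\nDober\tTAG1\tdober\ndan\tTAG2\tdan\n</s>\n"

if __name__ == "__main__":
    for kind, value in read_vertical(io.StringIO(SAMPLE)):
        print(kind, value)
```

The generator leaves the columns unnamed on purpose, since this record does not list the attribute order used in Spook.vert.zip.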


 Files in this item

This item is for Academic Use and licensed under: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0 (Inform Before Use, Attribution Required, Noncommercial)
Name Spook.TEI.zip
Size 30.42 MB
Format application/zip
Description Linguistically annotated corpus in TEI format
MD5 df125034dfac3e8df39fbc61145e8a91

Name Spook.vert.zip
Size 29.12 MB
Format application/zip
Description Corpus in derived vertical format
MD5 de62e2c6f18a457c7d4f0f4f05f37469
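
The MD5 values listed for the two archives can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; it assumes both archives were saved under their listed names in one directory, and the function names `md5_of_file` and `verify_downloads` are illustrative rather than part of any CLARIN.SI tooling.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "Spook.TEI.zip": "df125034dfac3e8df39fbc61145e8a91",
    "Spook.vert.zip": "de62e2c6f18a457c7d4f0f4f05f37469",
}

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_downloads(directory="."):
    """Map each archive name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results

if __name__ == "__main__":
    for name, status in verify_downloads().items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.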