| dc.contributor.author |
Vintar, Špela |
| dc.contributor.author |
Gorjanc, Vojko |
| dc.contributor.author |
Erjavec, Tomaž |
| dc.contributor.author |
Fišer, Darja |
| dc.contributor.author |
Mezeg, Adriana |
| dc.date.accessioned |
2026-03-11T11:45:45Z |
| dc.date.available |
2026-03-11T11:45:45Z |
| dc.date.issued |
2026-03-10 |
| dc.identifier.uri |
http://hdl.handle.net/11356/2077 |
| dc.description |
The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The first type comprises foreign-language texts in French, English, German, and Italian. The second type comprises the corresponding texts in Slovenian. These two types of texts are aligned on the sentence level and are comparable in terms of genre and time of publication. The third type consists of original Slovenian texts and is comparable to the Slovenian part of the parallel corpora.
The transcription of the texts and the paragraph-level alignment of the originals and translations were performed manually.
The texts were automatically tokenised, sentence-segmented, PoS-tagged and lemmatised in 2012. Linguistic processing of the Slovenian texts was performed by ToTaLe (which used TnT for PoS tagging and CLOG for lemmatisation), while the German, English, French and Italian texts were analysed by TreeTagger. The PoS tags in the corpus are given in two variants. The first variant is the tag as output by the tagger: the MULTEXT-East tag for Slovenian (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html) and the language-specific TreeTagger tag for each of the other languages. The second variant is a mapping of the original tags to the Spook tagset (https://nl.ijs.si/spook/msd/html-en/).
Version 1.0 was released in the scope of the project in 2021 but was available only to project participants. The present version (1.1) updates the TEI encoding of the corpus and changes the vertical files so that they also include the Spook tags as attribute-value pairs. It also removes the parallel fiction part of the corpus (2 x 35 texts) due to copyright considerations. Note, however, that these texts are still included in the concordancer-mounted corpus. |
| dc.language.iso |
slv |
| dc.language.iso |
fra |
| dc.language.iso |
eng |
| dc.language.iso |
deu |
| dc.language.iso |
ita |
| dc.publisher |
Faculty of Arts, University of Ljubljana |
| dc.relation.isreferencedby |
https://doi.org/10.4312/9789612375652 |
| dc.rights |
CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0 |
| dc.rights.uri |
https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0 |
| dc.rights.label |
ACA |
| dc.source.uri |
https://nl.ijs.si/spook/ |
| dc.subject |
parallel corpus |
| dc.subject |
TEI |
| dc.subject |
manual translation |
| dc.title |
Slovenian translation corpus Spook 1.1 |
| dc.type |
corpus |
| metashare.ResourceInfo#ContentInfo.mediaType |
text |
| has.files |
yes |
| branding |
CLARIN.SI data & tools |
| contact.person |
Špela Vintar spela.vintar@ijs.si Jožef Stefan Institute |
| sponsor |
ARRS (Slovenian Research Agency) J6-2009-0581 Slovenian translation studies - resources and research nationalFunds |
| size.info |
713 texts |
| size.info |
10567 translationUnits |
| size.info |
375494 words |
| files.count |
2 |
| files.size |
62434366 |
| featuredService.noske |
search foreign language parallel texts|https://www.clarin.si/ske/#dashboard?corpname=spook_xx |
| featuredService.noske |
search Slovenian parallel texts|https://www.clarin.si/ske/#dashboard?corpname=spook_xx_sl |
| featuredService.noske |
search Slovenian reference texts|https://www.clarin.si/ske/#dashboard?corpname=spook_sl |