Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0

Name: Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Dobranić, Filip; Evkoski, Bojan; Ljubešić, Nikola

dc.contributor.author	Dobranić, Filip
dc.contributor.author	Evkoski, Bojan
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2023-12-22T08:40:36Z
dc.date.available	2023-12-22T08:40:36Z
dc.date.issued	2023-12-20
dc.identifier.uri	http://hdl.handle.net/11356/1881
dc.description	The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th, 19th, and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from the digital library service of Slovenia's national library (https://dlib.si) as OCR-ed PDF and TXT files. Before the documents were linguistically annotated (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts, cf. http://hdl.handle.net/11356/1907.
The documents in the collection are enriched with the following metadata obtained from dLib:
- Document ID (URN)
- Periodical name
- Document (periodical issue) title
- Volume number (if available)
- Issue number (if available)
- Year of publication
- Date of publication (of varying granularity, based on the original metadata available)
- Source (URL of the original digitised document available at dlib.si)
- Image (see below)
- Quality (see below)
The majority of documents are aligned page by page with the scanned images of the original prints. In a concordancer, this metadata provides a single-click route to the image of the page in question, for further investigation and checking of the OCR results. In other cases, only links to full document PDFs are provided. For custom use without a concordancer, the images are available at https://nl.ijs.si/inz/speriodika/ with individual files having the form <document_id>-<page_number>.jpg.
In addition to the metadata published by dLib, the corpus contains estimates of the quality of the OCR-ed text for individual pages. Pages are classified as "low" quality when mistakes are common and the text itself is mostly suitable for close reading. Otherwise, when a page is appropriate for distant reading tasks, the quality metadatum is set to "good".
The corpus is available in vertical format with linguistic annotations, as well as in JSON files, which contain the corpus texts in all stages of processing; see the sample JSON for a README explaining the format and a sample file.
dc.language.iso	slv
dc.publisher	Institute of Contemporary History
dc.relation.isreferencedby	https://doi.org/10.5281/zenodo.13936418
dc.relation.isreferencedby	https://aclanthology.org/2024.lrec-main.61/
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.inz.si/en/dihur/
dc.subject	historical language
dc.subject	periodicals
dc.subject	specialised corpus
dc.title	Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Slovenian Research Agency (ARRS) P6-0436 Basic national research program 'Digital Humanities' (2022-2027) nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	910064957 tokens
size.info	708306576 words
size.info	50904086 sentences
size.info	149391 texts
files.count	3
files.size	20119632492
featuredService.kontext	search|https://www.clarin.si/kontext/query?corpname=speriodika
featuredService.noske	search|https://www.clarin.si/ske/#dashboard?corpname=speriodika
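
The description states that page images are served from https://nl.ijs.si/inz/speriodika/ with file names of the form <document_id>-<page_number>.jpg. A minimal sketch of building such a URL is given below. The function name `page_image_url` and the example identifier are hypothetical, and the sketch assumes the document ID and page number are used exactly as they appear in the corpus metadata; the record does not specify zero-padding or any transformation of the URN, so check a real file name before relying on this.

```python
from urllib.parse import quote

# Base location of the page images, as given in the corpus description.
BASE_URL = "https://nl.ijs.si/inz/speriodika/"


def page_image_url(document_id: str, page_number: str) -> str:
    """Build an image URL of the form <document_id>-<page_number>.jpg.

    Assumption: both values are used verbatim from the corpus metadata.
    The page number is passed as a string so that any padding present in
    the metadata is preserved rather than guessed.
    """
    filename = f"{document_id}-{page_number}.jpg"
    # Percent-encode characters that are unsafe in a URL path; colons are
    # kept as-is in case the identifier is a URN.
    return BASE_URL + quote(filename, safe=":")


# Hypothetical identifier, for illustration only.
example_url = page_image_url("DOC-EXAMPLE", "3")
print(example_url)
```

Taking the page number as a string rather than an integer is a deliberate choice here: it avoids silently dropping leading zeros if the published file names happen to use them.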