
 
dc.contributor.author Dobranić, Filip
dc.contributor.author Evkoski, Bojan
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2023-12-22T08:40:36Z
dc.date.available 2023-12-22T08:40:36Z
dc.date.issued 2023-12-20
dc.identifier.uri http://hdl.handle.net/11356/1881
dc.description The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th, 19th, and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from the digital library service of Slovenia's national library (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before the documents were linguistically annotated (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts.
The documents in the collection are enriched with the following metadata obtained from dLib:
- Document ID (URN)
- Periodical name
- Document (periodical issue) title
- Volume number (if available)
- Issue number (if available)
- Year of publication
- Date of publication (of varying granularity, based on the original metadata available)
- Source (URL of the original digitised document available at dlib.si)
- Image (see below)
- Quality (see below)
The majority of documents are aligned page by page with the scanned images of the original prints. In a concordancer, the metadata provides a single-click route to the image of the page in question for further investigation and checking of the OCR results. In other cases, only links to full document PDFs are provided. For custom use without a concordancer, the images are available at https://nl.ijs.si/inz/speriodika/, with individual files named <document_id>-<page_number>.jpg.
In addition to the metadata published by dLib, the corpus contains estimates of the quality of the OCR-ed text for individual pages. Pages are classified as "low" quality when mistakes are common and the text itself is mostly suitable for close reading. Otherwise, for pages appropriate for distant reading tasks, the quality metadatum is set to "good".
dc.language.iso slv
dc.publisher Institute of Contemporary History
dc.relation.isreferencedby https://doi.org/10.5281/zenodo.13936418
dc.relation.isreferencedby https://aclanthology.org/2024.lrec-main.61/
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.inz.si/en/dihur/
dc.subject historical language
dc.subject periodicals
dc.subject specialised corpus
dc.title Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Slovenian Research Agency (ARRS) P6-0436 Basic national research program 'Digital Humanities' (2022-2027) nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 910064957 tokens
size.info 708306576 words
size.info 50904086 sentences
files.count 1
files.size 6569725938
featuredService.kontext search|https://www.clarin.si/kontext/query?corpname=speriodika
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=speriodika
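
The description above states that page images are served from https://nl.ijs.si/inz/speriodika/ with file names of the form <document_id>-<page_number>.jpg. A minimal Python sketch of building such a URL follows. The function name `page_image_url` is hypothetical, and the sketch assumes that the document ID and page number are inserted exactly as they appear in the corpus metadata (no zero-padding or re-encoding), which the record does not specify.

```python
# Base URL quoted from the record's description.
BASE_URL = "https://nl.ijs.si/inz/speriodika/"


def page_image_url(document_id: str, page_number: int) -> str:
    """Build an image URL following the <document_id>-<page_number>.jpg pattern.

    Assumption: both values are used verbatim, as found in the corpus metadata.
    """
    return f"{BASE_URL}{document_id}-{page_number}.jpg"
```

If the real file names pad the page number or encode the URN differently, only the format string needs to change.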


 Files in this item

Name sPeriodika.1.0.zip
Size 6.12 GB
Format application/zip
Description The corpus in vertical format with CLARIN.SI registry file
MD5 669a56bd72434b36f9a983a76ed81e10
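
The MD5 value listed for the archive can be used to check that a download is complete. The following is a minimal Python sketch using the standard `hashlib` module; the function names and the chunk size are illustrative choices, not part of the record.

```python
import hashlib

# MD5 checksum published in this record for sPeriodika.1.0.zip.
EXPECTED_MD5 = "669a56bd72434b36f9a983a76ed81e10"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the 6 GB archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the published checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_published_md5("sPeriodika.1.0.zip")` on the downloaded archive should return `True` if the file matches the checksum listed in this record.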
