dc.contributor.author | Dobranić, Filip |
dc.contributor.author | Evkoski, Bojan |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2023-12-22T08:40:36Z |
dc.date.available | 2023-12-22T08:40:36Z |
dc.date.issued | 2023-12-20 |
dc.identifier.uri | http://hdl.handle.net/11356/1881 |
dc.description | The corpus of Slovenian periodicals sPeriodika contains linguistically annotated periodicals published during the 18th, 19th, and the beginning of the 20th century (1771-1914). The periodical issues were retrieved from the digital library service of Slovenia's national library (https://dlib.si) in the form of OCR-ed PDF and TXT files. Before the documents were linguistically annotated (lemmatisation, part-of-speech tagging, and named entity recognition) with CLASSLA-Stanza (https://github.com/clarinsi/classla), the OCR-ed texts were corrected with a lightweight and robust approach using cSMTiser (https://github.com/clarinsi/csmtiser), a text normalisation tool based on character-level machine translation. This OCR post-correction model was trained on a set of manually corrected samples (300 random paragraphs at least 100 characters in length) from the original texts.
The documents in the collection are enriched with the following metadata obtained from dLib:
- Document ID (URN)
- Periodical name
- Document (periodical issue) title
- Volume number (if available)
- Issue number (if available)
- Year of publication
- Date of publication (of varying granularity, based on the original metadata available)
- Source (URL of the original digitised document available at dlib.si)
- Image (see below)
- Quality (see below)
The majority of documents are aligned page by page with the scanned images of the original prints. In a concordancer, this metadata gives a single-click route to the image of the page in question, for further investigation and checking of the OCR results. In other cases, only links to full document PDFs are provided. For custom use without a concordancer, the images are available at https://nl.ijs.si/inz/speriodika/ with individual files having the form <document_id>-<page_number>.jpg.
In addition to the metadata published by dLib, the corpus contains estimates of the quality of the OCR-ed text for individual pages. Pages are classified as "low" quality when mistakes are common and the text is suitable mostly for close reading. Otherwise, for pages appropriate for distant reading tasks, the quality metadatum is set to "good". |
dc.language.iso | slv |
dc.publisher | Institute of Contemporary History |
dc.relation.isreferencedby | https://doi.org/10.5281/zenodo.13936418 |
dc.relation.isreferencedby | https://aclanthology.org/2024.lrec-main.61/ |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.inz.si/en/dihur/ |
dc.subject | historical language |
dc.subject | periodicals |
dc.subject | specialised corpus |
dc.title | Corpus of Slovenian periodicals (1771-1914) sPeriodika 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Slovenian Research Agency (ARRS) P6-0436 Basic national research program 'Digital Humanities' (2022-2027) nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 910064957 tokens |
size.info | 708306576 words |
size.info | 50904086 sentences |
files.count | 1 |
files.size | 6569725938 |
featuredService.kontext | search|https://www.clarin.si/kontext/query?corpname=speriodika |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=speriodika |
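The description states that page images are available at https://nl.ijs.si/inz/speriodika/ with file names of the form <document_id>-<page_number>.jpg. A minimal sketch of composing such an address is given below. The function name `page_image_url` and the ID `DOC-EXAMPLE` are hypothetical, and the sketch assumes that the document ID is used verbatim and that page numbers start at 1 without zero-padding; neither is confirmed by the record.

```python
# Base address of the page images, as given in dc.description.
BASE_URL = "https://nl.ijs.si/inz/speriodika/"


def page_image_url(document_id: str, page_number: int) -> str:
    """Compose an image URL of the form <document_id>-<page_number>.jpg.

    Assumptions (not confirmed by the record): the ID is used verbatim,
    and pages are numbered from 1 without zero-padding.
    """
    if page_number < 1:
        raise ValueError("page_number is assumed to start at 1")
    return f"{BASE_URL}{document_id}-{page_number}.jpg"


if __name__ == "__main__":
    # "DOC-EXAMPLE" is a placeholder, not a real document ID from the corpus.
    print(page_image_url("DOC-EXAMPLE", 3))
```

If the real document IDs contain characters that are not URL-safe, they would need to be escaped before use; the record does not say whether this is the case.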
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: sPeriodika.1.0.zip
- Size: 6.12 GB
- Format: application/zip
- Description: The corpus in vertical format with CLARIN.SI registry file
- MD5: 669a56bd72434b36f9a983a76ed81e10
- sPeriodika.1.0
  - speriodika.1.0.vert (49 GB)
  - speriodika.1.0.regi (2 kB)
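
Since the archive is several gigabytes in size, it may be worth checking a download against the MD5 value listed above before unpacking it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` and the local file path are assumptions for illustration, not part of the record.

```python
import hashlib
from pathlib import Path

# MD5 checksum of sPeriodika.1.0.zip as listed in this record.
EXPECTED_MD5 = "669a56bd72434b36f9a983a76ed81e10"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks
    so that a multi-gigabyte archive is never held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    archive = Path("sPeriodika.1.0.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

An MD5 match only indicates that the transfer was not corrupted; it is not a cryptographic guarantee of authenticity.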