dc.contributor.author | Dobranić, Filip |
dc.contributor.author | Konda, Karin |
dc.contributor.author | Evkoski, Bojan |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2024-01-10T13:20:24Z |
dc.date.available | 2024-01-10T13:20:24Z |
dc.date.issued | 2024-01-10 |
dc.identifier.uri | http://hdl.handle.net/11356/1907 |
dc.description | The post-OCR correction dataset consists of paragraphs of text, each at least 100 characters long, extracted from documents randomly sampled from the sPeriodika dataset (http://hdl.handle.net/11356/1881) of Slovenian historical periodicals. Five paragraphs were randomly sampled from each document. Paragraphs longer than 500 characters were trimmed to that length. The correction was performed by one human annotator who had access to the scan of the original document. Of the original collection of 450 paragraphs, 41 were discarded because they contained non-running text or the OCR quality was very poor, leaving 409. The CSV dataset contains the following fields: URN of the document; link to the original PDF in dLib; name of the periodical; publisher of the periodical; publication date; original text; corrected text; line offset (zero-indexed); character length of the paragraph (trimmed to max. 500 characters). |
dc.language.iso | slv |
dc.publisher | Institute of Contemporary History |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.inz.si/en/dihur/ |
dc.subject | historical corpus |
dc.subject | optical character recognition |
dc.subject | post-correction |
dc.title | Post-OCR correction training dataset sPeriodika-postOCR |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Filip Dobranić; filip.dobranic@inz.si; Institute of Contemporary History |
sponsor | Slovenian Research Agency (ARRS); P6-0436; Basic national research program 'Digital Humanities' (2022-2027); nationalFunds |
sponsor | Jožef Stefan Institute; CLARIN; CLARIN.SI; nationalFunds |
size.info | 409 texts |
files.count | 1 |
files.size | 307809 |
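The sampling rule given in the description (paragraphs of at least 100 characters, five per document, trimmed to 500 characters) can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_paragraphs`, the constants, and the behaviour for documents with fewer than five eligible paragraphs (here: take all of them) are assumptions made for illustration only.

```python
import random

# Thresholds taken from the dataset description.
MIN_LEN = 100   # minimum paragraph length in characters
MAX_LEN = 500   # paragraphs longer than this are trimmed
PER_DOC = 5     # paragraphs sampled per document


def sample_paragraphs(paragraphs, rng):
    """Hypothetical helper: sample and trim paragraphs from one document.

    `paragraphs` is a list of strings, `rng` a random.Random instance.
    """
    # Keep only paragraphs that meet the minimum length.
    eligible = [p for p in paragraphs if len(p) >= MIN_LEN]
    # Assumption: if fewer than PER_DOC paragraphs qualify, take all of them.
    chosen = rng.sample(eligible, min(PER_DOC, len(eligible)))
    # Trim each sampled paragraph to at most MAX_LEN characters.
    return [p[:MAX_LEN] for p in chosen]
```

Passing a seeded `random.Random` makes the sample reproducible; the record does not state whether the original sampling was seeded.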
Files in this item
Name | sPeriodika-postOCR.csv |
Size | 300.59 KB |
Format | CSV file |
Description | CSV file |
MD5 | 1bcfd4088209dde477cca19725620443 |
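A downloaded copy of sPeriodika-postOCR.csv can be checked against the MD5 checksum and byte size listed in this record. The sketch below uses only the Python standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path is whatever location the file was saved to.

```python
import hashlib
import os

# Values copied from this record (MD5 field and files.size).
EXPECTED_MD5 = "1bcfd4088209dde477cca19725620443"
EXPECTED_SIZE = 307809  # bytes


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if both the file size and the MD5 digest match."""
    return (
        os.path.getsize(path) == expected_size
        and md5_of_file(path) == expected_md5
    )
```

For example, `verify_download("sPeriodika-postOCR.csv")` should return `True` for an intact download, assuming the file is in the current directory.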