
 
dc.contributor.author Dobranić, Filip
dc.contributor.author Konda, Karin
dc.contributor.author Evkoski, Bojan
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2024-01-10T13:20:24Z
dc.date.available 2024-01-10T13:20:24Z
dc.date.issued 2024-01-10
dc.identifier.uri http://hdl.handle.net/11356/1907
dc.description The post-OCR correction dataset consists of paragraphs of text, each at least 100 characters long, extracted from documents randomly sampled from the sPeriodika dataset (http://hdl.handle.net/11356/1881) of Slovenian historical periodicals. Five paragraphs were randomly sampled from each document. Paragraphs longer than 500 characters were trimmed to that length. The correction was performed by one human annotator who had access to the scan of the original document. Of the original collection of 450 paragraphs, 41 were discarded because they contained non-running text or the OCR quality was very poor, leaving 409 paragraphs. The CSV dataset contains the following metadata:
- URN of the document
- link to the original PDF in dLib
- name of the periodical
- publisher of the periodical
- publication date
- original text
- corrected text
- line offset (zero-indexed)
- character length of the paragraph (trimmed to max. 500 characters)
dc.language.iso slv
dc.publisher Institute of Contemporary History
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.inz.si/en/dihur/
dc.subject historical corpus
dc.subject optical character recognition
dc.subject post-correction
dc.title Post-OCR correction training dataset sPeriodika-postOCR
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Filip Dobranić filip.dobranic@inz.si Institute of Contemporary History
sponsor Slovenian Research Agency (ARRS) P6-0436 Basic national research program 'Digital Humanities' (2022-2027) nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 409 texts
files.count 1
files.size 307809


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name sPeriodika-postOCR.csv
Size 300.59 KB
Format CSV file
Description CSV file
MD5 1bcfd4088209dde477cca19725620443
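The checksum and size information in this record can be used to sanity-check a downloaded copy before working with it. The following is a minimal Python sketch, not part of the dataset's own tooling. It assumes the file has been saved locally as sPeriodika-postOCR.csv, that it is UTF-8 encoded and comma-delimited, and that it has a header row; none of these details is stated in the record. The function names `md5_of` and `load_rows` are hypothetical, and the column names are read from the file rather than assumed.

```python
import csv
import hashlib
from pathlib import Path

# Values copied from this record (MD5 field and size.info).
EXPECTED_MD5 = "1bcfd4088209dde477cca19725620443"
EXPECTED_ROWS = 409


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_rows(path: Path) -> list[dict[str, str]]:
    """Read the CSV into a list of dicts keyed by the file's own header row."""
    # Assumption: UTF-8 encoding, comma delimiter, header row present.
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    csv_path = Path("sPeriodika-postOCR.csv")  # assumed local download location
    if not csv_path.exists():
        print(f"{csv_path} not found; download it from the record first")
    else:
        print("checksum ok:", md5_of(csv_path) == EXPECTED_MD5)
        rows = load_rows(csv_path)
        print("rows:", len(rows), "expected:", EXPECTED_ROWS)
        print("columns:", list(rows[0].keys()) if rows else [])
```

If the row count differs from 409, the delimiter or header assumption is the first thing to check.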
