Dataset Information

MiBio: A dataset for OCR post-processing evaluation.


ABSTRACT: We introduce a dataset for evaluating OCR post-processing models. The dataset contains fully aligned OCR texts and the corresponding ground truth texts of an English biodiversity book. To better support benchmark evaluation, we extracted the following information into TSV files: 1) 2907 OCR-generated errors, each with its position in the OCR text and its correction in the ground truth text, and 2) ground truth word and sentence segmentation of the OCR texts. In this article, we detail the data preprocessing and provide quantitative data analysis.
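
As an illustration of how an error list of this kind might be consumed, the following is a minimal sketch, not the dataset's actual schema. The column names `start`, `error`, and `correction`, the character-offset convention, and the sample text are all assumptions made for the example; the real TSV layout should be checked against the dataset files.

```python
import csv
import io


def load_errors(tsv_text):
    """Parse a TSV error list into dicts with integer offsets.

    The columns (start, error, correction) are hypothetical, chosen for
    this sketch; they are not taken from the MiBio files.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {"start": int(row["start"]),
         "error": row["error"],
         "correction": row["correction"]}
        for row in reader
    ]


def apply_corrections(ocr_text, errors):
    """Replace each error span in the OCR text with its correction.

    Spans are processed right to left so that earlier offsets remain
    valid after a replacement changes the text length.
    """
    for err in sorted(errors, key=lambda e: e["start"], reverse=True):
        start = err["start"]
        end = start + len(err["error"])
        if ocr_text[start:end] != err["error"]:
            raise ValueError(f"error text not found at offset {start}")
        ocr_text = ocr_text[:start] + err["correction"] + ocr_text[end:]
    return ocr_text


# Made-up sample data, for illustration only.
sample_ocr = "Tbe birds of tbis region"
sample_tsv = "start\terror\tcorrection\n0\tTbe\tThe\n13\ttbis\tthis\n"

corrected = apply_corrections(sample_ocr, load_errors(sample_tsv))
print(corrected)
```

Under these assumptions, a post-processing model could be scored by comparing its proposed corrections against the `correction` column for each listed error position.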

SUBMITTER: Mei J 

PROVIDER: S-EPMC6197712 | biostudies-literature | 2018 Dec

REPOSITORIES: biostudies-literature

Publications

MiBio: A dataset for OCR post-processing evaluation.

Mei J, Islam A, Moh'd A, Wu Y, Milios EE

Data in Brief, 2018-09-15


Similar Datasets

| S-EPMC5456666 | biostudies-other
| S-EPMC8191069 | biostudies-literature
| S-EPMC4157590 | biostudies-literature
| S-EPMC9747619 | biostudies-literature
| S-EPMC9823153 | biostudies-literature
| S-EPMC7722133 | biostudies-literature
| S-EPMC6472396 | biostudies-literature
| S-EPMC9525723 | biostudies-literature
| S-EPMC6400102 | biostudies-literature
| S-EPMC7101258 | biostudies-literature