dyning committed on 2020-07-17 15:38 . fix en doc

Handwritten OCR dataset

This page collects commonly used handwritten OCR datasets and is updated continuously. Contributions of additional datasets are welcome.

  • Institute of Automation, Chinese Academy of Sciences - handwritten Chinese dataset
  • NIST handwritten single character dataset - English

Institute of Automation, Chinese Academy of Sciences - handwritten Chinese dataset

  • Data source: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html

  • Data introduction:

    • The dataset includes both online and offline handwritten data. HWDB1.0~1.2 contains 3,895,135 handwritten single-character samples in 7,356 categories (7,185 Chinese characters and 171 English letters, digits, and symbols). HWDB2.0~2.2 contains 5,091 page images, segmented into 52,230 text lines and 1,349,414 characters. All character and text samples are stored as grayscale images. Some sample characters are shown below.

  • Download address: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html

  • Usage suggestions: The single-character samples have a white background and can be combined into a large number of text lines for training. The white background can be made transparent, which makes it easy to overlay the characters on a variety of backgrounds. When the text needs to be semantically meaningful, it is suggested to take the character sequence from a real corpus and assemble each text line from the matching single-character samples.
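
The usage suggestion above can be sketched in a few lines. This is only an illustration under stated assumptions, not the dataset's or PaddleOCR's own tooling: images are plain nested lists of 0-255 grayscale values (255 = white) so the example stays dependency-free, and the helper names `white_to_alpha`, `glyphs_for_text`, and `compose_line` are hypothetical. Decoding the actual HWDB files is not covered here.

```python
import random
from typing import Dict, List

# Assumption: a grayscale image is a list of rows of 0-255 ints, 255 = white paper.
Gray = List[List[int]]


def white_to_alpha(glyph: Gray) -> Gray:
    """Per-pixel opacity: white paper -> 0 (transparent), black ink -> 255."""
    return [[255 - p for p in row] for row in glyph]


def glyphs_for_text(text: str, samples: Dict[str, List[Gray]],
                    rng: random.Random) -> List[Gray]:
    """Pick one random handwritten sample per character of a corpus string.

    Characters with no available sample are skipped.
    """
    return [rng.choice(samples[ch]) for ch in text if ch in samples]


def compose_line(glyphs: List[Gray], background: Gray, gap: int = 2) -> Gray:
    """Paste glyphs left to right onto a copy of the background.

    Each glyph's white background is treated as transparent, so the
    background texture shows through wherever there is no ink.
    """
    out = [row[:] for row in background]
    x = gap
    for glyph in glyphs:
        height, width = len(glyph), len(glyph[0])
        if height > len(out) or x + width > len(out[0]):
            raise ValueError("background too small for this text line")
        alpha = white_to_alpha(glyph)
        for r in range(height):
            for c in range(width):
                a = alpha[r][c]
                # Standard alpha blend of ink over background, in integer math.
                out[r][x + c] = (glyph[r][c] * a + out[r][x + c] * (255 - a)) // 255
        x += width + gap
    return out
```

For example, `compose_line(glyphs_for_text("你好", samples, random.Random(0)), background)` would assemble a two-character line, given a hypothetical `samples` dictionary that maps each character to its decoded images. In a real pipeline the same idea is usually expressed with an image library; the nested-list form is used here only to keep the sketch self-contained.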

NIST handwritten single character dataset - English (NIST Handprinted Forms and Characters Database)
