littletomatodonkey committed on 2022-08-23 11:34: add layout en

Layout Analysis Dataset

This page lists common datasets for layout analysis. The list is updated continuously, and contributions of additional datasets are welcome.

Most layout analysis datasets are object detection datasets. In addition to the open source datasets below, you can label or synthesize your own datasets using tools such as labelme.
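Because layout datasets are object detection datasets, self-labeled data usually has to be turned into bounding boxes. The following is a minimal sketch, assuming annotations in the labelme JSON layout (a `shapes` list whose entries carry `label` and `points`); the function name `labelme_shapes_to_boxes` and the sample document are hypothetical and only for illustration.

```python
def labelme_shapes_to_boxes(labelme_doc):
    """Convert labelme shapes into (label, [x, y, width, height]) pairs.

    Works for rectangles (two corner points) and polygons alike by taking
    the axis-aligned bounding box of all points in the shape.
    """
    boxes = []
    for shape in labelme_doc.get("shapes", []):
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        x_min, y_min = min(xs), min(ys)
        width, height = max(xs) - x_min, max(ys) - y_min
        boxes.append((shape["label"], [x_min, y_min, width, height]))
    return boxes


# Hypothetical annotation for one page image, written inline for the example.
example_doc = {
    "imageWidth": 800,
    "imageHeight": 1000,
    "shapes": [
        {"label": "title", "shape_type": "rectangle",
         "points": [[100, 50], [700, 120]]},
        {"label": "text", "shape_type": "polygon",
         "points": [[100, 150], [700, 150], [700, 400], [100, 400]]},
    ],
}

boxes = labelme_shapes_to_boxes(example_doc)
print(boxes)
```

For real files, the dictionary would typically come from `json.load` on each labelme `.json` file rather than being written inline.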

1. PubLayNet dataset

  • Data source: https://github.com/ibm-aur-nlp/PubLayNet
  • Data introduction: The PubLayNet dataset contains about 350000 training images and 11000 validation images. There are 5 categories in total: text, title, list, table, and figure. Some images and their annotations are shown below.
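PubLayNet annotations are distributed as COCO-style JSON, so a quick check of the category distribution can be sketched as follows. This is an illustrative example only: the tiny `sample` dictionary and the function name `count_boxes_per_category` are assumptions, and a real annotation file would be loaded with `json.load` from wherever you downloaded it.

```python
from collections import Counter


def count_boxes_per_category(coco):
    """Count annotated boxes per category name in a COCO-style dictionary."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


# Hypothetical miniature annotation set with the five PubLayNet category names.
sample = {
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "images": [{"id": 10, "file_name": "page_0.jpg", "width": 612, "height": 792}],
    "annotations": [
        {"id": 1, "image_id": 10, "category_id": 1, "bbox": [50, 60, 500, 120]},
        {"id": 2, "image_id": 10, "category_id": 1, "bbox": [50, 200, 500, 90]},
        {"id": 3, "image_id": 10, "category_id": 4, "bbox": [50, 320, 500, 200]},
    ],
}

category_counts = count_boxes_per_category(sample)
print(category_counts)
```

Categories with no annotations simply do not appear in the result, which makes missing classes easy to spot.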

2. CDLA dataset

  • Data source: https://github.com/buptlihang/CDLA
  • Data introduction: The CDLA dataset contains 5000 training images and 1000 validation images in 10 categories: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, and Equation. Some images and their annotations are shown below.

3. TableBank dataset

  • Data source: https://doc-analysis.github.io/tablebank-page/index.html
  • Data introduction: The TableBank dataset contains 2 types of documents: LaTeX (187199 training images, 7265 validation images, and 5719 testing images) and Word (73383 training images, 2735 validation images, and 2281 testing images). Some images and their annotations are shown below.