📊 Table Structure Recognition

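Introduction

💖 This repository is an inference library for structured recognition of tables in documents. It includes the wired and lineless table recognition models from Alibaba DuGuang, a wired table model contributed by llaipython (WeChat), and the table classification model built into NetEase QAnything.

Features

⚡ Fast: uses ONNXRuntime as the inference engine; single-image inference takes 1-7 s on CPU

🎯 Accurate: a table-type classification model separates wired tables from lineless tables, so each model handles a narrower task and achieves higher accuracy

🛡️ Stable: does not depend on any third-party training framework, only on the necessary base libraries, which avoids package conflicts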

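Online Demo

ModelScope (魔塔) Hugging Face

Demo Results

Benchmark Results

TableRecognitionMetric evaluation tool, Hugging Face dataset, ModelScope dataset, Rapid OCR

Test environment: ubuntu 20.04 python 3.10.10 opencv-python 4.10.0.84

Note: StructEqTable outputs LaTeX. Only the samples that convert successfully to HTML are evaluated, after style tags are stripped.

Surya-Tabled uses its built-in OCR module. Its table model is a row/column recognition model that cannot recognize merged cells, which leads to lower scores.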

Method	TEDS	TEDS-only-structure
surya-tabled(--skip-detect)	0.33437	0.65865
surya-tabled	0.33940	0.67103
deepdoctection(rag-flow)	0.59975	0.69918
ppstructure_table_master	0.61606	0.73892
ppstructure_table_engine	0.67924	0.78653
table_cls + wired_table_rec v1 + lineless_table_rec	0.68507	0.75140
StructEqTable	0.67310	0.81210
RapidTable(SLANet)	0.71654	0.81067
table_cls + wired_table_rec v2 + lineless_table_rec	0.73702	0.80210
RapidTable(SLANet-plus)	0.84481	0.91369
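
For readers unfamiliar with the two metric columns: TEDS (Tree-Edit-Distance-based Similarity) treats the predicted and ground-truth HTML tables as trees and scores them by normalized tree edit distance. The formula below is the commonly used definition, where $T_a$ and $T_b$ are the two trees and $|T|$ is the number of nodes in a tree. It is given as background and is an assumption about the metric, not a description of how the TableRecognitionMetric tool is implemented.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the two trees are identical. TEDS-only-structure is computed the same way but ignores the text inside cells, so it measures only the row, column, and merged-cell layout. This is why methods that rely on weaker OCR show a larger gap between the two columns.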

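Usage Recommendations

wired_table_rec_v2 (highest accuracy on wired tables): wired tables in general scenarios (papers, magazines, journals, receipts, vouchers, bills)

paddlex-SLANet-plus (highest overall accuracy): tables in document scenarios (tables in papers, magazines, and journals)

Installation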

pip install wired_table_rec lineless_table_rec table_cls

Quick Start

import os

from lineless_table_rec import LinelessTableRecognition
from lineless_table_rec.utils_table_recover import format_html, plot_rec_box_with_logic_info, plot_rec_box
from table_cls import TableCls
from wired_table_rec import WiredTableRecognition

lineless_engine = LinelessTableRecognition()
wired_engine = WiredTableRecognition()
table_cls = TableCls()
img_path = "images/img14.jpg"

# Classify the table first, then pick the matching recognition engine
table_type, elapse = table_cls(img_path)
if table_type == "wired":
    table_engine = wired_engine
else:
    table_engine = lineless_engine

html, elapse, polygons, logic_points, ocr_res = table_engine(img_path)
print(f"elapse: {elapse}")

# Use a different OCR model
# from rapidocr_onnxruntime import RapidOCR
# ocr_engine = RapidOCR(det_model_dir="xxx/det_server_infer.onnx", rec_model_dir="xxx/rec_server_infer.onnx")
# ocr_res, _ = ocr_engine(img_path)
# html, elapse, polygons, logic_points, ocr_res = table_engine(img_path, ocr_result=ocr_res)

# Save the result as a complete HTML file
# output_dir = "outputs"
# complete_html = format_html(html)
# os.makedirs(output_dir, exist_ok=True)
# with open(f"{output_dir}/table.html", "w", encoding="utf-8") as file:
#     file.write(complete_html)
# # Visualize the table cell boxes with logical row/column information
# plot_rec_box_with_logic_info(
#     img_path, f"{output_dir}/table_rec_box.jpg", logic_points, polygons
# )
# # Visualize the OCR boxes
# plot_rec_box(img_path, f"{output_dir}/ocr_box.jpg", ocr_res)
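
The engine returns the table as an HTML string. If you need the cell text as plain Python data, one option is to parse that string with the standard library. The sketch below assumes the HTML uses ordinary `<tr>`, `<td>`, and `<th>` tags; the names `TableRowParser`, `html_table_to_rows`, and `sample_html` are hypothetical and not part of this repository.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hand-written sample standing in for the `html` value returned by an engine.
sample_html = (
    "<html><body><table>"
    '<tr><td colspan="2">Item</td><td>Price</td></tr>'
    "<tr><td>A</td><td>B</td><td>3.50</td></tr>"
    "</table></body></html>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

A merged cell appears once in its row, so rows can have different lengths. Expand `colspan` and `rowspan` yourself if you need a rectangular grid.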

Table Rotation and Perspective Correction

For GPU inference or higher-accuracy scenarios, see the RapidTableDet project.

pip install rapid-table-det

import os
import cv2
from rapid_table_det.utils import img_loader, visuallize, extract_table_img
from rapid_table_det.inference import TableDetector

table_det = TableDetector()
img_path = "tests/test_files/chip.jpg"
result, elapse = table_det(img_path)
img = img_loader(img_path)
extract_img = img.copy()
# An image may contain multiple tables
for i, res in enumerate(result):
    box = res["box"]
    lt, rt, rb, lb = res["lt"], res["rt"], res["rb"], res["lb"]
    # Draw the detection box and mark the table's top-left corner (orientation)
    img = visuallize(img, box, lt, rt, rb, lb)
    # Extract the table image with a perspective transform
    wrapped_img = extract_table_img(extract_img.copy(), lt, rt, rb, lb)
    # To save the results, define out_dir and file_name first, then uncomment:
#     cv2.imwrite(f"{out_dir}/{file_name}-extract-{i}.jpg", wrapped_img)
# cv2.imwrite(f"{out_dir}/{file_name}-visualize.jpg", img)
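
To give a sense of what a perspective extraction involves, the sketch below computes the size of the upright rectangle that four detected corners would be mapped onto. It illustrates a common approach and is not the implementation of `extract_table_img`; the helper names `warped_size` and `target_corners` and the corner coordinates are hypothetical.

```python
import math


def warped_size(lt, rt, rb, lb):
    """Return (width, height) of the upright rectangle for a quadrilateral.

    The width is the longer of the top and bottom edges, and the height is
    the longer of the left and right edges, so no side is shrunk.
    """
    width = max(math.dist(lt, rt), math.dist(lb, rb))
    height = max(math.dist(lt, lb), math.dist(rt, rb))
    return round(width), round(height)


def target_corners(width, height):
    """Destination corners in the same lt, rt, rb, lb order as the input."""
    return [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]


# Made-up corners of a slightly rotated table, in pixel coordinates.
size = warped_size((10, 20), (410, 40), (400, 340), (0, 320))
corners = target_corners(*size)
print(size, corners)
```

With OpenCV, the source corners and `corners` can then be passed to `cv2.getPerspectiveTransform`, and the resulting matrix to `cv2.warpPerspective` together with `size`, to produce the rectified table image.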

FAQ (Frequently Asked Questions)

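Q: Text inside the recognized boxes is missing
- A: The default OCR model is the small RapidOCR model. For higher accuracy, download a more accurate OCR model from the model list and pass its output as ocr_result when calling the engine.
Q: Do the models support GPU acceleration?
- A: Table model inference is already fast: on the order of 100 ms for wired tables and 500 ms for lineless tables. Most of the time is spent in the OCR stage; see rapidocr_paddle to speed up OCR.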
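
To check where the time goes on your own machine, you can time each stage separately. The helper below is a minimal sketch using only the standard library; `timed` and `fake_stage` are hypothetical names, and `fake_stage` stands in for a real OCR or table engine call.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Stand-in for an OCR or table engine call; replace with a real engine.
def fake_stage(x):
    time.sleep(0.01)
    return x * 2


value, seconds = timed("fake_stage", fake_stage, 21)
```

Wrapping the OCR call and the table engine call (with `ocr_result` passed in) in separate `timed` calls shows how the total splits between the two stages.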

TODO List

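Add a method for correcting small-angle image skew
Enlarge the dataset and add more benchmark comparisons
Add table detection and extraction for complex scenes, to address low recognition rates caused by rotation and perspective
Improve the table classifier and the lineless table model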

Processing Pipeline

flowchart TD
    A[/Table image/] --> B([Table classification table_cls])
    B --> C([Wired table recognition wired_table_rec]) & D([Lineless table recognition lineless_table_rec]) --> E([Text recognition rapidocr_onnxruntime])
    E --> F[/Structured HTML output/]
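
The routing step in the flowchart can be sketched as a small function that takes the classifier and both engines as arguments. This is an illustration under assumptions: `recognize_table` and the stub callables are hypothetical, the stubs only mimic the return shapes used in the Quick Start code, and the label `"other"` is a placeholder for whatever the classifier returns for non-wired tables.

```python
def recognize_table(img_path, classify, wired_engine, lineless_engine):
    """Route an image to the wired or lineless engine based on its class."""
    table_type, _ = classify(img_path)
    engine = wired_engine if table_type == "wired" else lineless_engine
    # The engines return several values; only the HTML string is kept here.
    html, *_ = engine(img_path)
    return table_type, html


# Stub callables standing in for TableCls and the two recognition engines.
def stub_classify(path):
    return ("wired" if "wired" in path else "other"), 0.0


def stub_wired(path):
    return "<table><!-- wired --></table>", 0.0, [], [], []


def stub_lineless(path):
    return "<table><!-- lineless --></table>", 0.0, [], [], []


wired_result = recognize_table(
    "wired_sample.jpg", stub_classify, stub_wired, stub_lineless
)
lineless_result = recognize_table(
    "receipt.jpg", stub_classify, stub_wired, stub_lineless
)
print(wired_result, lineless_result)
```

Passing the engines in as arguments keeps the routing logic testable without loading any model; in real use you would pass the `TableCls`, `WiredTableRecognition`, and `LinelessTableRecognition` instances instead of the stubs.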

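Acknowledgements

Many thanks to llaipython (WeChat; offers a full suite of paid high-accuracy table extraction services) for providing the high-accuracy wired table model.

Many thanks to MajexH for running the table recognition tests for deepdoctection(rag-flow).

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Sponsorship

If you would like to sponsor this project, click the Sponsor button at the top of this page. Please include a note with your GitHub account name so that you can be added to the sponsor list.

License

This project is released under the Apache 2.0 license.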

MetrodataTeam/TableStructureRec
