Releases: opendatalab/MinerU

magic_pdf-1.3.5-released

17 Apr 03:38
8fb6794

magic_pdf-1.3.4-released

16 Apr 10:05
1b35f04

What's Changed

  • 2025/04/16 1.3.4 Released

    • Slightly improved the speed of OCR detection by removing some unused blocks.
    • Fixed page-level sorting errors caused by footnotes in certain cases. #2244

Full Changelog: magic_pdf-1.3.3-released...magic_pdf-1.3.4-released

magic_pdf-1.3.3-released

14 Apr 10:37
a1df670

magic_pdf-1.3.2-released

12 Apr 11:07
d0ed731

What's Changed

  • 1.3.2 released

    • Fixed incompatible dependency package versions when installing under Python 3.13 on Windows.
    • Optimized memory usage during batch inference.
    • Improved parsing of tables rotated by 90 degrees.
    • Improved parsing accuracy for very large tables in financial report samples.
    • Fixed occasional word concatenation in English text regions when no OCR language is specified (requires updating the model).

Full Changelog: magic_pdf-1.3.1-released...magic_pdf-1.3.2-released

magic_pdf-1.3.1-released

08 Apr 10:22
b60166a

What's Changed

  • 1.3.1 released, fixed some compatibility issues

    • Supported Python 3.13
    • Resolved errors caused by transformers 4.51.0
    • Made a final adaptation for some outdated Linux systems (e.g., CentOS 7); continued support is not guaranteed in later versions (see the installation instructions).

Full Changelog: magic_pdf-1.3.0-released...magic_pdf-1.3.1-released

magic_pdf-1.3.0-released

03 Apr 15:29
3963b96

What's Changed

  • Release of 1.3.0, in this version we made many optimizations and improvements:

    • Installation and compatibility optimization
      • By removing the use of layoutlmv3 in layout, resolved compatibility issues caused by detectron2.
      • Torch version compatibility extended to 2.2~2.6 (excluding 2.5).
      • CUDA compatibility supports 11.8/12.4/12.6 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs.
      • Python compatibility expanded to 3.10~3.12, fixing the automatic downgrade to 0.6.1 when installing in non-3.10 environments.
      • Optimized the offline deployment process; after a successful deployment, no internet connection is needed to download model files.
    • Performance optimization
      • Added batch processing of multiple PDF files (script example), which improves parsing speed for batches of small files (compared to version 1.0.1, formula parsing speed improved by more than 1400% at peak and overall parsing speed by more than 500% at peak).
      • Optimized loading and usage of the mfr model, reducing GPU memory usage and improving parsing speed (requires re-executing the model download process to obtain incremental updates of model files).
      • Optimized GPU memory usage; the project now runs with as little as 6GB.
      • Improved running speed on MPS devices.
    • Parsing effect optimization
      • Updated the mfr model to unimernet(2503), solving the issue of lost line breaks in multi-line formulas.
    • Usability Optimization
      • By using paddleocr2torch, completely replaced the use of the paddle framework and paddleocr in the project, resolving conflicts between paddle and torch, as well as thread safety issues caused by the paddle framework.
      • Added a real-time progress bar during the parsing process to accurately track progress, making the wait less painful.
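
The version ranges listed under "Installation and compatibility optimization" can be checked before installing. The following is only a minimal sketch, not part of MinerU: the helper names are hypothetical, and it assumes "2.2~2.6 (excluding 2.5)" refers to torch minor releases 2.2, 2.3, 2.4 and 2.6.

```python
import sys

# Ranges quoted from the 1.3.0 notes above; all names here are hypothetical.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
SUPPORTED_TORCH = {(2, 2), (2, 3), (2, 4), (2, 6)}  # 2.5 is excluded


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as '2.6.0+cu124'."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def python_supported(version_info=sys.version_info) -> bool:
    """Check an interpreter version tuple against the 1.3.0 Python range."""
    return (version_info[0], version_info[1]) in SUPPORTED_PYTHON


def torch_supported(version: str) -> bool:
    """Check a torch version string against the 1.3.0 torch range."""
    return major_minor(version) in SUPPORTED_TORCH
```

With torch installed, `torch_supported(torch.__version__)` would report whether the local build falls inside the range stated for 1.3.0; later releases may widen these ranges.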

Full Changelog: magic_pdf-1.2.2-released...magic_pdf-1.3.0-released

magic_pdf-1.2.2-released

04 Mar 13:08
6f571bb

What's Changed

  • refactor(magic_pdf): improve paragraph splitting logic and update dependencies by @myhloli in #1838

Full Changelog: magic_pdf-1.2.1-released...magic_pdf-1.2.2-released

magic_pdf-1.2.1-released

03 Mar 10:02
9f6b536

What's Changed

Fixed several bugs:

  • Fixed punctuation marks being affected by the full-width to half-width conversion of letters and digits #1796
  • Fixed caption matching inaccuracies in certain scenarios #1815
  • Fixed formula span loss issues in certain scenarios #1809
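
To illustrate the first fix, the sketch below converts only full-width letters and digits to their ASCII forms and leaves full-width punctuation alone. It is an assumption-based illustration of the technique, not MinerU's actual code; the function names are hypothetical.

```python
# Offset between the full-width forms block (U+FF01..U+FF5E) and ASCII.
FULLWIDTH_OFFSET = 0xFEE0


def is_fullwidth_alnum(ch: str) -> bool:
    """True only for full-width digits and full-width Latin letters."""
    return (
        "\uff10" <= ch <= "\uff19"     # full-width 0-9
        or "\uff21" <= ch <= "\uff3a"  # full-width A-Z
        or "\uff41" <= ch <= "\uff5a"  # full-width a-z
    )


def fullwidth_alnum_to_halfwidth(text: str) -> str:
    """Map full-width letters and digits to ASCII; keep everything else as-is."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if is_fullwidth_alnum(ch) else ch
        for ch in text
    )
```

Restricting the mapping to the three letter and digit ranges is what keeps characters such as the full-width comma (U+FF0C) unchanged, which matters for Chinese text where full-width punctuation is correct.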

Full Changelog: magic_pdf-1.2.0-released...magic_pdf-1.2.1-released

magic_pdf-1.2.0-released

27 Feb 03:03
21451be

What's Changed

This version includes several fixes and improvements to enhance parsing efficiency and accuracy:

  • Performance Optimization
    • Increased classification speed for PDF documents in auto mode.
    • Added high-performance plugin support for Huawei Ascend NPU acceleration mode, with end-to-end speedups of up to 300% in common scenarios (application link).
  • Parsing Optimization
    • Improved parsing logic for documents containing watermarks, significantly enhancing the parsing results for such documents.
    • Enhanced the matching logic for multiple images/tables and captions within a single page, improving the accuracy of image-text matching in complex layouts.
  • Bug Fixes
    • Fixed an issue where image/table spans were incorrectly filled into text blocks under certain conditions.
    • Resolved an issue where title blocks were empty in some cases.

Full Changelog: magic_pdf-1.1.0-released...magic_pdf-1.2.0-released

magic_pdf-1.1.0-released

23 Jan 10:01
19f72c2

What's Changed

In this version we have focused on improving parsing accuracy and efficiency:

  • Model capability upgrade (requires re-executing the model download process to obtain incremental updates of model files)
    • The layout recognition model has been upgraded to the latest doclayout_yolo(2501) model, improving layout recognition accuracy.
    • The formula parsing model has been upgraded to the latest unimernet(2501) model, improving formula recognition accuracy.
  • Performance optimization
    • On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.

Full Changelog: magic_pdf-1.0.1-released...magic_pdf-1.1.0-released