Strings consisting of a number plus a percent sign in technical documents are usually treated as formulas #2346
klizet
started this conversation in
Show and tell
Replies: 2 comments
-
For non-paper documents like this one, simply turning off formula recognition should be enough.
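A minimal sketch of what turning it off could look like, assuming the switch is the `enable` flag under the `formula-config` key of a `magic-pdf.json` style config (please check this against the config template shipped with your version). The helper name `disable_formula_recognition` is made up for illustration and is not part of magic-pdf.

```python
import copy
import json


def disable_formula_recognition(config: dict) -> dict:
    """Return a copy of the config with formula recognition switched off.

    Assumption: the switch is config["formula-config"]["enable"].
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    updated.setdefault("formula-config", {})["enable"] = False
    return updated


if __name__ == "__main__":
    # Hypothetical config fragment, not a complete magic-pdf.json.
    example = {"device-mode": "cuda", "formula-config": {"enable": True}}
    print(json.dumps(disable_formula_recognition(example), indent=2))
```

In practice you would load your own config file with `json.load`, pass it through the helper, and write it back with `json.dump`.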
0 replies
-
I made some rollback changes in the function get_all_spans in magic-model.py.
Feel free to take them if you need them.
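For anyone who would rather not patch the library, here is a rough post-processing sketch along the same lines. It is not the change made in get_all_spans. It is a standalone regex pass over the generated Markdown that turns inline formulas containing nothing but a signed number and an escaped percent sign (for example `$-8\%$`) back into plain text. The function name `revert_percent_formulas` is hypothetical.

```python
import re

# Inline math that is only a signed number followed by "\%",
# optionally with a LaTeX thin space "\," in between,
# e.g. $-8\%$ or $20\,\%$.
PERCENT_ONLY = re.compile(
    r"\$\s*([+-]?\d+(?:\.\d+)?)\s*(?:\\,)?\s*\\%\s*\$"
)


def revert_percent_formulas(text: str) -> str:
    """Replace percent-only inline formulas with plain text like -8%."""
    return PERCENT_ONLY.sub(lambda m: m.group(1) + "%", text)


if __name__ == "__main__":
    print(revert_percent_formulas(r"a revised estimate of $-8\%$."))
```

Formulas that contain anything else, such as `$\partial-20\,\%$`, are deliberately left alone, since a misrecognized character cannot be safely guessed from the LaTeX.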
0 replies
-
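Description of the bug
Technical documents contain many character sequences made up of a number followed by a percent sign. MinerU usually recognizes these as formulas, but in many cases such a sequence simply expresses a percentage. By contrast, a $ followed by a number is not recognized as a formula. This situation is very common in finance-related PDF files.
Is there an option to turn off or correct this preference?
How to reproduce the bug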
source: https://www.modeln.com/wp-content/uploads/2024/09/report_ht-industryoutlook.pdf
On page 5, in the sentence “This represents an improvement from the initial forecast of a-20% decline to a revised estimate of -8%.”:
a-20% is parsed as
$\partial-20\,\%$
-8% is parsed as
- $-8\%$
The way the "a" in the first case is parsed may be a separate bug.
Operating system
Windows
Python version
3.10
Software version (magic-pdf --version)
1.0.x
Device mode
cuda