Commit 87df824

feat: adapt rapidocr 2.0
1 parent 549d11e commit 87df824

10 files changed

+108
-150
lines changed

README.md

Lines changed: 10 additions & 16 deletions
Original file line number | Diff line number | Diff line change
@@ -3,7 +3,7 @@
33
<h1><b><i>RapidOCR 📄 PDF</i></b></h1>
44
</div>
55

6-
<a href=""><img src="https://img.shields.io/badge/Python->=3.6,<3.12-aff.svg"></a>
6+
<a href=""><img src="https://img.shields.io/badge/Python->=3.6-aff.svg"></a>
77
<a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg"></a>
88
<a href="https://pypi.org/project/rapidocr-pdf/"><img alt="PyPI" src="https://img.shields.io/pypi/v/rapidocr-pdf"></a>
99
<a href="https://pepy.tech/project/rapidocr-pdf"><img src="https://static.pepy.tech/personalized-badge/rapidocr-pdf?period=total&units=abbreviation&left_color=grey&right_color=blue&left_text=Downloads"></a>
@@ -33,33 +33,27 @@ C & D --> E(Result)
3333
### Installation
3434

3535
```bash
36-
# CPU-based; depends on rapidocr_onnxruntime
37-
pip install rapidocr_pdf[onnxruntime]
38-
39-
# CPU-based; depends on rapidocr_openvino (faster)
40-
pip install rapidocr_pdf[openvino]
41-
42-
# GPU-based; depends on rapidocr_paddle
43-
# 1. Install the GPU build of the PaddlePaddle framework; see https://www.paddlepaddle.org.cn/
44-
# 2. Install rapidocr_pdf[paddle]
45-
pip install rapidocr_pdf[paddle]
36+
pip install rapidocr_pdf
4637
```
4738

4839
### Usage
4940

50-
Script usage
41+
#### Script usage
42+
43+
`rapidocr_pdf>=0.2.0` has been adapted to `rapidocr>=2.0.0`, so you can pass parameters to select a different OCR inference engine and speed up processing.
44+
The `ocr_params` below are example parameters; for details, see the official RapidOCR documentation: [docs](https://rapidai.github.io/RapidOCRDocs/main/install_usage/rapidocr/usage/#_4)
5145

5246
```python
53-
from rapidocr_pdf import PDFExtracter
47+
from rapidocr_pdf import RapidOCRPDF
5448

55-
pdf_extracter = PDFExtracter()
49+
pdf_extracter = RapidOCRPDF(ocr_params={"Global.with_torch": True})
5650

57-
pdf_path = 'tests/test_files/direct_and_image.pdf'
51+
pdf_path = "tests/test_files/direct_and_image.pdf"
5852
texts = pdf_extracter(pdf_path, force_ocr=False)
5953
print(texts)
6054
```
6155

62-
Command-line usage
56+
#### Command-line usage
6357

6458
```bash
6559
$ rapidocr_pdf -h

demo.py

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
11
# -*- encoding: utf-8 -*-
22
# @Author: SWHL
33
# @Contact: [email protected]
4-
from rapidocr_pdf import PDFExtracter
4+
from rapidocr_pdf import RapidOCRPDF
55

6-
pdf_extracter = PDFExtracter(print_verbose=True)
6+
pdf_extracter = RapidOCRPDF(ocr_params={"Global.with_torch": True})
77

88
pdf_path = "tests/test_files/direct_and_image.pdf"
99
texts = pdf_extracter(pdf_path, force_ocr=False)
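
Note on consuming `texts`: this commit changes the last element of each returned row from `str(v["avg_confidence"])` to the raw value, so callers that compared scores as strings may need adjusting. A minimal sketch of a consumer that tolerates both forms follows. `filter_pages`, `min_score`, and the sample rows are hypothetical, and the sketch assumes every score parses as a number.

```python
from typing import List, Sequence, Union

# A row is [page_index, text, score]; the sample rows below are made up and
# only mimic the shape returned by RapidOCRPDF(...)(pdf_path).
Row = Sequence[Union[str, float]]


def filter_pages(rows: List[Row], min_score: float = 0.5) -> List[str]:
    """Return the texts of pages whose score is at least min_score.

    float() accepts both the string scores of older releases and the
    numeric scores returned after this commit.
    """
    kept = []
    for page, text, score in rows:
        if float(score) >= min_score:
            kept.append(f"[page {page}] {text}")
    return kept


sample = [
    ["0", "first page text", 0.93],
    ["1", "second page text", "0.41"],  # string score, as in older releases
]
print(filter_pages(sample))
```

With the real package, `sample` would be replaced by the `texts` value from the demo script above.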

docs/docs.md

Lines changed: 1 addition & 56 deletions
@@ -1,56 +1 @@
1-
## RapidOCRPDF
2-
<p>
3-
<a href=""><img src="https://img.shields.io/badge/Python->=3.6,<3.12-aff.svg"></a>
4-
<a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg"></a>
5-
<a href="https://pypi.org/project/rapidocr-pdf/"><img alt="PyPI" src="https://img.shields.io/pypi/v/rapidocr-pdf"></a>
6-
<a href="https://pepy.tech/project/rapidocr-pdf"><img src="https://static.pepy.tech/personalized-badge/rapidocr-pdf?period=total&units=abbreviation&left_color=grey&right_color=blue&left_text=Downloads"></a>
7-
<a href="https://semver.org/"><img alt="SemVer2.0" src="https://img.shields.io/badge/SemVer-2.0-brightgreen"></a>
8-
<a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
9-
<a href="https://choosealicense.com/licenses/apache-2.0/"><img alt="GitHub" src="https://img.shields.io/github/license/RapidAI/RapidOCRPDF"></a>
10-
</p>
11-
12-
- Relying on [RapidOCR](https://github.com/RapidAI/RapidOCR), quickly extract text from PDF, including scanned PDF and encrypted PDF.
13-
- Layout restore is not included for now.
14-
15-
16-
### 1. Install package by pypi.
17-
```bash
18-
# base rapidocr_onnxruntime
19-
pip install rapidocr_pdf[onnxruntime]
20-
21-
# base rapidocr_openvino
22-
pip install rapidocr_pdf[openvino]
23-
```
24-
25-
### 2. Usage
26-
- Run by script.
27-
```python
28-
from rapidocr_pdf import PDFExtracter
29-
30-
pdf_extracter = PDFExtracter()
31-
32-
pdf_path = 'tests/test_files/direct_and_image.pdf'
33-
texts = pdf_extracter(pdf_path)
34-
print(texts)
35-
```
36-
- Run by command line.
37-
```bash
38-
$ rapidocr_pdf -h
39-
usage: rapidocr_pdf [-h] [-path FILE_PATH]
40-
41-
options:
42-
-h, --help show this help message and exit
43-
-path FILE_PATH, --file_path FILE_PATH
44-
File path, PDF or images
45-
46-
$ rapidocr_pdf -path tests/test_files/direct_and_image.pdf
47-
```
48-
### 3. Output format.
49-
- **Input**`Union[str, Path, bytes]`
50-
- **Output**`List` \[**Page num**, **Page content** + **score**\], :
51-
```python
52-
[
53-
['0', '达大学拉斯维加斯分校)的一次中文评测中获得最', '0.8969868'],
54-
['1', 'ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network∗\nYuliang Liu‡†', '0.8969868'],
55-
]
56-
```
1+
See [link](https://github.com/RapidAI/RapidOCRPDF) for details.

rapidocr_pdf/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
11
# -*- encoding: utf-8 -*-
22
# @Author: SWHL
33
# @Contact: [email protected]
4-
from .main import PDFExtracter, PDFExtracterError
4+
from .main import RapidOCRPDF, RapidOCRPDFError

rapidocr_pdf/logger.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
# -*- encoding: utf-8 -*-
2+
# @Author: SWHL
3+
# @Contact: [email protected]
4+
import logging
5+
6+
import colorlog
7+
8+
9+
class Logger:
10+
def __init__(self, log_level=logging.DEBUG, logger_name=None):
11+
self.logger = logging.getLogger(logger_name)
12+
self.logger.setLevel(log_level)
13+
self.logger.propagate = False
14+
15+
formatter = colorlog.ColoredFormatter(
16+
"%(log_color)s[%(levelname)s] %(asctime)s [RapidOCR] %(filename)s:%(lineno)d: %(message)s",
17+
log_colors={
18+
"DEBUG": "cyan",
19+
"INFO": "green",
20+
"WARNING": "yellow",
21+
"ERROR": "red",
22+
"CRITICAL": "red,bg_white",
23+
},
24+
)
25+
26+
if not self.logger.handlers:
27+
console_handler = logging.StreamHandler()
28+
console_handler.setFormatter(formatter)
29+
30+
for handler in self.logger.handlers:
31+
self.logger.removeHandler(handler)
32+
33+
console_handler.setLevel(log_level)
34+
self.logger.addHandler(console_handler)
35+
36+
def get_log(self):
37+
return self.logger
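
Note on the handler setup: indentation is lost in this view, but if the `for handler in self.logger.handlers` loop sits inside the `if not self.logger.handlers:` branch, it iterates over an empty list and removes nothing. The guard alone is enough to avoid duplicate handlers. A minimal sketch of that pattern using only the standard library follows. `get_logger` and the logger name are hypothetical, and `colorlog` is deliberately left out so the example runs without extra dependencies.

```python
import logging


def get_logger(name: str, level: int = logging.DEBUG) -> logging.Logger:
    """Return a named logger with exactly one console handler attached."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep records out of the root logger

    # logging.getLogger returns the same object for the same name, so this
    # guard prevents a second handler (and duplicated output) on repeat calls.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setLevel(level)
        handler.setFormatter(
            logging.Formatter(
                "[%(levelname)s] %(asctime)s %(filename)s:%(lineno)d: %(message)s"
            )
        )
        logger.addHandler(handler)
    return logger


first = get_logger("rapidocr_pdf_demo")
second = get_logger("rapidocr_pdf_demo")
print(first is second, len(first.handlers))
```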

rapidocr_pdf/main.py

Lines changed: 33 additions & 65 deletions
@@ -2,60 +2,40 @@
22
# @Author: SWHL
33
# @Contact: [email protected]
44
import argparse
5-
import warnings
65
from pathlib import Path
7-
from typing import Dict, List, Tuple, Union
6+
from typing import Dict, List, Optional, Tuple, Union
87

98
import cv2
10-
import filetype
119
import fitz
1210
import numpy as np
11+
from rapidocr import RapidOCR
1312

14-
from .utils import import_package
13+
from .logger import Logger
14+
from .utils import which_type
1515

1616

17-
class PDFExtracter:
18-
def __init__(self, dpi=200, **ocr_kwargs):
17+
class RapidOCRPDF:
18+
def __init__(self, dpi=200, ocr_params: Optional[Dict] = None):
1919
self.dpi = dpi
20-
21-
ocr_engine = import_package("rapidocr_onnxruntime")
22-
if ocr_engine is None:
23-
ocr_engine = import_package("rapidocr_openvino")
24-
25-
if ocr_engine is None:
26-
ocr_engine = import_package("rapidocr_paddle")
27-
28-
if ocr_engine is not None:
29-
ocr_kwargs.update({
30-
"det_use_cuda": True,
31-
"cls_use_cuda": True,
32-
"rec_use_cuda": True
33-
})
34-
else:
35-
raise ModuleNotFoundError(
36-
"Can't find the rapidocr_onnxruntime/rapidocr_openvino/rapidocr_paddle package.\n Please pip install rapidocr_onnxruntime to run the code."
37-
)
38-
39-
self.text_sys = ocr_engine.RapidOCR(**ocr_kwargs)
20+
self.ocr_engine = RapidOCR(params=ocr_params)
4021
self.empty_list = []
22+
self.logger = Logger(logger_name=__name__).get_log()
4123

4224
def __call__(
43-
self,
44-
content: Union[str, Path, bytes],
45-
force_ocr: bool = False,
25+
self, content: Union[str, Path, bytes], force_ocr: bool = False
4626
) -> List[List[Union[str, str, str]]]:
4727
try:
48-
file_type = self.which_type(content)
28+
file_type = which_type(content)
4929
except (FileExistsError, TypeError) as e:
50-
raise PDFExtracterError("The input content is empty.") from e
30+
raise RapidOCRPDFError("The input content is empty.") from e
5131

5232
if file_type != "pdf":
53-
raise PDFExtracterError("The file type is not PDF format.")
33+
raise RapidOCRPDFError("The file type is not PDF format.")
5434

5535
try:
5636
pdf_data = self.load_pdf(content)
57-
except PDFExtracterError as e:
58-
warnings.warn(str(e))
37+
except RapidOCRPDFError as e:
38+
self.logger.error(e)
5939
return self.empty_list
6040

6141
txts_dict, need_ocr_idxs = self.extract_texts(pdf_data, force_ocr)
@@ -69,7 +49,7 @@ def __call__(
6949
def load_pdf(pdf_content: Union[str, Path, bytes]) -> bytes:
7050
if isinstance(pdf_content, (str, Path)):
7151
if not Path(pdf_content).exists():
72-
raise PDFExtracterError(f"{pdf_content} does not exist.")
52+
raise RapidOCRPDFError(f"{pdf_content} does not exist.")
7353

7454
with open(pdf_content, "rb") as f:
7555
data = f.read()
@@ -78,7 +58,7 @@ def load_pdf(pdf_content: Union[str, Path, bytes]) -> bytes:
7858
if isinstance(pdf_content, bytes):
7959
return pdf_content
8060

81-
raise PDFExtracterError(f"{type(pdf_content)} is not in [str, Path, bytes].")
61+
raise RapidOCRPDFError(f"{type(pdf_content)} is not in [str, Path, bytes].")
8262

8363
def extract_texts(self, pdf_data: bytes, force_ocr: bool) -> Tuple[Dict, List]:
8464
texts, need_ocr_idxs = {}, []
@@ -107,20 +87,19 @@ def convert_img(page):
10787
with fitz.open(stream=pdf_data) as doc:
10888
for i in need_ocr_idxs:
10989
img = convert_img(doc[i])
110-
preds, _ = self.text_sys(img)
111-
if preds:
112-
text = []
113-
confidences = []
114-
for pred in preds:
115-
_, rec_res, confidence = pred
116-
text.append(rec_res)
117-
confidences.append(float(confidence))
118-
119-
avg_confidence = np.mean(confidences) if confidences else 0.0
120-
ocr_res[str(i)] = {
121-
"text": "\n".join(text),
122-
"avg_confidence": avg_confidence
123-
}
90+
91+
preds = self.ocr_engine(img)
92+
if preds.txts is None:
93+
continue
94+
95+
avg_score = (
96+
sum(preds.scores) / len(preds.scores) if preds.scores else 0.0
97+
)
98+
99+
ocr_res[str(i)] = {
100+
"text": "\n".join(preds.txts),
101+
"avg_confidence": avg_score,
102+
}
124103
return ocr_res
125104

126105
def merge_direct_ocr(self, txts_dict: Dict, ocr_res_dict: Dict) -> List[List[str]]:
@@ -131,25 +110,14 @@ def merge_direct_ocr(self, txts_dict: Dict, ocr_res_dict: Dict) -> List[List[str
131110
for page_idx, ocr_data in ocr_res_dict.items():
132111
final_result[page_idx] = {
133112
"text": ocr_data["text"],
134-
"avg_confidence": ocr_data["avg_confidence"]
113+
"avg_confidence": ocr_data["avg_confidence"],
135114
}
136115

137116
final_result = dict(sorted(final_result.items(), key=lambda x: int(x[0])))
138-
return [[k, v["text"], str(v["avg_confidence"])] for k, v in final_result.items()]
139-
140-
@staticmethod
141-
def which_type(content: Union[bytes, str, Path]) -> str:
142-
if isinstance(content, (str, Path)) and not Path(content).exists():
143-
raise FileExistsError(f"{content} does not exist.")
144-
145-
kind = filetype.guess(content)
146-
if kind is None:
147-
raise TypeError(f"The type of {content} does not support.")
148-
149-
return kind.extension
117+
return [[k, v["text"], v["avg_confidence"]] for k, v in final_result.items()]
150118

151119

152-
class PDFExtracterError(Exception):
120+
class RapidOCRPDFError(Exception):
153121
pass
154122

155123

@@ -167,7 +135,7 @@ def main():
167135
)
168136
args = parser.parse_args()
169137

170-
pdf_extracter = PDFExtracter()
138+
pdf_extracter = RapidOCRPDF()
171139

172140
try:
173141
result = pdf_extracter(args.file_path, args.force_ocr)
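
Note on the aggregation logic in this file: the OCR branch now averages `preds.scores` with plain `sum`/`len` instead of `np.mean`, and `merge_direct_ocr` orders pages with `int(x[0])` because the page keys are strings. The sketch below reproduces both steps in isolation. `average_score`, `merge_pages`, and the sample dictionaries are hypothetical, and the direct-text entries are assumed to have the same `text`/`avg_confidence` shape as the OCR ones, since that part of the function is not shown in the diff.

```python
from typing import Dict, List, Optional, Sequence


def average_score(scores: Optional[Sequence[float]]) -> float:
    """Mean of the scores, or 0.0 when there are none (None or empty)."""
    return sum(scores) / len(scores) if scores else 0.0


def merge_pages(direct: Dict[str, Dict], ocr: Dict[str, Dict]) -> List[list]:
    """Merge two page dicts and return rows ordered by numeric page index."""
    merged = {**direct, **ocr}  # OCR results win if a page appears in both
    # Sorting on int(key) avoids the lexicographic order "0", "10", "2".
    ordered = sorted(merged.items(), key=lambda item: int(item[0]))
    return [[page, data["text"], data["avg_confidence"]] for page, data in ordered]


direct_pages = {
    "10": {"text": "page ten", "avg_confidence": 1.0},
    "0": {"text": "page zero", "avg_confidence": 1.0},
}
ocr_pages = {
    "2": {"text": "page two", "avg_confidence": average_score([0.5, 1.0])},
}
for row in merge_pages(direct_pages, ocr_pages):
    print(row)
```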

rapidocr_pdf/utils.py

Lines changed: 15 additions & 0 deletions
@@ -2,6 +2,10 @@
22
# @Author: SWHL
33
# @Contact: [email protected]
44
import importlib
5+
from pathlib import Path
6+
from typing import Union
7+
8+
import filetype
59

610

711
def import_package(name, package=None):
@@ -10,3 +14,14 @@ def import_package(name, package=None):
1014
return module
1115
except ModuleNotFoundError:
1216
return None
17+
18+
19+
def which_type(content: Union[bytes, str, Path]) -> str:
20+
if isinstance(content, (str, Path)) and not Path(content).exists():
21+
raise FileExistsError(f"{content} does not exist.")
22+
23+
kind = filetype.guess(content)
24+
if kind is None:
25+
raise TypeError(f"The type of {content} does not support.")
26+
27+
return kind.extension
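
Note on `which_type`: it raises `FileExistsError` for a missing path, and `__call__` in `main.py` catches exactly that type, so the two must change together if the exception is ever switched to the more conventional `FileNotFoundError`. For readers unfamiliar with content-based type detection, the sketch below shows the idea with a hand-rolled check for the `%PDF-` header. It is a simplified stand-in, not what `filetype.guess` does internally; `looks_like_pdf` and `PDF_MAGIC` are hypothetical names.

```python
from pathlib import Path
from typing import Union

# PDF files start with this header, followed by a version number.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content: Union[bytes, str, Path]) -> bool:
    """Check the leading bytes of a path or byte string for the PDF header."""
    if isinstance(content, (str, Path)):
        path = Path(content)
        if not path.exists():
            raise FileNotFoundError(f"{path} does not exist.")
        with open(path, "rb") as f:
            head = f.read(len(PDF_MAGIC))
    elif isinstance(content, bytes):
        head = content[: len(PDF_MAGIC)]
    else:
        raise TypeError(f"{type(content)} is not in [str, Path, bytes].")
    return head == PDF_MAGIC


print(looks_like_pdf(b"%PDF-1.7\n..."))
print(looks_like_pdf(b"\x89PNG\r\n\x1a\n"))
```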

requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -1,2 +1,4 @@
11
filetype>=1.2.0
22
pymupdf
3+
rapidocr
4+
colorlog

setup.py

Lines changed: 3 additions & 6 deletions
@@ -66,14 +66,11 @@ def get_readme():
6666
"Programming Language :: Python :: 3.9",
6767
"Programming Language :: Python :: 3.10",
6868
"Programming Language :: Python :: 3.11",
69+
"Programming Language :: Python :: 3.12",
70+
"Programming Language :: Python :: 3.13",
6971
],
70-
python_requires=">=3.6,<3.12",
72+
python_requires=">=3.6",
7173
entry_points={
7274
"console_scripts": [f"{MODULE_NAME}={MODULE_NAME}.main:main"],
7375
},
74-
extras_require={
75-
"onnxruntime": ["rapidocr_onnxruntime"],
76-
"openvino": ["rapidocr_openvino"],
77-
"paddle": ["rapidocr_paddle"],
78-
},
7976
)
