#!/bin/bash
# written by Shotaro Fujimoto
#=#=#=
# Convert a scanned PDF into a PDF with an embedded, transparent text layer
# generated by the OCR engine 'tesseract'.
#
# **Requirements**
#
# * tesseract
#     * [GitHub repository](https://github.com/tesseract-ocr/tesseract)
#     * Install with `apt-get`:
#
#       ```
#       sudo apt-get install tesseract-ocr tesseract-ocr-jpn
#       ```
#
# * pdftk
# * pdftoppm (part of poppler-utils)
#
# **Usage**
#
# ```
# pdfocr foo.pdf
# ```
#
# **TODO**
#
# * accept command-line options
#=#=
target="$1"
targetwithoutext="${1%.pdf}"
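# Optional pre-flight check. This is a sketch that is not part of the original
# script: `missing_commands` is a hypothetical helper that prints each named
# command that cannot be found on PATH. It only warns, so the rest of the
# script behaves as before.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
missing="$(missing_commands pdftoppm tesseract pdftk | tr '\n' ' ')"
if [ -n "${missing}" ]; then
    echo "pdfocr: warning: required commands not found: ${missing}" >&2
fi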
# Render each page to an image with pdftoppm
mkdir "${targetwithoutext}.d"
# - png
pdftoppm -png "${target}" "${targetwithoutext}.d/page"
# - ppm
# pdftoppm "${target}" "${targetwithoutext}.d/page"
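# Run tesseract on each page image (up to 8 jobs in parallel); each run
# writes a searchable single-page PDF next to its image.
# -n1 is dropped because GNU xargs treats it as mutually exclusive with -I,
# and the "./" prefix is dropped so that absolute paths also work.
# - png
find "${targetwithoutext}.d" -type f -name "*.png" | sed 's/\.png$//' | xargs -P8 -I% tesseract %.png % -l eng+jpn pdf
# - ppm
# find "${targetwithoutext}.d" -type f -name "*.ppm" | sed 's/\.ppm$//' | xargs -P8 -I% tesseract %.ppm % -l eng+jpn pdf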
# Merge the per-page PDFs into a single PDF file
pdftk "${targetwithoutext}.d"/*.pdf cat output "${targetwithoutext}-ocr.pdf"
# Remove the temporary directory
rm -r "${targetwithoutext}.d"