Commit eb6d523

Introduce new APIs to help detect and report overlarge input images. They are available to both userland and tesseract-internal code and can be used to report, and fail early on, images which are too large to fit in memory.

Some very lenient defaults are used for the memory allowance (1.5 GByte for 32-bit builds, 64 GByte for 64-bit builds), but this can be tuned for your machine via the Tesseract global variable `allowed_image_memory_capacity` (DOUBLE type).

NOTE: the allowance limit can be effectively removed by setting this variable to an 'insane' value, e.g. `1.0e30`. HOWEVER, the CheckAndReportIfImageTooLarge() API will still fire for images whose width or height is >= TDIMENSION_MAX, which in the default build is the classic INT16_MAX (32767 px). When compiled with LARGE_IMAGES defined, the width/height limit is raised to 24 bits, i.e. 16777216 px (about 16.7 million pixels per dimension), which would then tolerate images smaller than 16777216 x 16777216 px. (This latter part is a work in progress.)

Related:
- tesseract-ocr#3184
- tesseract-ocr#3885
- tesseract-ocr#3435 (pull request by @stweil, WIP)
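To make the numbers in the commit message concrete, here is a small standalone sketch. It is not code from the commit: the names `kDefaultDimensionMax`, `kLargeImagesDimensionMax`, `kGByte` and `DefaultAllowanceBytes` are hypothetical, and it assumes that "GByte" means 2^30 bytes and that the default allowance is picked by pointer width. The actual `get_max_system_allowance()` implementation is not part of the diff shown here and may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the per-dimension limits named in the commit message.
constexpr std::int64_t kDefaultDimensionMax = INT16_MAX;                  // 32767 px
constexpr std::int64_t kLargeImagesDimensionMax = std::int64_t{1} << 24;  // 16777216 px

// Assumption: one GByte is taken as 2^30 bytes.
constexpr double kGByte = 1024.0 * 1024.0 * 1024.0;

// Assumed rule: 1.5 GByte on 32-bit builds, 64 GByte on 64-bit builds.
constexpr double DefaultAllowanceBytes(std::size_t pointer_size = sizeof(void*)) {
  return (pointer_size <= 4 ? 1.5 : 64.0) * kGByte;
}
```

Under these assumptions a 32-bit build would allow roughly 1.6e9 bytes of estimated image memory and a 64-bit build roughly 6.9e10 bytes.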
1 parent a378f34 commit eb6d523

10 files changed: +652 -19 lines changed

Diff for: include/tesseract/baseapi.h

+32
@@ -28,6 +28,7 @@
 #include "unichar.h"
 
 #include <tesseract/version.h>
+#include <tesseract/memcost_estimate.h> // for ImageCostEstimate
 
 #include <cstdio>
 #include <tuple> // for std::tuple
@@ -119,6 +120,37 @@ class TESS_API TessBaseAPI {
   void SetVisibleImage(Pix *pix);
   Pix* GetVisibleImage();
 
+  /**
+   * Return a memory capacity cost estimate for the given image dimensions and
+   * some heuristics re tesseract behaviour, e.g. input images will be normalized/greyscaled,
+   * then thresholded, all of which will be kept in memory while the session runs.
+   *
+   * Also uses the Tesseract Variable `allowed_image_memory_capacity` to indicate
+   * whether the estimated cost is oversized --> `cost.is_too_large()`
+   *
+   * For user convenience, static functions are provided:
+   * the static functions MAY be used by userland code *before* the high cost of
+   * instantiating a Tesseract instance is incurred.
+   */
+  static ImageCostEstimate EstimateImageMemoryCost(int image_width, int image_height, float allowance = 1.0e30f /* a.k.a.dont_care, use system limit and be done */ );
+  static ImageCostEstimate EstimateImageMemoryCost(const Pix* pix, float allowance = 1.0e30f /* a.k.a. dont_care, use system limit and be done */ );
+
+  /**
+   * Ditto, but this API may be invoked after SetInputImage() or equivalent has been called
+   * and reports the cost estimate for the current instance/image.
+   */
+  ImageCostEstimate EstimateImageMemoryCost() const;
+
+  /**
+   * Helper, which may be invoked after SetInputImage() or equivalent has been called:
+   * reports the cost estimate for the current instance/image via `tprintf()` and returns
+   * `true` when the cost is expected to be too high.
+   *
+   * You can use this as a fast pre-flight check. Many major tesseract APIs perform
+   * this same check as part of their startup routine.
+   */
+  bool CheckAndReportIfImageTooLarge(const Pix* pix = nullptr /* default: use GetInputImage() data */ ) const;
+
   /** Set the name of the bonus output files. Needed only for debugging. */
   void SetOutputName(const char *name);
   const std::string &GetOutputName();

Diff for: include/tesseract/memcost_estimate.h

+62
@@ -0,0 +1,62 @@
+/**********************************************************************
+ * File: memcost_estimate.h
+ * Description: Inline routines and macros for serialisation functions
+ * Author: Ger Hobbelt
+ *
+ * (C) Copyright 1990, Hewlett-Packard Ltd.
+ ** Licensed under the Apache License, Version 2.0 (the "License");
+ ** you may not use this file except in compliance with the License.
+ ** You may obtain a copy of the License at
+ ** http://www.apache.org/licenses/LICENSE-2.0
+ ** Unless required by applicable law or agreed to in writing, software
+ ** distributed under the License is distributed on an "AS IS" BASIS,
+ ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ** See the License for the specific language governing permissions and
+ ** limitations under the License.
+ *
+ **********************************************************************/
+
+#ifndef T_MEMCOST_ESTIMATE_H
+#define T_MEMCOST_ESTIMATE_H
+
+#include <string>
+
+namespace tesseract {
+
+  // Image memory capacity cost estimate report. Cost is measured in BYTES. Cost is reported
+  // (`to_string()`) in GBYTES.
+  //
+  // Uses `allowed_image_memory_capacity` plus some compile-time heuristics to indicate
+  // whether the estimated cost is oversized --> `cost.is_too_large()`
+  struct ImageCostEstimate {
+    float cost;
+
+  protected:
+    float allowed_image_memory_capacity;
+
+  public:
+    ImageCostEstimate()
+      : ImageCostEstimate(0.0f, 1e30f) {
+    }
+
+    ImageCostEstimate(float c, float allowance = 1e30f);
+
+    static float get_max_system_allowance();
+
+    float get_max_allowance() const;
+
+    void set_max_allowance(float allowance);
+
+    bool is_too_large() const;
+
+    std::string to_string() const;
+
+    // implicit conversion
+    operator std::string() const;
+
+    static std::string capacity_to_string(float cost);
+  };
+
+} // namespace tesseract.
+
+#endif
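The header above only declares `ImageCostEstimate`; its implementation file is not among the diffs shown. As a rough illustration of the contract the comments describe (cost held in bytes, reported in GBytes, compared against an allowance), here is a standalone sketch. The type `CostEstimateSketch`, the `gbytes()` helper, the "cost greater than allowance" rule and the 2^30-byte GByte are all assumptions made for illustration, not the fork's actual code.

```cpp
// Standalone sketch of the cost/allowance contract; hypothetical names throughout.
struct CostEstimateSketch {
  float cost;       // estimated memory cost in bytes
  float allowance;  // allowed capacity in bytes

  // Defaults mirror the header: zero cost, 1e30 ("don't care") allowance.
  constexpr CostEstimateSketch(float c = 0.0f, float a = 1e30f)
      : cost(c), allowance(a) {}

  // Assumed semantics: oversized when the estimate exceeds the allowance.
  constexpr bool is_too_large() const { return cost > allowance; }

  // Cost converted to GBytes (assumed to be 2^30 bytes) for reporting.
  constexpr float gbytes() const {
    return cost / (1024.0f * 1024.0f * 1024.0f);
  }
};
```

With the default 1e30 allowance nothing is ever flagged by this rule alone, which matches the commit message's remark that an 'insane' value effectively removes the memory limit.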

Diff for: src/api/baseapi.cpp

+82-4
@@ -75,6 +75,7 @@
 #include <set> // for std::pair
 #include <sstream> // for std::stringstream
 #include <vector> // for std::vector
+#include <cfloat>
 
 #include <allheaders.h> // for pixDestroy, boxCreate, boxaAddBox, box...
 #ifdef HAVE_LIBCURL
@@ -106,14 +107,15 @@ namespace tesseract {
 
 FZ_HEAPDBG_TRACKER_SECTION_START_MARKER(_)
 
-static BOOL_VAR(stream_filelist, false, "Stream a filelist from stdin");
-static STRING_VAR(document_title, "", "Title of output document (used for hOCR and PDF output)");
+BOOL_VAR(stream_filelist, false, "Stream a filelist from stdin");
+STRING_VAR(document_title, "", "Title of output document (used for hOCR and PDF output)");
 #ifdef HAVE_LIBCURL
-static INT_VAR(curl_timeout, 0, "Timeout for curl in seconds");
+INT_VAR(curl_timeout, 0, "Timeout for curl in seconds");
 #endif
 BOOL_VAR(debug_all, false, "Turn on all the debugging features");
 STRING_VAR(vars_report_file, "+", "Filename/path to write the 'Which -c variables were used' report. File may be 'stdout', '1' or '-' to be output to stdout. File may be 'stderr', '2' or '+' to be output to stderr. Empty means no report will be produced.");
 BOOL_VAR(report_all_variables, true, "When reporting the variables used (via 'vars_report_file') also report all *unused* variables, hence the report will always list *all available variables.");
+double_VAR(allowed_image_memory_capacity, ImageCostEstimate::get_max_system_allowance(), "Set maximum memory allowance for image data: this will be used as part of a sanity check for oversized input images.");
 
 
 /** Minimum sensible image size to be worth running tesseract. */
@@ -295,6 +297,66 @@ void TessBaseAPI::SetVisibleImageFilename(const char* name) {
   visible_image_file_ = name ? name : "";
 }
 
+/**
+ * Return a memory capacity cost estimate for the given image dimensions and
+ * some heuristics re tesseract behaviour, e.g. input images will be normalized/greyscaled,
+ * then thresholded, all of which will be kept in memory while the session runs.
+ *
+ * Also uses the Tesseract Variable `allowed_image_memory_capacity` to indicate
+ * whether the estimated cost is oversized --> `cost.is_too_large()`
+ *
+ * For user convenience, static functions are provided:
+ * the static functions MAY be used by userland code *before* the high cost of
+ * instantiating a Tesseract instance is incurred.
+ */
+ImageCostEstimate TessBaseAPI::EstimateImageMemoryCost(int image_width, int image_height, float allowance) {
+  // The heuristics used:
+  //
+  // we reckon with leptonica Pix storage at 4 bytes per pixel,
+  // tesseract storing (worst case) 3 different images: original, greyscale, binary thresholded,
+  // we DO NOT reckon with the extra image that may serve as background for PDF outputs, etc.
+  // we DO NOT reckon with the memory cost for the OCR match tree, etc.
+  // However, we attempt a VERY ROUGH estimate by calculating a 20% overdraft for internal operations'
+  // storage costs.
+  float cost = 4 * 3 * 1.20f;
+  cost *= image_width;
+  cost *= image_height;
+
+  if (allowed_image_memory_capacity > 0.0) {
+    // any ridiculous input values will be replaced by the Tesseract configuration value:
+    if (allowance > allowed_image_memory_capacity || allowance <= 0.0)
+      allowance = allowed_image_memory_capacity;
+  }
+
+  return ImageCostEstimate(cost, allowance);
+}
+
+ImageCostEstimate TessBaseAPI::EstimateImageMemoryCost(const Pix* pix, float allowance) {
+  auto w = pixGetWidth(pix);
+  auto h = pixGetHeight(pix);
+  return EstimateImageMemoryCost(w, h, allowance);
+}
+
+/**
+ * Ditto, but this API may be invoked after SetInputImage() or equivalent has been called
+ * and reports the cost estimate for the current instance/image.
+ */
+ImageCostEstimate TessBaseAPI::EstimateImageMemoryCost() const {
+  return tesseract_->EstimateImageMemoryCost();
+}
+
+/**
+ * Helper, which may be invoked after SetInputImage() or equivalent has been called:
+ * reports the cost estimate for the current instance/image via `tprintf()` and returns
+ * `true` when the cost is expected to be too high.
+ *
+ * You can use this as a fast pre-flight check. Many major tesseract APIs perform
+ * this same check as part of their startup routine.
+ */
+bool TessBaseAPI::CheckAndReportIfImageTooLarge(const Pix* pix) const {
+  return tesseract_->CheckAndReportIfImageTooLarge(pix);
+}
+
 /** Set the name of the output files. Needed only for debugging. */
 void TessBaseAPI::SetOutputName(const char *name) {
   output_file_ = name ? name : "";
@@ -1448,6 +1510,22 @@ bool TessBaseAPI::ProcessPage(Pix *pix, int page_index, const char *filename,
 
   SetImage(pix);
 
+  // Before we start to do *real* work, do a preliminary sanity check re expected memory pressure.
+  // The check MAY recur in some (semi)public APIs that MAY be called later, but this is the big one
+  // and it's a simple check at negligible cost, saving us some headaches when we start feeding large
+  // material to the Tesseract animal.
+  //
+  // TODO: rescale overlarge input images? Or is that left to userland code? (as it'll be pretty fringe anyway)
+  {
+    auto cost = TessBaseAPI::EstimateImageMemoryCost(pix);
+    std::string cost_report = cost;
+    tprintf("Estimated memory pressure: {} for input image size {} x {} px\n", cost_report, pixGetWidth(pix), pixGetHeight(pix));
+
+    if (CheckAndReportIfImageTooLarge(pix)) {
+      return false; // fail early
+    }
+  }
+
   // Image preprocessing on image
   // Grayscale normalization
   int graynorm_mode = tesseract_->preprocess_graynorm_mode;
@@ -1465,7 +1543,7 @@ bool TessBaseAPI::ProcessPage(Pix *pix, int page_index, const char *filename,
   }
 
   // Recognition
-
+
   bool failed = false;
 
   if (tesseract_->tessedit_pageseg_mode == PSM_AUTO_ONLY) {
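The heuristic in `EstimateImageMemoryCost()` boils down to 4 bytes x 3 images x 1.2 overhead = 14.4 bytes per pixel, followed by a clamp of the caller's allowance against the configured capacity. The standalone sketch below restates both steps so the arithmetic can be checked in isolation; `kBytesPerPixel`, `EstimateCostBytes` and `EffectiveAllowance` are hypothetical names, and the configured capacity is passed in as a plain parameter instead of being read from the Tesseract variable.

```cpp
// Bytes per pixel assumed by the heuristic: 4 bytes x 3 images x 20% overhead.
constexpr float kBytesPerPixel = 4 * 3 * 1.20f;  // 14.4

// Multiply in float, one factor at a time (as the commit does), so that
// width * height is never computed as a possibly overflowing int product.
constexpr float EstimateCostBytes(int width, int height) {
  return kBytesPerPixel * static_cast<float>(width) * static_cast<float>(height);
}

// Clamping rule from the diff: when the configured capacity is positive, a
// non-positive or larger-than-configured request falls back to that capacity.
constexpr float EffectiveAllowance(float requested, float configured) {
  return (configured > 0.0f && (requested > configured || requested <= 0.0f))
             ? configured
             : requested;
}
```

As a worked example, a 10000 x 10000 px image is estimated at about 1.44e9 bytes, which sits just under the 1.5 GByte default quoted for 32-bit builds. The default `1.0e30f` argument of the static API is always larger than any sane configured capacity, so it ends up replaced by the configured value.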

Diff for: src/ccmain/tesseractclass.cpp

+41
@@ -668,4 +668,45 @@ void Tesseract::PrepareForTessOCR(BLOCK_LIST *block_list, Tesseract *osd_tess, O
   splitter_.Clear();
 }
 
+// Return a memory capacity cost estimate for the given image / current original image.
+//
+// uses the current original image for the estimate, i.e. tells you the cost estimate of this run:
+ImageCostEstimate Tesseract::EstimateImageMemoryCost(const Pix* pix) const {
+  // default: use pix_original() data
+  if (pix == nullptr) {
+    pix = pix_original();
+  }
+
+  return TessBaseAPI::EstimateImageMemoryCost(pix, allowed_image_memory_capacity);
+}
+
+// Helper, which may be invoked after SetInputImage() or equivalent has been called:
+// reports the cost estimate for the current instance/image via `tprintf()` and returns
+// `true` when the cost is expected to be too high.
+bool Tesseract::CheckAndReportIfImageTooLarge(const Pix* pix) const {
+  // default: use pix_original() data
+  if (pix == nullptr) {
+    pix = pix_original();
+  }
+
+  auto w = pixGetWidth(pix);
+  auto h = pixGetHeight(pix);
+  return CheckAndReportIfImageTooLarge(w, h);
+}
+
+bool Tesseract::CheckAndReportIfImageTooLarge(int width, int height) const {
+  auto cost = TessBaseAPI::EstimateImageMemoryCost(width, height, allowed_image_memory_capacity);
+
+  if (debug_all) {
+    tprintf("Image size & memory cost estimate: {} x {} px, estimated cost {} vs. {} allowed capacity.\n",
+      width, height, cost.to_string(), ImageCostEstimate::capacity_to_string(allowed_image_memory_capacity));
+  }
+
+  if (width >= TDIMENSION_MAX || height >= TDIMENSION_MAX || cost.is_too_large()) {
+    tprintf("ERROR: Image is too large: ({} x {} px, {})\n", width, height, cost.to_string());
+    return true;
+  }
+  return false;
+}
+
 } // namespace tesseract
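The final test in `CheckAndReportIfImageTooLarge(int, int)` combines two independent conditions: a per-dimension limit and the memory allowance. The sketch below isolates that decision so the interaction is easy to see. `ImageTooLarge` and `kDimensionMax` are hypothetical names; the limit is hard-coded to 32767 (the default build's value per the commit message), and "cost greater than allowance" is an assumed reading of `is_too_large()`, whose implementation is not shown.

```cpp
// Assumed per-dimension limit for a default (non-LARGE_IMAGES) build.
constexpr int kDimensionMax = 32767;

// Too large when either dimension reaches the limit OR the estimated cost
// exceeds the allowance; the two conditions are independent of each other.
constexpr bool ImageTooLarge(int width, int height, float cost_bytes,
                             float allowance_bytes,
                             int dimension_max = kDimensionMax) {
  return width >= dimension_max || height >= dimension_max ||
         cost_bytes > allowance_bytes;
}
```

This is why setting the allowance to `1.0e30` does not disable the check entirely: a 40000 px wide image is still rejected on its width alone, however small its estimated memory cost.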

Diff for: src/ccmain/tesseractclass.h

+19-4
@@ -45,10 +45,11 @@
 #include "wordrec.h" // for Wordrec
 #include "imagefind.h" // for ImageFind
 #include "linefind.h" // for LineFinder
-#include "genericvector.h" // for PointerVector (ptr only)
+#include "genericvector.h" // for PointerVector (ptr only)
 
 #include <tesseract/publictypes.h> // for OcrEngineMode, PageSegMode, OEM_L...
 #include <tesseract/unichar.h> // for UNICHAR_ID
+#include <tesseract/memcost_estimate.h> // for ImageCostEstimate
 
 #include <allheaders.h> // for pixDestroy, pixGetWidth, pixGetHe...
 
@@ -181,6 +182,7 @@ struct WordData {
   PointerVector<WERD_RES> lang_words;
 };
 
+
 // Definition of a Tesseract WordRecognizer. The WordData provides the context
 // of row/block, in_word holds an initialized, possibly pre-classified word,
 // that the recognizer may or may not consume (but if so it sets
@@ -231,9 +233,11 @@ class TESS_API Tesseract: public Wordrec {
     pix_grey_.destroy();
     pix_grey_ = grey_pix;
   }
-  //DebugPixa &pix_debug() {
-  //  return pixa_debug_;
-  //}
+#if 0
+  DebugPixa &pix_debug() {
+    return pixa_debug_;
+  }
+#endif
   Image pix_original() const {
     return pix_original_;
   }
@@ -247,6 +251,17 @@ class TESS_API Tesseract: public Wordrec {
     }
   }
 
+  // Return a memory capacity cost estimate for the given image / current original image.
+  //
+  // (unless overridden by the `pix` argument) uses the current original image for the estimate,
+  // i.e. tells you the cost estimate of this run:
+  ImageCostEstimate EstimateImageMemoryCost(const Pix* pix = nullptr /* default: use pix_original() data */) const;
+  // Helper, which may be invoked after SetInputImage() or equivalent has been called:
+  // reports the cost estimate for the current instance/image via `tprintf()` and returns
+  // `true` when the cost is expected to be too high.
+  bool CheckAndReportIfImageTooLarge(const Pix* pix = nullptr /* default: use pix_original() data */) const;
+  bool CheckAndReportIfImageTooLarge(int width, int height) const;
+
   // Returns a pointer to a Pix representing the best available resolution image
   // of the page, with best available bit depth as second priority. Result can
   // be of any bit depth, but never color-mapped, as that has always been

Diff for: src/ccmain/thresholder.cpp

+14-5
@@ -534,15 +534,24 @@ std::tuple<bool, Image, Image, Image> ImageThresholder::Threshold(
 // Threshold the source image as efficiently as possible to the output Pix.
 // Creates a Pix and sets pix to point to the resulting pointer.
 // Caller must use pixDestroy to free the created Pix.
+//
 /// Returns false on error.
 bool ImageThresholder::ThresholdToPix(Image *pix) {
-#if !defined(LARGE_IMAGES)
-  if (image_width_ > INT16_MAX || image_height_ > INT16_MAX) {
-    tprintf("ERROR: Image too large: ({}, {})\n", image_width_, image_height_);
-    return false;
+  // tolerate overlarge images when they're about to be cropped by GetPixRect():
+  if (IsFullImage()) {
+    if (tesseract_->CheckAndReportIfImageTooLarge(pix_)) {
+      return false;
+    }
   }
-#endif
+  else {
+    // validate against the future cropped image size:
+    if (tesseract_->CheckAndReportIfImageTooLarge(rect_width_, rect_height_)) {
+      return false;
+    }
+  }
+
   Image original = GetPixRect();
+
   if (pix_channels_ == 0) {
     // We have a binary image, but it still has to be copied, as this API
     // allows the caller to modify the output.
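The behavioural change in `ThresholdToPix()` is that the size check now looks at the region that will actually be processed: the whole image when no crop is set, otherwise the crop rectangle. The sketch below models only that selection step. `Size`, `DimensionsToValidate` and `WithinDimensionLimit` are hypothetical names, the 32767 limit is the default-build value from the commit message, and the memory-cost part of the real check is deliberately left out.

```cpp
// Hypothetical width/height pair used only for this illustration.
struct Size {
  int width;
  int height;
};

// Pick the dimensions to validate: the full image, or the crop rectangle
// when only part of the image will be thresholded.
constexpr Size DimensionsToValidate(bool is_full_image, Size image, Size rect) {
  return is_full_image ? image : rect;
}

// Per-dimension limit test only (the real check also estimates memory cost).
constexpr bool WithinDimensionLimit(Size s, int limit = 32767) {
  return s.width < limit && s.height < limit;
}
```

Under the removed code, a 50000 x 50000 px scan was rejected outright in non-LARGE_IMAGES builds; with the new logic the same scan passes the dimension test when a 2000 x 1000 px rectangle has been selected, because only the cropped size is validated.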
