
Commit 7d2e851

Introduce new APIs to help detect and report overlarge input images. Available to both userland and Tesseract-internal code, they can be used to report, and fail early on, images that are too large to fit in memory.

Some very lenient defaults are used for the memory pressure allowance (1.5 GByte for 32-bit builds, 64 GByte for 64-bit builds), but this can be tuned to your liking and your local machine via the Tesseract global variable `allowed_image_memory_capacity` (DOUBLE type).

NOTE: the allowance limit can be effectively removed by setting this variable to an 'insane' value, e.g. `1.0e30`. HOWEVER, the CheckAndReportIfImageTooLarge() API will still fire for images whose width or height is >= TDIMENSION_MAX, which in the default build is the classic INT16_MAX (32767 px). When compiled with defined(LARGE_IMAGES), the width/height limit is raised to 24 bits (16777216 px per dimension), which would then tolerate images smaller than 16777216 x 16777216 px. (This latter part is a work in progress.)

Related:
- tesseract-ocr#3184
- tesseract-ocr#3885
- tesseract-ocr#3435 (pull request by @stweil, WIP)

# Conflicts:
#	src/api/baseapi.cpp
#	src/ccmain/tesseractclass.h
#	src/ccmain/thresholder.cpp
#	src/ccutil/params.h
#	src/textord/tordmain.cpp
1 parent 9d71da7 commit 7d2e851

10 files changed: +640 -8 lines changed

include/tesseract/baseapi.h

+32
@@ -28,6 +28,7 @@
 #include "unichar.h"
 
 #include <tesseract/version.h>
+#include <tesseract/memcost_estimate.h> // for ImageCostEstimate
 
 #include <cstdio>
 #include <vector> // for std::vector
@@ -114,6 +115,37 @@ class TESS_API TessBaseAPI {
   int GetSourceYResolution();
   const char *GetDatapath();
 
+  /**
+   * Return a memory capacity cost estimate for the given image dimensions and
+   * some heuristics re tesseract behaviour, e.g. input images will be normalized/greyscaled,
+   * then thresholded, all of which will be kept in memory while the session runs.
+   *
+   * Also uses the Tesseract Variable `allowed_image_memory_capacity` to indicate
+   * whether the estimated cost is oversized --> `cost.is_too_large()`
+   *
+   * For user convenience, static functions are provided:
+   * the static functions MAY be used by userland code *before* the high cost of
+   * instantiating a Tesseract instance is incurred.
+   */
+  static ImageCostEstimate EstimateImageMemoryCost(int image_width, int image_height, float allowance = 1.0e30f /* a.k.a. dont_care, use system limit and be done */ );
+  static ImageCostEstimate EstimateImageMemoryCost(const Pix* pix, float allowance = 1.0e30f /* a.k.a. dont_care, use system limit and be done */ );
+
+  /**
+   * Ditto, but this API may be invoked after SetInputImage() or equivalent has been called
+   * and reports the cost estimate for the current instance/image.
+   */
+  ImageCostEstimate EstimateImageMemoryCost() const;
+
+  /**
+   * Helper, which may be invoked after SetInputImage() or equivalent has been called:
+   * reports the cost estimate for the current instance/image via `tprintf()` and returns
+   * `true` when the cost is expected to be too high.
+   *
+   * You can use this as a fast pre-flight check. Many major tesseract APIs perform
+   * this same check as part of their startup routine.
+   */
+  bool CheckAndReportIfImageTooLarge(const Pix* pix = nullptr /* default: use GetInputImage() data */ ) const;
+
   /** Set the name of the bonus output files. Needed only for debugging. */
   void SetOutputName(const char *name);

include/tesseract/memcost_estimate.h

+62
@@ -0,0 +1,62 @@
+/**********************************************************************
+ * File:        memcost_estimate.h
+ * Description: Image memory capacity cost estimate report
+ * Author:      Ger Hobbelt
+ *
+ * (C) Copyright 1990, Hewlett-Packard Ltd.
+ ** Licensed under the Apache License, Version 2.0 (the "License");
+ ** you may not use this file except in compliance with the License.
+ ** You may obtain a copy of the License at
+ ** http://www.apache.org/licenses/LICENSE-2.0
+ ** Unless required by applicable law or agreed to in writing, software
+ ** distributed under the License is distributed on an "AS IS" BASIS,
+ ** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ ** See the License for the specific language governing permissions and
+ ** limitations under the License.
+ *
+ **********************************************************************/
+
+#ifndef T_MEMCOST_ESTIMATE_H
+#define T_MEMCOST_ESTIMATE_H
+
+#include <string>
+
+namespace tesseract {
+
+// Image memory capacity cost estimate report. Cost is measured in BYTES. Cost is reported
+// (`to_string()`) in GBYTES.
+//
+// Uses `allowed_image_memory_capacity` plus some compile-time heuristics to indicate
+// whether the estimated cost is oversized --> `cost.is_too_large()`
+struct ImageCostEstimate {
+  float cost;
+
+protected:
+  float allowed_image_memory_capacity;
+
+public:
+  ImageCostEstimate()
+      : ImageCostEstimate(0.0f, 1e30f) {
+  }
+
+  ImageCostEstimate(float c, float allowance = 1e30f);
+
+  static float get_max_system_allowance();
+
+  float get_max_allowance() const;
+
+  void set_max_allowance(float allowance);
+
+  bool is_too_large() const;
+
+  std::string to_string() const;
+
+  // implicit conversion
+  operator std::string() const;
+
+  static std::string capacity_to_string(float cost);
+};
+
+} // namespace tesseract.
+
+#endif

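The header only declares `ImageCostEstimate`; its implementation is not among the files shown on this page. What follows is a minimal standalone sketch of how such a struct might behave, not the fork's actual code. It assumes that `is_too_large()` simply compares the cost against the allowance, that one GByte is 2^30 bytes, and that the system defaults are the 1.5 GByte / 64 GByte values quoted in the commit message. The name `CostEstimateSketch` and the two-decimal report format are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for tesseract::ImageCostEstimate (illustration only).
struct CostEstimateSketch {
  float cost;       // estimated memory cost, in bytes
  float allowance;  // allowed memory capacity, in bytes

  explicit CostEstimateSketch(float c, float a = 1e30f) : cost(c), allowance(a) {
    assert(c >= 0.0f);  // a negative cost makes no sense
  }

  // Assumed defaults from the commit message: 1.5 GByte for 32-bit builds,
  // 64 GByte for 64-bit builds.
  static float max_system_allowance() {
    const float gbyte = 1024.0f * 1024.0f * 1024.0f;
    return sizeof(void *) <= 4 ? 1.5f * gbyte : 64.0f * gbyte;
  }

  // Assumption: "too large" means the cost exceeds the allowance.
  bool is_too_large() const { return cost > allowance; }

  // Costs are stored in bytes but reported in GBytes.
  static std::string capacity_to_string(float bytes) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.2f GByte", bytes / (1024.0f * 1024.0f * 1024.0f));
    return buf;
  }

  std::string to_string() const { return capacity_to_string(cost); }
};
```

Keeping the cost in bytes and converting only when reporting means the comparison in `is_too_large()` never depends on the display unit.
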
src/api/baseapi.cpp

+79
@@ -74,6 +74,7 @@
 #include <set> // for std::pair
 #include <sstream> // for std::stringstream
 #include <vector> // for std::vector
+#include <cfloat>
 
 #include <allheaders.h> // for pixDestroy, boxCreate, boxaAddBox, box...
 #ifdef HAVE_LIBCURL
@@ -102,6 +103,8 @@ static STRING_VAR(document_title, "", "Title of output document (used for hOCR a
 #ifdef HAVE_LIBCURL
 static INT_VAR(curl_timeout, 0, "Timeout for curl in seconds");
 #endif
+double_VAR(allowed_image_memory_capacity, ImageCostEstimate::get_max_system_allowance(), "Set maximum memory allowance for image data: this will be used as part of a sanity check for oversized input images.");
+
 
 /** Minimum sensible image size to be worth running tesseract. */
 const int kMinRectSize = 10;
@@ -271,6 +274,66 @@ void TessBaseAPI::SetInputName(const char *name) {
   input_file_ = name ? name : "";
 }
 
+/**
+ * Return a memory capacity cost estimate for the given image dimensions and
+ * some heuristics re tesseract behaviour, e.g. input images will be normalized/greyscaled,
+ * then thresholded, all of which will be kept in memory while the session runs.
+ *
+ * Also uses the Tesseract Variable `allowed_image_memory_capacity` to indicate
+ * whether the estimated cost is oversized --> `cost.is_too_large()`
+ *
+ * For user convenience, static functions are provided:
+ * the static functions MAY be used by userland code *before* the high cost of
+ * instantiating a Tesseract instance is incurred.
+ */
+ImageCostEstimate TessBaseAPI::EstimateImageMemoryCost(int image_width, int image_height, float allowance) {
+  // The heuristics used:
+  //
+  // we reckon with leptonica Pix storage at 4 bytes per pixel,
+  // tesseract storing (worst case) 3 different images: original, greyscale, binary thresholded,
+  // we DO NOT reckon with the extra image that may serve as background for PDF outputs, etc.
+  // we DO NOT reckon with the memory cost for the OCR match tree, etc.
+  // However, we attempt a VERY ROUGH estimate by calculating a 20% overdraft for internal operations'
+  // storage costs.
+  float cost = 4 * 3 * 1.20f;
+  cost *= image_width;
+  cost *= image_height;
+
+  if (allowed_image_memory_capacity > 0.0) {
+    // any ridiculous input values will be replaced by the Tesseract configuration value:
+    if (allowance > allowed_image_memory_capacity || allowance <= 0.0)
+      allowance = allowed_image_memory_capacity;
+  }
+
+  return ImageCostEstimate(cost, allowance);
+}
+
+ImageCostEstimate TessBaseAPI::EstimateImageMemoryCost(const Pix* pix, float allowance) {
+  auto w = pixGetWidth(pix);
+  auto h = pixGetHeight(pix);
+  return EstimateImageMemoryCost(w, h, allowance);
+}
+
+/**
+ * Ditto, but this API may be invoked after SetInputImage() or equivalent has been called
+ * and reports the cost estimate for the current instance/image.
+ */
+ImageCostEstimate TessBaseAPI::EstimateImageMemoryCost() const {
+  return tesseract_->EstimateImageMemoryCost();
+}
+
+/**
+ * Helper, which may be invoked after SetInputImage() or equivalent has been called:
+ * reports the cost estimate for the current instance/image via `tprintf()` and returns
+ * `true` when the cost is expected to be too high.
+ *
+ * You can use this as a fast pre-flight check. Many major tesseract APIs perform
+ * this same check as part of their startup routine.
+ */
+bool TessBaseAPI::CheckAndReportIfImageTooLarge(const Pix* pix) const {
+  return tesseract_->CheckAndReportIfImageTooLarge(pix);
+}
+
 /** Set the name of the output files. Needed only for debugging. */
 void TessBaseAPI::SetOutputName(const char *name) {
   output_file_ = name ? name : "";
@@ -1255,6 +1318,22 @@ bool TessBaseAPI::ProcessPage(Pix *pix, int page_index, const char *filename,
                               TessResultRenderer *renderer) {
   SetInputName(filename);
   SetImage(pix);
+  // Before we start to do *real* work, do a preliminary sanity check re expected memory pressure.
+  // The check MAY recur in some (semi)public APIs that MAY be called later, but this is the big one
+  // and it's a simple check at negligible cost, saving us some headaches when we start feeding large
+  // material to the Tesseract animal.
+  //
+  // TODO: rescale overlarge input images? Or is that left to userland code? (as it'll be pretty fringe anyway)
+  {
+    auto cost = TessBaseAPI::EstimateImageMemoryCost(pix);
+    std::string cost_report = cost;
+    tprintf("Estimated memory pressure: {} for input image size {} x {} px\n", cost_report, pixGetWidth(pix), pixGetHeight(pix));
+
+    if (CheckAndReportIfImageTooLarge(pix)) {
+      return false; // fail early
+    }
+  }
+
   bool failed = false;
 
   if (tesseract_->tessedit_pageseg_mode == PSM_AUTO_ONLY) {

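The heuristic in `EstimateImageMemoryCost(int, int, float)` works out to 4 x 3 x 1.2 = 14.4 bytes per pixel, so a 10000 x 10000 px image is estimated at roughly 1.44e9 bytes. The standalone sketch below reproduces that arithmetic and the allowance clamp outside of Tesseract so the numbers can be checked by hand. The function names `estimate_cost_bytes` and `effective_allowance` are hypothetical, and the configured capacity is passed in as a plain parameter rather than read from the `allowed_image_memory_capacity` variable.

```cpp
#include <cassert>

// 4 bytes per pixel (Pix storage) x 3 images kept in memory
// (original, greyscale, thresholded) x 20% extra for internal operations.
constexpr float kBytesPerPixelEstimate = 4 * 3 * 1.20f;  // 14.4

// Rough memory cost estimate in bytes for an image of the given size.
float estimate_cost_bytes(int width, int height) {
  assert(width >= 0 && height >= 0);
  float cost = kBytesPerPixelEstimate;
  cost *= width;
  cost *= height;
  return cost;
}

// Same clamp as in the diff: when a positive capacity is configured, a
// requested allowance that exceeds it (or is not positive) is replaced by
// the configured value; otherwise the request is used unchanged.
float effective_allowance(float requested, double configured) {
  if (configured > 0.0) {
    if (requested > configured || requested <= 0.0f) {
      return static_cast<float>(configured);
    }
  }
  return requested;
}
```

Because the default argument is `1.0e30f`, a caller who passes nothing always ends up with the configured capacity, which is the "dont_care, use system limit" behaviour the header comment describes.
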
src/ccmain/tesseractclass.cpp

+41
@@ -617,4 +617,45 @@ void Tesseract::PrepareForTessOCR(BLOCK_LIST *block_list, Tesseract *osd_tess, O
   splitter_.Clear();
 }
 
+// Return a memory capacity cost estimate for the given image / current original image.
+//
+// uses the current original image for the estimate, i.e. tells you the cost estimate of this run:
+ImageCostEstimate Tesseract::EstimateImageMemoryCost(const Pix* pix) const {
+  // default: use pix_original() data
+  if (pix == nullptr) {
+    pix = pix_original();
+  }
+
+  return TessBaseAPI::EstimateImageMemoryCost(pix, allowed_image_memory_capacity);
+}
+
+// Helper, which may be invoked after SetInputImage() or equivalent has been called:
+// reports the cost estimate for the current instance/image via `tprintf()` and returns
+// `true` when the cost is expected to be too high.
+bool Tesseract::CheckAndReportIfImageTooLarge(const Pix* pix) const {
+  // default: use pix_original() data
+  if (pix == nullptr) {
+    pix = pix_original();
+  }
+
+  auto w = pixGetWidth(pix);
+  auto h = pixGetHeight(pix);
+  return CheckAndReportIfImageTooLarge(w, h);
+}
+
+bool Tesseract::CheckAndReportIfImageTooLarge(int width, int height) const {
+  auto cost = TessBaseAPI::EstimateImageMemoryCost(width, height, allowed_image_memory_capacity);
+
+  if (debug_all) {
+    tprintf("Image size & memory cost estimate: {} x {} px, estimated cost {} vs. {} allowed capacity.\n",
+            width, height, cost.to_string(), ImageCostEstimate::capacity_to_string(allowed_image_memory_capacity));
+  }
+
+  if (width >= TDIMENSION_MAX || height >= TDIMENSION_MAX || cost.is_too_large()) {
+    tprintf("ERROR: Image is too large: ({} x {} px, {})\n", width, height, cost.to_string());
+    return true;
+  }
+  return false;
+}
+
 } // namespace tesseract

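`CheckAndReportIfImageTooLarge(int, int)` rejects an image on either of two independent grounds: a per-dimension limit or the memory cost estimate. The sketch below isolates that decision as a plain function. It is an illustration under assumptions: `too_large` and the two constants are hypothetical names, the 2^24 limit for `LARGE_IMAGES` builds is inferred from the commit message rather than from the definition of `TDIMENSION_MAX`, and the cost test is assumed to be a simple `cost > allowance` comparison.

```cpp
#include <cassert>
#include <cstdint>

// Default build: the classic INT16_MAX limit (32767 px).
constexpr int kDimensionMaxDefault = INT16_MAX;
// Assumed limit for LARGE_IMAGES builds, per the commit message: 24 bits.
constexpr int kDimensionMaxLarge = 1 << 24;  // 16777216

// Returns true when the image should be rejected: either dimension reaches
// the limit, or the estimated cost exceeds the allowance.
bool too_large(int width, int height, float cost, float allowance,
               int dimension_max = kDimensionMaxDefault) {
  assert(dimension_max > 0);
  return width >= dimension_max || height >= dimension_max || cost > allowance;
}
```

Note that the dimension test uses `>=`, so in a default build a 32767 px wide image is already rejected even when the allowance is set to `1.0e30`.
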
src/ccmain/tesseractclass.h

+14
@@ -46,6 +46,7 @@
 
 #include <tesseract/publictypes.h> // for OcrEngineMode, PageSegMode, OEM_L...
 #include <tesseract/unichar.h> // for UNICHAR_ID
+#include <tesseract/memcost_estimate.h> // for ImageCostEstimate
 
 #include <allheaders.h> // for pixDestroy, pixGetWidth, pixGetHe...
 
@@ -165,6 +166,7 @@ struct WordData {
   PointerVector<WERD_RES> lang_words;
 };
 
+
 // Definition of a Tesseract WordRecognizer. The WordData provides the context
 // of row/block, in_word holds an initialized, possibly pre-classified word,
 // that the recognizer may or may not consume (but if so it sets
@@ -227,6 +229,18 @@ class TESS_API Tesseract : public Wordrec {
       lang->set_pix_original(original_pix ? original_pix.clone() : nullptr);
     }
   }
+
+  // Return a memory capacity cost estimate for the given image / current original image.
+  //
+  // (unless overridden by the `pix` argument) uses the current original image for the estimate,
+  // i.e. tells you the cost estimate of this run:
+  ImageCostEstimate EstimateImageMemoryCost(const Pix* pix = nullptr /* default: use pix_original() data */) const;
+  // Helper, which may be invoked after SetInputImage() or equivalent has been called:
+  // reports the cost estimate for the current instance/image via `tprintf()` and returns
+  // `true` when the cost is expected to be too high.
+  bool CheckAndReportIfImageTooLarge(const Pix* pix = nullptr /* default: use pix_original() data */) const;
+  bool CheckAndReportIfImageTooLarge(int width, int height) const;
+
   // Returns a pointer to a Pix representing the best available resolution image
   // of the page, with best available bit depth as second priority. Result can
   // be of any bit depth, but never color-mapped, as that has always been

src/ccmain/thresholder.cpp

+14 -3
@@ -281,13 +281,24 @@ std::tuple<bool, Image, Image, Image> ImageThresholder::Threshold(
 // Threshold the source image as efficiently as possible to the output Pix.
 // Creates a Pix and sets pix to point to the resulting pointer.
 // Caller must use pixDestroy to free the created Pix.
+//
 /// Returns false on error.
 bool ImageThresholder::ThresholdToPix(Image *pix) {
-  if (image_width_ > INT16_MAX || image_height_ > INT16_MAX) {
-    tprintf("Image too large: (%d, %d)\n", image_width_, image_height_);
-    return false;
+  // tolerate overlarge images when they're about to be cropped by GetPixRect():
+  if (IsFullImage()) {
+    if (tesseract_->CheckAndReportIfImageTooLarge(pix_)) {
+      return false;
+    }
   }
+  else {
+    // validate against the future cropped image size:
+    if (tesseract_->CheckAndReportIfImageTooLarge(rect_width_, rect_height_)) {
+      return false;
+    }
+  }
+
   Image original = GetPixRect();
+
   if (pix_channels_ == 0) {
     // We have a binary image, but it still has to be copied, as this API
     // allows the caller to modify the output.

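The change to `ThresholdToPix()` means an oversized source image is no longer rejected outright: when a crop rectangle is set, only the cropped size is validated. The sketch below shows that selection logic in isolation. `ThresholderSketch` and `dimensions_to_validate` are hypothetical names, and the definition of "full image" used here (rectangle at the origin, same size as the image) is an assumption about what `IsFullImage()` checks.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the relevant ImageThresholder state.
struct ThresholderSketch {
  int image_width, image_height;                     // source image size
  int rect_left, rect_top, rect_width, rect_height;  // region to be processed

  // Assumed meaning of IsFullImage(): the rectangle covers the whole image.
  bool is_full_image() const {
    return rect_left == 0 && rect_top == 0 &&
           rect_width == image_width && rect_height == image_height;
  }

  // Pick the dimensions the size check should see: the full image when no
  // cropping will happen, otherwise the future cropped size.
  std::pair<int, int> dimensions_to_validate() const {
    if (is_full_image()) {
      return {image_width, image_height};
    }
    return {rect_width, rect_height};
  }
};
```

With this selection, a 40000 x 40000 px scan from which only a 2000 x 1000 px region is processed would be checked as 2000 x 1000 px, which is the case the old `INT16_MAX` test rejected.
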