When --force-ocr option is in use we get chunky images, why not smoothed vector shapes #1499

GJoe2 · 2025-03-25T23:41:30Z

Inkscape can convert all typography to smooth vector shapes. It would be great if OCRmyPDF used that as a base and placed the OCR results on top of it, much like Adobe ClearScan.

What we get:

- Smooth vectors
- Smaller file size

Use cases:
- When we have typography of unknown copyright status, copying text sometimes pastes gibberish and squares. --redo-ocr won't touch it, because it recognizes the content as something else and refuses to go further, and --force-ocr converts it into images, which is not ideal. If we could paste the vector results instead of images, the product would improve overall.

Downsides:
- Inkscape converts the typography into shapes (filled areas), which is inherently more costly than plain lines/outlines, but that could change if Inkscape adds a feature that makes this more efficient for our purposes.

GJoe2 (Author) · 2025-03-25T23:54:59Z

See this PDF as an example:
aci360r10.pdf

A similar feature:
https://acrobatusers.com/tutorials/better-pdf-ocr-clearscan-smaller-looks-better/

jbarlow83 (Maintainer) · 2025-03-26T19:21:50Z

--force-ocr has an important use case: fixing broken text encodings, or documents whose entire font was rendered as vectors ("render text as curves"). Code cannot tell whether a vector represents text, but the distinction matters to users.

For your use case, use --force-ocr --oversample 600 or higher. The --oversample parameter causes rendering at a higher DPI. 400 is the automatic minimum when vectors are present, so perhaps that is not enough for your documents. You could also increase the compression level at the other end: high compression at high DPI gives better results than low compression at insufficient DPI, since you can think of compression as a sort of variable DPI.
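
As a rough illustration of the suggestion above, the command line could be assembled as follows. This is a minimal sketch: build_force_ocr_command is a hypothetical helper (not part of OCRmyPDF), and using --optimize as the "compression level" knob is an assumption about what was meant. Only --force-ocr and --oversample are the options actually named above.

```python
def build_force_ocr_command(src, dst, oversample_dpi=600, optimize_level=None):
    """Return an argv list for the ocrmypdf CLI (hypothetical helper)."""
    if oversample_dpi <= 0:
        raise ValueError("oversample_dpi must be positive")
    cmd = ["ocrmypdf", "--force-ocr", "--oversample", str(oversample_dpi)]
    if optimize_level is not None:
        # Assumption: --optimize is the option meant by "compression level".
        cmd += ["--optimize", str(optimize_level)]
    # Input and output paths come last.
    cmd += [src, dst]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True). Raising oversample_dpi beyond 600 trades processing time and memory for smoother rasterized glyphs.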

If someone wanted to pull out the relevant code from Inkscape and integrate it with OCRmyPDF, I'd consider it.

GJoe2 (Author) · Mar 28, 2025

@jbarlow83, I found the relevant code in Inkscape. It uses the Cairo library to convert text into paths:

svg-builder.cpp

/**
 * Renders the text as a path object using cairo and returns the node object.
 *
 * If the path is empty (e.g. due to trying to render a color bitmap font),
 * return path node with empty "d" attribute. The aria attribute will still
 * contain the original text.
 *
 * cairo_font   - The font that cairo can use to convert text to path.
 * font_size    - The size of the text when drawing the path.
 * transform    - The matrix which will place the text on the page, this is critical
 *                to allow cairo to render all the required parts of the text.
 * cairo_glyphs - A pointer to a list of glyphs to render.
 * count        - A count of the number of glyphs to render.
 */
Inkscape::XML::Node *SvgBuilder::_renderText(std::shared_ptr<CairoFont> cairo_font, double font_size,
                                             const Geom::Affine &transform,
                                             cairo_glyph_t *cairo_glyphs, unsigned int count)
{
    Inkscape::XML::Node *path = _addToContainer("svg:path");
    path->setAttribute("d", "");

    if (!cairo_glyphs || !cairo_font || _aria_label.empty()) {
        std::cerr << "SvgBuilder::_renderText: Invalid argument!" << std::endl;
        return path;
    }

    // The surface isn't actually used, no rendering in cairo takes place.
    cairo_surface_t *surface = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, _width, _height);
    cairo_t *cairo = cairo_create(surface);
    cairo_set_font_face(cairo, cairo_font->getFontFace());
    cairo_set_font_size(cairo, font_size);
    ink_cairo_transform(cairo, transform);
    cairo_glyph_path(cairo, cairo_glyphs, count);
    auto pathv = extract_pathvector_from_cairo(cairo);
    cairo_destroy(cairo);
    cairo_surface_destroy(surface);

    // Failing to render text.
    if (!pathv) {
        std::cerr << "SvgBuilder::_renderText: Failed to render PDF text! " << _aria_label << std::endl;
        return path;
    }

    auto textpath = sp_svg_write_path(*pathv);
    path->setAttribute("d", textpath);

    if (textpath.empty()) {
        std::cerr << "SvgBuilder::_renderText: Empty path! " << _aria_label << std::endl;
    }

    return path;
}

That function is called from another function, which organizes the conversion:

svg-builder.cpp

/**
 * Create path node(s) for text.
 */
Inkscape::XML::Node* SvgBuilder::_flushTextPath(GfxState *state, double text_scale, const Geom::Affine& text_transform)
{
    auto cairo_glyphs = (cairo_glyph_t *)gmallocn(_glyphs.size(), sizeof(cairo_glyph_t));
    unsigned int cairo_glyph_count = 0;

    Inkscape::XML::Node *node = nullptr;
    Inkscape::XML::Node *text_group = nullptr;  // Used to wrap paths if more that one path needed due
                                                // to style changes.

    auto first_glyph = _glyphs.front();
    for (auto it = _glyphs.begin(); it != _glyphs.end();  ++it ) {

        auto glyph = *it;

        // Append the coordinates to their respective strings
        Geom::Point delta_pos(glyph.text_position - first_glyph.text_position);
        delta_pos[1] += glyph.rise;
        delta_pos[1] *= -1.0;   // flip it
        delta_pos *= Geom::Scale(text_scale);

        // Push the data into the cairo glyph list for later rendering.
        cairo_glyphs[cairo_glyph_count].index = glyph.cairo_index;
        cairo_glyphs[cairo_glyph_count].x = delta_pos[Geom::X];
        cairo_glyphs[cairo_glyph_count].y = delta_pos[Geom::Y];
        cairo_glyph_count++;

        bool is_last_glyph = (it + 1) == _glyphs.end();
        bool flush_text = is_last_glyph ? true : (it+1)->style_changed;

        if (flush_text) {
            if (!is_last_glyph && !text_group) {
                text_group = _pushGroup(); // Create <g> wrapper if we have a style change mid-stream.
            }

            double text_size = text_scale * glyph.text_size;

            // Set to 'node' because if the style does NOT change, we won't have a group
            // but still need to set this text's position and blend modes.
            node = _renderText(glyph.cairo_font, text_size, text_transform, cairo_glyphs, cairo_glyph_count);
            if (!node) {
                g_warning("Empty or broken text in PDF file.");
                return nullptr;
            }
            _setTextStyle(node, glyph.state, nullptr, text_transform);

            if (text_group) {
                // Handled by _renderText
                // text_group->appendChild(node);
                // Inkscape::GC::release(node);
            }

            cairo_glyph_count = 0;

            if (is_last_glyph) {
                break;
            }
        }
    }

    // Clean up
    gfree(cairo_glyphs);
    cairo_glyphs = nullptr;

    if (text_group) {
        node = text_group;
        _popGroup();
    }

    node->setAttribute("aria-label", _aria_label);
    _aria_label = "";

    return node;
}
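
The positioning and batching logic of _flushTextPath can be sketched in Python without cairo. This is a simplified illustration, not Inkscape's code: the Glyph dataclass and batch_glyphs are hypothetical names, and the styling, SVG grouping, and aria-label handling are left out. It keeps only the two ideas that matter for a port: positions are taken relative to the first glyph, with the rise added, the y axis flipped, and the result scaled, and a run is flushed whenever the next glyph changes style.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for the glyph records used in _flushTextPath."""
    index: int
    x: float
    y: float
    rise: float = 0.0
    style_changed: bool = False


def batch_glyphs(glyphs, text_scale):
    """Split glyphs into runs of (index, dx, dy), one run per style."""
    if not glyphs:
        return []
    first = glyphs[0]
    runs, current = [], []
    for i, glyph in enumerate(glyphs):
        # Offset from the first glyph; add the rise, flip y, then scale.
        dx = (glyph.x - first.x) * text_scale
        dy = -((glyph.y - first.y) + glyph.rise) * text_scale
        current.append((glyph.index, dx, dy))
        is_last = i == len(glyphs) - 1
        # Flush at the end, or when the next glyph starts a new style.
        if is_last or glyphs[i + 1].style_changed:
            runs.append(current)
            current = []
    return runs
```

Each run would then be passed to whatever plays the role of _renderText, with the font and size of that run. Unlike the C++ above, this sketch returns an empty list for empty input instead of reading the first glyph unconditionally.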

The final styling (fill and stroke) is applied by renderGlyphtext:

cairo-render-context.cpp

/**
 * Called by Layout-TNG-Output, this function decides how to apply styles and
 * write out the final shapes of a set of glyphs to the target.
 *
 * font - The PangoFont to use in cairo.
 * font_matrix - The specific text transform to apply to these glyphs.
 * glyphtext - A list of glyphs to write or render out.
 * style - The style from the span or text node in context.
 * second_pass - True if this is being called in a second pass.
 *
 * Returns true if a second pass is required for fill over stroke paint order.
 */
bool
CairoRenderContext::renderGlyphtext(PangoFont *font, Geom::Affine const &font_matrix,
                                    std::vector<CairoGlyphInfo> const &glyphtext, SPStyle const *style,
                                    bool second_pass)
{
    _prepareRenderText();
    if (_is_omittext)
        return false;

    gpointer fonthash = (gpointer)font;
    cairo_font_face_t *font_face = nullptr;
    if (auto const it = _font_table.find(fonthash); it != _font_table.end()) {
        font_face = it->second;
    }

    FcPattern *fc_pattern = nullptr;

# ifdef CAIRO_HAS_FT_FONT
    PangoFcFont *fc_font = PANGO_FC_FONT(font);
    fc_pattern = fc_font->font_pattern;
    if (font_face == nullptr) {
        font_face = cairo_ft_font_face_create_for_pattern(fc_pattern);
        _font_table[fonthash] = font_face;
    }
# endif

    cairo_save(_cr);
    cairo_set_font_face(_cr, font_face);

    // set the given font matrix
    cairo_matrix_t matrix;
    ink_matrix_to_cairo(matrix, font_matrix);
    cairo_set_font_matrix(_cr, &matrix);

    if (_render_mode == RENDER_MODE_CLIP) {
        if (_clip_mode == CLIP_MODE_MASK) {
            if (style->fill_rule.computed == SP_WIND_RULE_EVENODD) {
                cairo_set_fill_rule(_cr, CAIRO_FILL_RULE_EVEN_ODD);
            } else {
                cairo_set_fill_rule(_cr, CAIRO_FILL_RULE_WINDING);
            }
            _showGlyphs(_cr, font, glyphtext, FALSE);
        } else {
            // just add the glyph paths to the current context
            _showGlyphs(_cr, font, glyphtext, TRUE);
        }
        cairo_restore(_cr);
        return false;
    }

    if (style->mix_blend_mode.set && style->mix_blend_mode.value) {
        cairo_set_operator(_cr, ink_css_blend_to_cairo_operator(style->mix_blend_mode.value));
    }

    bool fill = style->fill.isColor() || style->fill.isPaintserver();
    bool stroke = style->stroke.isColor() || style->stroke.isPaintserver();
    if (!fill && !stroke) {
        cairo_restore(_cr);
        return false;
    }

    // Text never has markers, and no-fill doesn't matter.
    bool stroke_over_fill = style->paint_order.get_order(SP_CSS_PAINT_ORDER_STROKE)
                          > style->paint_order.get_order(SP_CSS_PAINT_ORDER_FILL)
                          || !fill || !stroke;

    bool fill_pass = fill && stroke_over_fill != second_pass;
    bool stroke_pass = stroke && !second_pass;

    if (fill_pass) {
        _setFillStyle(style, Geom::OptRect());
        _showGlyphs(_cr, font, glyphtext, _is_texttopath);
        if (_is_texttopath)
            cairo_fill_preserve(_cr);
    }

    // Stroke paths are generated for texttopath AND glyph output
    // because PDF text output doesn't support stroke and fill
    if (stroke_pass) {
        // And now we don't have a path to stroke, so make one.
        if (!_is_texttopath || !fill_pass)
            _showGlyphs(_cr, font, glyphtext, true);
        _setStrokeStyle(style, Geom::OptRect());
        cairo_stroke(_cr);
    }

    cairo_restore(_cr);
    return !stroke_over_fill && !second_pass;
}
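
The fill/stroke pass selection at the end of renderGlyphtext is compact and easy to misread, so here it is restated as a small pure function. This is a sketch of the boolean logic only, under the assumption that the paint-order comparison is reduced to a single flag; plan_passes and its parameter names are hypothetical, and the clip-mode and blend-mode branches are omitted.

```python
def plan_passes(fill, stroke, stroke_after_fill, second_pass):
    """Return (fill_pass, stroke_pass, needs_second_pass).

    stroke_after_fill stands in for the paint-order comparison in the
    original code (stroke is painted later than fill).
    """
    if not fill and not stroke:
        # Nothing to paint at all.
        return (False, False, False)
    # With only one of fill/stroke present, ordering is irrelevant.
    stroke_over_fill = stroke_after_fill or not fill or not stroke
    # Fill is drawn in the first pass normally, in the second pass
    # when fill has to end up on top of the stroke.
    fill_pass = fill and (stroke_over_fill != second_pass)
    # Stroke is only ever drawn in the first pass.
    stroke_pass = stroke and not second_pass
    needs_second_pass = (not stroke_over_fill) and (not second_pass)
    return (fill_pass, stroke_pass, needs_second_pass)
```

For the usual case (fill, then stroke on top) a single pass does both. When the paint order puts fill over stroke, the first pass strokes only and asks for a second pass, which then fills.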

How can we integrate this functionality into the project?

jbarlow83 (Maintainer) · Mar 29, 2025

Ideally, find Python bindings for cairo, pango, and any other dependent libraries, then replicate the same algorithm in Python. cairo-render-context also seems relevant, but we would need to render fonts to a PDF content stream backend instead.
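
One piece of such a port would be turning glyph outline segments into PDF path operators. The following is a minimal sketch, assuming the segments arrive as (type, points) tuples numbered in the order cairo uses for its path data types (move-to, line-to, curve-to, close-path), which is also the shape pycairo's Context.copy_path() is documented to yield. path_to_pdf_operators is a hypothetical helper. Coordinates are assumed to be in PDF user space already; cairo's y axis points down and PDF's points up, so a real integration would need a flip.

```python
# Same numbering as cairo's path data types (assumed input convention).
MOVE_TO, LINE_TO, CURVE_TO, CLOSE_PATH = 0, 1, 2, 3

# PDF path operator and number of coordinates for each segment type.
_OPERATORS = {
    MOVE_TO: ("m", 2),
    LINE_TO: ("l", 2),
    CURVE_TO: ("c", 6),
    CLOSE_PATH: ("h", 0),
}


def _fmt(value):
    """Format a number compactly, with at most three decimals."""
    text = f"{value:.3f}".rstrip("0").rstrip(".")
    return "0" if text in ("", "-0") else text


def path_to_pdf_operators(segments, fill=True):
    """Convert (type, points) segments into PDF content stream text."""
    lines = []
    for kind, points in segments:
        if kind not in _OPERATORS:
            raise ValueError(f"unknown segment type: {kind!r}")
        operator, expected = _OPERATORS[kind]
        if len(points) != expected:
            raise ValueError(f"{operator} expects {expected} numbers")
        lines.append(" ".join([*(_fmt(p) for p in points), operator]))
    if fill:
        # "f" fills the path using the nonzero winding rule.
        lines.append("f")
    return "\n".join(lines)
```

The resulting text could be appended to a page's content stream by whichever PDF library does the writing. That step, and keeping an invisible OCR text layer aligned with the outlines, are outside this sketch.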
