Skip to content

Rust --capture api silently fails to extract base64 images (regex over-captures markdown title) #116

@konard

Description

@konard

Summary

web-capture --capture api --format markdown -o doc.md against any Google Doc on Rust 0.3.11 emits a multi-megabyte markdown with all base64 images inlined, even though the default contract should extract or use direct links (see issues 01 / 02 of this batch).

extract_and_save_images runs but reports extracted = 0 because every base64 decode fails. JS in the same configuration extracts cleanly because Buffer.from(b64, 'base64') silently strips invalid characters; Rust's strict decoder rejects them.

Reproduces on cargo install web-capture --version 0.3.11 and on main HEAD.

Root cause

rust/src/extract_images.rs and rust/src/gdocs.rs::extract_base64_images use this regex:

Regex::new(r"!\[([^\]]*)\]\(data:image/(png|jpeg|jpg|gif|webp|svg\+xml);base64,([^)]+)\)")

The third capture [^)]+ is greedy and stops only at ). The Rust HTML→Markdown converter emits image syntax with a trailing markdown title attribute, e.g. ![](data:image/png;base64,iVBOR...== ""), so base64_data becomes iVBOR...== "" (with the literal ""). STANDARD.decode(...) returns Err(Invalid symbol 61, offset N). The closure swallows the error in map_or_else and returns the original markdown unchanged. Final images.len() == 0.

Two fixes; pick either or both

  1. Stop emitting the empty title. The Rust converter outputs ![](path "") for every <img alt="">. JS does not. This is also a parity bug worth fixing on its own — see test below.
  2. Tighten the extract regex so the title cannot leak into the base64 group:
    r#"!\[([^\]]*)\]\(data:image/(png|jpeg|jpg|gif|webp|svg\+xml);base64,([A-Za-z0-9+/=]+)(?:\s+"[^"]*")?\)"#

Fix 1 is preferable; Fix 2 is a defensive belt-and-braces.

Reproducible test (fake data)

Rust unit — rust/tests/integration/extract_images_with_title.rs

use web_capture::extract_images::{extract_and_save_images, extract_base64_to_buffers};

const TINY_PNG: &str = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==";

fn md_with_empty_title() -> String {
    // What the html2md pipeline currently emits for <img alt="" src="...">
    format!("Hello.\n\n![](data:image/png;base64,{TINY_PNG} \"\")\n\nWorld.\n")
}

#[test]
fn extract_and_save_images_handles_image_with_empty_title() {
    let tmp = tempfile::tempdir().unwrap();
    let result = extract_and_save_images(&md_with_empty_title(), tmp.path(), "images").unwrap();
    assert_eq!(result.extracted, 1);
    assert!(result.markdown.contains("images/image-"));
    assert!(!result.markdown.contains("data:image"));
    assert_eq!(std::fs::read_dir(tmp.path().join("images")).unwrap().count(), 1);
}

#[test]
fn extract_base64_to_buffers_handles_image_with_empty_title() {
    let result = extract_base64_to_buffers(&md_with_empty_title(), "images").unwrap();
    assert_eq!(result.images.len(), 1);
    assert!(!result.markdown.contains("data:image"));
}

Rust converter — rust/tests/integration/markdown_no_empty_title.rs

#[test]
fn img_with_empty_alt_must_not_emit_empty_title() {
    let html = r#"<p><img alt="" src="data:image/png;base64,iVBORw0KGgo="></p>"#;
    let md = web_capture::markdown::convert_html_to_markdown(html).unwrap();
    assert!(md.contains("![]("),    "expected markdown image syntax: {md}");
    assert!(!md.contains(r#" "")"#), "must NOT emit a trailing empty title attribute, got: {md}");
}

JS sanity — js/tests/unit/extract-images.test.js

it('extracts a base64 image even when followed by an empty markdown title', () => {
  const TINY_PNG = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==';
  const md = `Hello.\n\n![](data:image/png;base64,${TINY_PNG} "")\n\nEnd.\n`;
  const result = extractAndSaveImages(md, tmpDir, 'images');
  expect(result.extracted).toBe(1);
  expect(result.markdown).toMatch(/images\/image-[a-f0-9]+\.png/);
});

JS likely passes today via lenient Buffer decoding; the test pins the contract.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions