Summary
Running web-capture --capture api --format markdown -o doc.md against any Google Doc on Rust 0.3.11 emits a multi-megabyte Markdown file with every base64 image inlined, even though the default contract is to extract images or use direct links (see issues 01 / 02 of this batch).
extract_and_save_images runs but reports extracted = 0 because every base64 decode fails. JS in the same configuration extracts cleanly because Buffer.from(b64, 'base64') silently strips invalid characters; Rust's strict decoder rejects them.
Reproduces on cargo install web-capture --version 0.3.11 and on main HEAD.
Root cause
rust/src/extract_images.rs and rust/src/gdocs.rs::extract_base64_images use this regex:
Regex::new(r"!\[([^\]]*)\]\(data:image/(png|jpeg|jpg|gif|webp|svg\+xml);base64,([^)]+)\)")
The third capture [^)]+ is greedy and stops only at ). The Rust HTML→Markdown converter emits image syntax with a trailing markdown title attribute, e.g. ![](data:image/png;base64,iVBOR...== ""), so base64_data becomes iVBOR...== "" (including the space and the literal ""). STANDARD.decode(...) returns Err(Invalid symbol 61, offset N). The closure swallows the error in map_or_else and returns the original markdown unchanged, so the final images.len() == 0.
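To make the failure concrete, here is a minimal std-only sketch of the two steps described above. The helper names greedy_payload and looks_like_strict_base64 are hypothetical and not part of web-capture: the first imitates the [^)]+ capture with plain string search instead of the regex crate, and the second is only a rough stand-in for the checks a strict decoder performs (it validates, it does not decode).

```rust
/// Imitates the third capture group `([^)]+)`: everything after `;base64,`
/// up to (but not including) the first `)`. Hypothetical helper.
fn greedy_payload(markdown: &str) -> Option<&str> {
    let marker = ";base64,";
    let start = markdown.find(marker)? + marker.len();
    let rest = &markdown[start..];
    let end = rest.find(')')?;
    if end == 0 {
        return None; // `+` requires at least one character
    }
    Some(&rest[..end])
}

/// Rough stand-in for a strict standard-alphabet check: the length must be a
/// multiple of 4, `=` may only appear as one or two trailing padding bytes,
/// and every other byte must be in the base64 alphabet.
fn looks_like_strict_base64(payload: &str) -> bool {
    let body = payload.trim_end_matches('=');
    let padding = payload.len() - body.len();
    padding <= 2
        && payload.len() % 4 == 0
        && body
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}
```

Under these assumptions, feeding greedy_payload the titled form ![](data:image/png;base64,iVBORw0KGgo= "") yields the payload with the space and quotes still attached, which the strict check rejects. The same image without the title passes.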
Two fixes; pick either or both
- Stop emitting the empty title. The Rust converter outputs ![](data:image/png;base64,... "") for every <img alt="">. JS does not. This is also a parity bug worth fixing on its own; see the converter test below.
- Tighten the extract regex so the title cannot leak into the base64 group:
r#"!\[([^\]]*)\]\(data:image/(png|jpeg|jpg|gif|webp|svg\+xml);base64,([A-Za-z0-9+/=]+)(?:\s+"[^"]*")?\)"#
Fix 1 is preferable; Fix 2 is a defensive belt-and-braces measure.
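As an illustration of what the tightened pattern in Fix 2 accepts and rejects, the following std-only sketch hand-rolls the tail of that regex, ([A-Za-z0-9+/=]+)(?:\s+"[^"]*")?, over the text between ;base64, and the closing ). The function name split_payload_and_title is hypothetical; the actual fix would simply swap the regex string shown above.

```rust
/// Splits the text between `;base64,` and the closing `)` into the base64
/// payload and an optional quoted title. Returns None when the text does not
/// fit the tightened pattern. Hypothetical helper for illustration only.
fn split_payload_and_title(inner: &str) -> Option<(&str, Option<&str>)> {
    let is_b64 = |c: char| c.is_ascii_alphanumeric() || matches!(c, '+' | '/' | '=');

    // The payload is the longest leading run of base64-alphabet characters.
    let payload_len = inner.find(|c: char| !is_b64(c)).unwrap_or(inner.len());
    if payload_len == 0 {
        return None;
    }
    let (payload, rest) = inner.split_at(payload_len);
    if rest.is_empty() {
        return Some((payload, None));
    }

    // An optional title must be preceded by whitespace and wrapped in
    // double quotes, with no further quotes inside it.
    let trimmed = rest.trim_start();
    if trimmed.len() == rest.len() {
        return None;
    }
    let title = trimmed.strip_prefix('"')?.strip_suffix('"')?;
    if title.contains('"') {
        return None;
    }
    Some((payload, Some(title)))
}
```

With this split, the empty title never reaches the decoder: the input iVBORw0KGgo= "" separates into the payload iVBORw0KGgo= and an empty title, while trailing text that is not a quoted title is rejected outright rather than silently captured.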
Reproducible test (fake data)
Rust unit — rust/tests/integration/extract_images_with_title.rs
use web_capture::extract_images::{extract_and_save_images, extract_base64_to_buffers};

const TINY_PNG: &str = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==";

fn md_with_empty_title() -> String {
    // What the html2md pipeline currently emits for <img alt="" src="...">
    format!("Hello.\n\n![](data:image/png;base64,{TINY_PNG} \"\")\n\nWorld.\n")
}

#[test]
fn extract_and_save_images_handles_image_with_empty_title() {
    let tmp = tempfile::tempdir().unwrap();
    let result = extract_and_save_images(&md_with_empty_title(), tmp.path(), "images").unwrap();
    // The image must be written to disk and replaced by a relative link
    assert_eq!(result.extracted, 1);
    assert!(result.markdown.contains("images/image-"));
    assert!(!result.markdown.contains("data:image"));
    assert_eq!(std::fs::read_dir(tmp.path().join("images")).unwrap().count(), 1);
}

#[test]
fn extract_base64_to_buffers_handles_image_with_empty_title() {
    let result = extract_base64_to_buffers(&md_with_empty_title(), "images").unwrap();
    assert_eq!(result.images.len(), 1);
    assert!(!result.markdown.contains("data:image"));
}
Rust converter — rust/tests/integration/markdown_no_empty_title.rs
#[test]
fn img_with_empty_alt_must_not_emit_empty_title() {
    let html = r#"<p><img alt="" src="data:image/png;base64,iVBORw0KGgo="></p>"#;
    let md = web_capture::markdown::convert_html_to_markdown(html).unwrap();
    assert!(md.contains("![](data:image/png;base64,iVBORw0KGgo="), "expected markdown image syntax: {md}");
    assert!(!md.contains(r#" "")"#), "must NOT emit a trailing empty title attribute, got: {md}");
}
JS sanity — js/tests/unit/extract-images.test.js
it('extracts a base64 image even when followed by an empty markdown title', () => {
  const TINY_PNG = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==';
  const md = `Hello.\n\n![](data:image/png;base64,${TINY_PNG} "")\n\nEnd.\n`;
  const result = extractAndSaveImages(md, tmpDir, 'images');
  expect(result.extracted).toBe(1);
  expect(result.markdown).toMatch(/images\/image-[a-f0-9]+\.png/);
});
JS likely passes today via lenient Buffer decoding; the test pins the contract.