Rust `--capture api` silently fails to extract base64 images (regex over-captures markdown title)

## Summary

`web-capture --capture api --format markdown -o doc.md` against any Google Doc on Rust 0.3.11 emits a multi-megabyte markdown file with all base64 images inlined, even though the default contract is to extract the images or use direct links (see issues 01 / 02 of this batch).

`extract_and_save_images` runs but reports `extracted = 0` because **every base64 decode fails**. JS in the same configuration extracts cleanly because `Buffer.from(b64, 'base64')` silently strips invalid characters; Rust's strict decoder rejects them.

Reproduces on `cargo install web-capture --version 0.3.11` and on `main` HEAD.

## Root cause

`rust/src/extract_images.rs` and `rust/src/gdocs.rs::extract_base64_images` use this regex:

```rust
Regex::new(r"!\[([^\]]*)\]\(data:image/(png|jpeg|jpg|gif|webp|svg\+xml);base64,([^)]+)\)")
```

The third capture `[^)]+` is greedy and stops only at `)`. The Rust HTML→Markdown converter emits image syntax with a trailing markdown title attribute, e.g. `![](data:image/png;base64,iVBOR...== "")`, so `base64_data` becomes `iVBOR...== ""` (with the literal ` ""`). `STANDARD.decode(...)` returns `Err(Invalid symbol 61, offset N)`. The closure swallows the error in `map_or_else` and returns the original markdown unchanged. Final `images.len() == 0`.
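
The over-capture can be illustrated without the `regex` crate. The following is a minimal std-only sketch, not the project's code: `greedy_payload` is a hypothetical stand-in for the third capture group `([^)]+)`, and `is_base64_alphabet` is a simplified alphabet check rather than the strict decoder's actual logic.

```rust
/// Hypothetical stand-in for the third capture group `([^)]+)`:
/// returns everything between ";base64," and the next ')'.
fn greedy_payload(md: &str) -> Option<&str> {
    let marker = ";base64,";
    let start = md.find(marker)? + marker.len();
    let len = md[start..].find(')')?;
    if len == 0 {
        return None; // `+` requires at least one character
    }
    Some(&md[start..start + len])
}

/// Simplified check (assumption: standard alphabet plus '=' padding).
/// Returns true only if every byte could appear in a base64 payload.
fn is_base64_alphabet(s: &str) -> bool {
    s.bytes()
        .all(|b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'='))
}
```

Feeding it `![](data:image/png;base64,iVBORw0KGgo= "")` yields a payload that still carries the trailing space and empty title, so the alphabet check fails, which mirrors why a strict decoder rejects the captured string.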

## Two fixes; pick either or both

1. **Stop emitting the empty title.** The Rust converter outputs `![](path "")` for every `<img alt="">`. JS does not. This is also a parity bug worth fixing on its own — see test below.
2. **Tighten the extract regex** so the title cannot leak into the base64 group:
   ```rust
   r#"!\[([^\]]*)\]\(data:image/(png|jpeg|jpg|gif|webp|svg\+xml);base64,([A-Za-z0-9+/=]+)(?:\s+"[^"]*")?\)"#
   ```

Fix 1 is preferable; Fix 2 is a defensive belt-and-braces measure.
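
The intent of Fix 2 can be sketched in plain std Rust. This is only an illustration under stated assumptions: `split_payload_and_title` is a hypothetical helper, not a function in the crate, and it takes the text that sits between `;base64,` and the closing `)`. It mirrors the `([A-Za-z0-9+/=]+)(?:\s+"[^"]*")?` portion of the tightened pattern.

```rust
/// Hypothetical helper: splits the text between ";base64," and ')' into
/// the base64 payload and an optional quoted markdown title.
/// Returns None if the text does not have that shape.
fn split_payload_and_title(captured: &str) -> Option<(&str, Option<&str>)> {
    let is_b64 = |b: u8| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'=');

    // Longest prefix made only of base64 alphabet bytes (all ASCII,
    // so `end` is always a valid char boundary).
    let end = captured
        .bytes()
        .position(|b| !is_b64(b))
        .unwrap_or(captured.len());
    if end == 0 {
        return None; // payload must be non-empty
    }
    let (payload, rest) = captured.split_at(end);
    if rest.is_empty() {
        return Some((payload, None)); // no title present
    }

    // Otherwise require: whitespace, then a double-quoted title.
    let trimmed = rest.trim_start();
    if trimmed.len() == rest.len() {
        return None; // something other than whitespace followed the payload
    }
    let title = trimmed.strip_prefix('"')?.strip_suffix('"')?;
    if title.contains('"') {
        return None;
    }
    Some((payload, Some(title)))
}
```

With this shape the empty title lands in its own slot and only the payload would reach the decoder. Anything else after the payload is rejected outright instead of being passed along silently.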

## Reproducible test (fake data)

### Rust unit — `rust/tests/integration/extract_images_with_title.rs`

```rust
use web_capture::extract_images::{extract_and_save_images, extract_base64_to_buffers};

const TINY_PNG: &str = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==";

fn md_with_empty_title() -> String {
    // What the html2md pipeline currently emits for <img alt="" src="...">
    format!("Hello.\n\n![](data:image/png;base64,{TINY_PNG} \"\")\n\nWorld.\n")
}

#[test]
fn extract_and_save_images_handles_image_with_empty_title() {
    let tmp = tempfile::tempdir().unwrap();
    let result = extract_and_save_images(&md_with_empty_title(), tmp.path(), "images").unwrap();
    assert_eq!(result.extracted, 1);
    assert!(result.markdown.contains("images/image-"));
    assert!(!result.markdown.contains("data:image"));
    assert_eq!(std::fs::read_dir(tmp.path().join("images")).unwrap().count(), 1);
}

#[test]
fn extract_base64_to_buffers_handles_image_with_empty_title() {
    let result = extract_base64_to_buffers(&md_with_empty_title(), "images").unwrap();
    assert_eq!(result.images.len(), 1);
    assert!(!result.markdown.contains("data:image"));
}
```

### Rust converter — `rust/tests/integration/markdown_no_empty_title.rs`

```rust
#[test]
fn img_with_empty_alt_must_not_emit_empty_title() {
    let html = r#"<p><img alt="" src="data:image/png;base64,iVBORw0KGgo="></p>"#;
    let md = web_capture::markdown::convert_html_to_markdown(html).unwrap();
    assert!(md.contains("![]("),    "expected markdown image syntax: {md}");
    assert!(!md.contains(r#" "")"#), "must NOT emit a trailing empty title attribute, got: {md}");
}
```

### JS sanity — `js/tests/unit/extract-images.test.js`

```js
it('extracts a base64 image even when followed by an empty markdown title', () => {
  const TINY_PNG = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==';
  const md = `Hello.\n\n![](data:image/png;base64,${TINY_PNG} "")\n\nEnd.\n`;
  const result = extractAndSaveImages(md, tmpDir, 'images');
  expect(result.extracted).toBe(1);
  expect(result.markdown).toMatch(/images\/image-[a-f0-9]+\.png/);
});
```

JS likely passes today via lenient Buffer decoding; the test pins the contract.

