Commit 8fe502d

Merge pull request #132 from link-assistant/issue-112-53d54c67d8bb
Pin default image-mode contract: `-f markdown` keeps direct links; flags work uniformly
2 parents 84a7168 + 5920582 commit 8fe502d

12 files changed

Lines changed: 733 additions & 161 deletions

README.md

Lines changed: 19 additions & 2 deletions
@@ -81,16 +81,32 @@ web-capture --serve --port 8080
 | `--output` | `-o` | Output file path. Use `-o -` for stdout | auto-derived from URL |
 | `--data-dir` | | Base directory for auto-derived output paths | `./data/web-capture` |
 | `--engine` | `-e` | Browser engine (JS only): `puppeteer`, `playwright` | `puppeteer` |
-| `--embed-images` | | Keep images as inline base64 data URIs | `false` |
+| `--embed-images` | | Keep images inline as base64 data URIs (self-contained file) | `false` |
 | `--no-extract-images` | | Alias for `--embed-images` | `false` |
-| `--keep-original-links` | | Keep original remote image URLs, strip base64 | `false` |
+| `--extract-images[=DIR]` | | Extract images to `DIR/images/` (or next to the output) and download remote images | - |
+| `--keep-original-links` | | Keep remote image URLs as direct links (the default markdown behavior) | `false` |
 | `--images-dir` | | Subdirectory name for extracted images | `images` |
 | `--archive` | | Create archive: `zip` (default), `7z`, `tar.gz`, `tar` | - |
 | `--extract-latex` | | Extract LaTeX formulas | `true` |
 | `--extract-metadata` | | Extract article metadata | `true` |
 | `--post-process` | | Apply post-processing | `true` |
 | `--detect-code-language` | | Detect code block languages | `true` |
 
+## Image Handling
+
+Markdown output supports three image modes, and every capture path (browser or
+API, CLI or server) routes through the same chokepoint so a flag behaves
+identically regardless of how the page was captured:
+
+| Mode | Flag | Result |
+| --------------------------- | --------------------- | ------------------------------------------------------------------------------------------ |
+| **Direct links** (default) | _none_ / `--keep-original-links` | Remote images stay as direct `https://…` URLs. Inline base64 (which has no remote URL to restore) is stripped to a placeholder with a warning — never silently kept as a multi-megabyte blob. No `images/` folder. |
+| **Embed** | `--embed-images` | Base64 images are kept inline, producing a single self-contained file. |
+| **Extract** | `--extract-images[=DIR]` | Inline base64 _and_ remote images are written to `DIR/images/` (defaults to next to the output file) and the markdown is rewritten to reference the local files. |
+
+The `--archive` formats always bundle images into the archive's `images/`
+folder regardless of these flags.
+
 ## Environment Variables
 
 All flags can be controlled via environment variables:
@@ -99,6 +115,7 @@ All flags can be controlled via environment variables:
 | ---------------------------------- | ----------------------------------- | -------------------- |
 | `WEB_CAPTURE_DATA_DIR` | Base directory for output | `./data/web-capture` |
 | `WEB_CAPTURE_EMBED_IMAGES` | `0`/`1` — keep images inline | `0` |
+| `WEB_CAPTURE_EXTRACT_IMAGES` | Directory to extract images into | - |
 | `WEB_CAPTURE_KEEP_ORIGINAL_LINKS` | `0`/`1` — keep original remote URLs | `0` |
 | `WEB_CAPTURE_IMAGES_DIR` | Subdirectory for extracted images | `images` |
 | `WEB_CAPTURE_EXTRACT_LATEX` | `0`/`1` — extract LaTeX | `1` |
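
The image-related rows of the environment variable table in this hunk can be read as a small configuration mapping. The following is a minimal sketch of that mapping, not the project's actual parser: the names `Env`, `ImageEnvOptions`, `envFlag`, and `readImageEnv` are hypothetical, and the handling of unset, empty, or unexpected values is an assumption, since the diff only documents `0`/`1` and the listed defaults.

```typescript
// Hypothetical shape of the environment; in Node this could be process.env.
type Env = Record<string, string | undefined>;

// Hypothetical options object; field names are illustrative only.
interface ImageEnvOptions {
  embedImages: boolean;
  keepOriginalLinks: boolean;
  extractImages: string | undefined; // directory to extract into, if set
  imagesDir: string;
}

// "1" means true, any other non-empty value means false,
// and an unset or empty variable falls back to the documented default.
function envFlag(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  return raw === "1";
}

function readImageEnv(env: Env): ImageEnvOptions {
  const extract = env["WEB_CAPTURE_EXTRACT_IMAGES"];
  return {
    embedImages: envFlag(env, "WEB_CAPTURE_EMBED_IMAGES", false),
    keepOriginalLinks: envFlag(env, "WEB_CAPTURE_KEEP_ORIGINAL_LINKS", false),
    // Assumption: an empty string is treated the same as "not set".
    extractImages: extract === undefined || extract === "" ? undefined : extract,
    imagesDir: env["WEB_CAPTURE_IMAGES_DIR"] || "images",
  };
}
```

Calling `readImageEnv(process.env)` in Node would produce the options for the current shell. How the real tool resolves a conflict between a CLI flag and an environment variable is not shown in this commit, so the sketch does not model precedence.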
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+'@link-assistant/web-capture': minor
+---
+
+Pin the default image-mode contract and route every capture path through a single image-handling chokepoint (`applyImageMode`), so the same flag behaves identically regardless of capture method (browser vs API, CLI vs server) — issue #112. Default `--format markdown` now references images by their direct remote URL (no `images/` folder, no inline base64); inline base64 (which has no remote URL to restore) is stripped to a visible placeholder with a warning instead of being silently kept as a multi-megabyte blob. `--embed-images` keeps base64 inline for a self-contained file. The new `--extract-images[=DIR]` flag extracts inline base64 **and** downloads remote images into `DIR/images/`, rewriting the markdown to reference the local files; on download failure the original remote URL is restored so references never break. `--keep-original-links` remains a back-compat alias for the default behavior.
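
To make the three-mode contract in the changeset above concrete, here is a minimal sketch of what a single image-handling chokepoint could look like. It is not the code from this commit: the changeset names `applyImageMode` but does not show its signature, so `applyImageModeSketch`, the `ImageMode` and `ImageModeResult` types, the placeholder text, and the `image-N.ext` file naming are all assumptions made for illustration. Downloading remote images in extract mode is left out, so remote URLs simply pass through.

```typescript
type ImageMode = "direct" | "embed" | "extract";

interface ImageModeResult {
  markdown: string;
  warnings: string[];
  // Files an extract run would write; no disk I/O happens in this sketch.
  files: { path: string; base64: string }[];
}

// Matches markdown images of the form ![alt](url) without a title part.
const IMAGE_PATTERN = /!\[([^\]]*)\]\(([^)\s]+)\)/g;
const DATA_URI_PATTERN = /^data:image\/([a-z0-9.+-]+);base64,(.+)$/i;

function applyImageModeSketch(
  markdown: string,
  mode: ImageMode,
  imagesDir: string = "images",
): ImageModeResult {
  const warnings: string[] = [];
  const files: { path: string; base64: string }[] = [];

  // Embed: keep everything inline, including base64 data URIs.
  if (mode === "embed") return { markdown, warnings, files };

  const rewritten = markdown.replace(
    IMAGE_PATTERN,
    (match: string, alt: string, url: string): string => {
      const dataUri = DATA_URI_PATTERN.exec(url);
      // Remote URL: stays a direct link (a real extract mode would download it).
      if (!dataUri) return match;

      if (mode === "direct") {
        // Inline base64 has no remote URL to restore, so strip it visibly.
        warnings.push(`Stripped inline base64 image: ${alt || "untitled"}`);
        return `[inline image removed: ${alt || "untitled"}]`;
      }

      // Extract: record the payload as a local file and point the markdown at it.
      const [, subtype = "png", payload = ""] = dataUri;
      const lower = subtype.toLowerCase();
      const ext = lower === "jpeg" ? "jpg" : lower.replace(/\+.*$/, "");
      const path = `${imagesDir}/image-${files.length + 1}.${ext}`;
      files.push({ path, base64: payload });
      return `![${alt}](${path})`;
    },
  );

  return { markdown: rewritten, warnings, files };
}
```

Because every caller would go through the one function, a CLI run and a server request that pass the same mode get the same markdown back, which is the property the changeset is pinning.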

js/README.md

Lines changed: 35 additions & 28 deletions
@@ -73,11 +73,17 @@ web-capture https://example.com --archive
 web-capture https://example.com --archive zip -o site.zip
 web-capture https://example.com --archive tar.gz -o site.tar.gz
 
-# Keep images inline as base64 (opt-in)
+# Default markdown keeps remote image URLs as direct links
+web-capture https://example.com -o page.md
+
+# Keep images inline as base64 (self-contained file, opt-in)
 web-capture https://example.com --embed-images -o page.md
 
+# Extract images (inline base64 + remote) to a local images/ folder
+web-capture https://example.com --extract-images -o page.md
+
 # Custom images directory
-web-capture https://example.com --images-dir assets -o page.md
+web-capture https://example.com --extract-images --images-dir assets -o page.md
 
 # Disable specific features
 web-capture https://example.com --no-extract-latex --no-post-process -o page.md
@@ -245,32 +251,33 @@ web-capture --serve [--port <port>]
 web-capture <url> [options]
 ```
 
-| Option | Short | Description | Default |
-| --------------------------- | ----- | ---------------------------------------------- | ----------------------------------- |
-| `--format` | `-f` | Output format (see below) | `markdown` |
-| `--output` | `-o` | Output file path. Use `-o -` for stdout | auto-derived from URL |
-| `--data-dir` | | Base directory for auto-derived output paths | `./data/web-capture` |
-| `--engine` | `-e` | Browser engine: `puppeteer`, `playwright` | `puppeteer` (or BROWSER_ENGINE env) |
-| `--theme` | `-t` | Color scheme: `light`, `dark`, `no-preference` | browser default |
-| `--width` | | Viewport width in pixels | 1280 |
-| `--height` | | Viewport height in pixels | 800 |
-| `--quality` | | JPEG quality 0-100 | 80 |
-| `--fullPage` | | Capture full scrollable page | false |
-| `--embed-images` | | Keep images as inline base64 data URIs | false |
-| `--no-extract-images` | | Alias for `--embed-images` | false |
-| `--keep-original-links` | | Keep original remote URLs, strip base64 | false |
-| `--images-dir` | | Subdirectory name for extracted images | `images` |
-| `--archive` | | Create archive: `zip`, `7z`, `tar.gz`, `tar` | - |
-| `--document-format` | | Document format in archive: `markdown`, `html` | `markdown` |
-| `--localImages` | | Download images locally in archive mode | true |
-| `--extract-latex` | | Extract LaTeX formulas | true |
-| `--no-extract-latex` | | Disable LaTeX extraction | - |
-| `--extract-metadata` | | Extract article metadata | true |
-| `--no-extract-metadata` | | Disable metadata extraction | - |
-| `--post-process` | | Apply post-processing | true |
-| `--no-post-process` | | Disable post-processing | - |
-| `--detect-code-language` | | Detect code block languages | true |
-| `--no-detect-code-language` | | Disable code language detection | - |
+| Option | Short | Description | Default |
+| --------------------------- | ----- | ------------------------------------------------- | ----------------------------------- |
+| `--format` | `-f` | Output format (see below) | `markdown` |
+| `--output` | `-o` | Output file path. Use `-o -` for stdout | auto-derived from URL |
+| `--data-dir` | | Base directory for auto-derived output paths | `./data/web-capture` |
+| `--engine` | `-e` | Browser engine: `puppeteer`, `playwright` | `puppeteer` (or BROWSER_ENGINE env) |
+| `--theme` | `-t` | Color scheme: `light`, `dark`, `no-preference` | browser default |
+| `--width` | | Viewport width in pixels | 1280 |
+| `--height` | | Viewport height in pixels | 800 |
+| `--quality` | | JPEG quality 0-100 | 80 |
+| `--fullPage` | | Capture full scrollable page | false |
+| `--embed-images` | | Keep images inline as base64 (self-contained) | false |
+| `--no-extract-images` | | Alias for `--embed-images` | false |
+| `--extract-images[=DIR]` | | Extract images to `DIR/images/` + download remote | - |
+| `--keep-original-links` | | Keep remote URLs as direct links (the default) | false |
+| `--images-dir` | | Subdirectory name for extracted images | `images` |
+| `--archive` | | Create archive: `zip`, `7z`, `tar.gz`, `tar` | - |
+| `--document-format` | | Document format in archive: `markdown`, `html` | `markdown` |
+| `--localImages` | | Download images locally in archive mode | true |
+| `--extract-latex` | | Extract LaTeX formulas | true |
+| `--no-extract-latex` | | Disable LaTeX extraction | - |
+| `--extract-metadata` | | Extract article metadata | true |
+| `--no-extract-metadata` | | Disable metadata extraction | - |
+| `--post-process` | | Apply post-processing | true |
+| `--no-post-process` | | Disable post-processing | - |
+| `--detect-code-language` | | Detect code block languages | true |
+| `--no-detect-code-language` | | Disable code language detection | - |
 
 **Supported formats:**
 