Commit 4dfb7f9

CorentinB and NGTmeaty authored
Add mend CLI command (#145)
* cli: add mend command
* cli/mend: update doc
* cli/mend: handle errors from flag retrieval
* cli/mend: add test for force mode on closed WARC files
* cmd/mend: refactor tests to use functions instead of binary and reorganize testdata
* docs: add releases link to CLI installation section
* cli/mend: auto-rename files when no truncation needed
* cli/mend: rename totalBytesSaved to totalBytesTruncated
* cli/verify: treat corrupted WARC version as error and clean up validation flow
* cli/mend: verify mending output in tests
* cli/utils: flip ShouldSkipRecord checks
* cli/mend: prompt for deletion of empty and unused WARC files
* gitignore: allow tracking test WARC files in testdata/warcs/
* cli: rename binary from cmd to warc

  Restructure cmd/ to cmd/warc/ to fix go install installing the binary as 'cmd' instead of 'warc'. Update all documentation and examples to reflect the new binary name.

* fix: imports
* fix: import

---------

Co-authored-by: Jake L <NGTmeaty@users.noreply.github.com>
1 parent 401e95d commit 4dfb7f9
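
The rename described in the commit message follows from how `go install` picks a default executable name: it uses the last element of the package import path. A minimal sketch of that rule, assuming no major-version suffix such as `/v2` and ignoring the `.exe` suffix on Windows (`binaryName` is a hypothetical helper written for illustration, not part of gowarc or the Go toolchain):

```go
package main

import (
	"fmt"
	"path"
)

// binaryName approximates the default executable name chosen by
// `go install`: the last element of the package import path.
// Hypothetical helper; it ignores major-version suffixes and ".exe".
func binaryName(importPath string) string {
	return path.Base(importPath)
}

func main() {
	// Old layout: package main lived directly in cmd/.
	fmt.Println(binaryName("github.com/internetarchive/gowarc/cmd"))
	// New layout: package main lives in cmd/warc/.
	fmt.Println(binaryName("github.com/internetarchive/gowarc/cmd/warc"))
}
```

With the old layout the helper yields `cmd`; with the new layout it yields `warc`, which matches the `binary_name` used in the release workflow.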

21 files changed: +1852 -391 lines changed

.github/workflows/build.yml

Lines changed: 1 addition & 1 deletion
@@ -21,6 +21,6 @@ jobs:
 compress_assets: OFF
 md5sum: FALSE
 sha256sum: TRUE
-project_path: './cmd'
+project_path: './cmd/warc'
 binary_name: 'warc'
 asset_name: 'warc-linux-amd64'

.gitignore

Lines changed: 6 additions & 2 deletions
@@ -1,6 +1,10 @@
+.DS_Store
 warcs/**
 temp/**
 output/**
-warc
+/warc
 *.warc.gz
-*.warc.zst
+*.warc.zst
+!testdata/warcs/*.warc.gz
+!testdata/warcs/*.warc.gz.open
+cmd/warc/warc

README.md

Lines changed: 103 additions & 0 deletions
@@ -116,6 +116,109 @@ func main() {
 }
 ```
 
+## CLI Tools
+
+In addition to the Go library, gowarc provides several command-line utilities for working with WARC files:
+
+### Installation
+
+Pre-built releases are available on the [GitHub releases page](https://github.com/internetarchive/gowarc/releases).
+
+```bash
+# Install from source
+go install github.com/internetarchive/gowarc/cmd/warc@latest
+
+# Or build locally
+cd cmd/warc/
+go build -o warc
+```
+
+### Available Commands
+
+#### `warc extract`
+Extract files and content from WARC archives with filtering options.
+
+```bash
+# Extract all files from WARC archives
+warc extract file1.warc.gz file2.warc.gz
+
+# Extract only specific content types
+warc extract --content-type "text/html" --content-type "image/jpeg" archive.warc.gz
+
+# Extract to specific directory with multiple threads
+warc extract --output ./extracted --threads 4 *.warc.gz
+
+# Sort extracted files by host
+warc extract --host-sort archive.warc.gz
+```
+
+#### `warc mend`
+Repair and close incomplete gzip-compressed WARC files that were left with `.open` suffix during crawling.
+
+```bash
+# Dry run to see what would be fixed
+warc mend --dry-run *.warc.gz.open
+
+# Fix files with confirmation prompts
+warc mend corrupted.warc.gz.open
+
+# Auto-fix without prompts
+warc mend --yes *.warc.gz.open
+
+# Force verification of any gzip WARC files (not just .open)
+warc mend --force --dry-run archive.warc.gz
+```
+
+**Features:**
+- By default, only processes `.open` files; use `--force` to verify any gzip WARC files
+- Verifies gzip format using magic bytes, not just file extension
+- Detects and removes trailing garbage bytes
+- Truncates at corruption points while preserving maximum valid data
+- Removes `.open` suffix to "close" files when present
+- Provides comprehensive statistics on repairs performed
+- Memory-efficient streaming for large files
+
+See [cmd/warc/mend/README.md](cmd/warc/mend/README.md) for detailed documentation.
+
+#### `warc verify`
+Validate the integrity and structure of WARC files.
+
+```bash
+# Verify single file
+warc verify archive.warc.gz
+
+# Verify multiple files with progress
+warc verify -v *.warc.gz
+
+# JSON output for automation
+warc verify --json archive.warc.gz
+```
+
+#### `warc completion`
+Generate shell completion scripts for bash, zsh, fish, or PowerShell.
+
+```bash
+# Bash completion
+warc completion bash > /etc/bash_completion.d/warc
+
+# Zsh completion
+warc completion zsh > ~/.zsh/completions/_warc
+
+# Fish completion
+warc completion fish > ~/.config/fish/completions/warc.fish
+
+# PowerShell completion
+warc completion powershell > warc.ps1
+```
+
+### Global Flags
+
+All commands support these global options:
+
+- `-v, --verbose` - Enable verbose/debug logging
+- `--json` - Output logs in JSON format for structured processing
+- `-h, --help` - Show help for any command
+
 ## Build tags
 
 - `standard_gzip`: Use the standard library gzip implementation instead of the faster one from [klauspost](https://github.com/klauspost/compress)

cmd/main.go

Lines changed: 0 additions & 53 deletions
This file was deleted.

cmd/utils.go

Lines changed: 0 additions & 21 deletions
This file was deleted.
