Commit c0ae879

fix(share): gzip media snapshots
1 parent 05d209c commit c0ae879

11 files changed

Lines changed: 258 additions & 27 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@
 
 ### Changes
 
+- Media-enabled `discrawl publish` now migrates shared attachment media to gzip-compressed files while still importing older raw-media snapshots.
+
 ### Fixes
 
 ## 0.8.0 - 2026-05-15

README.md

Lines changed: 1 addition & 1 deletion
@@ -530,7 +530,7 @@ Once `share.remote` is configured, read commands auto-fetch and import when the
 
 Hybrid mode is supported too: keep normal Discord credentials configured and set `share.remote`. `discrawl sync --update=auto` and `discrawl messages --sync` import the Git snapshot first, usually as a changed-shard delta, then use live Discord for latest-message deltas. Use `sync --all-channels` or `sync --full` when you intentionally want a broader live repair/backfill pass.
 
-Git snapshots publish non-DM archive tables and cached non-DM attachment media by default. DMs, desktop wiretap rows, DM media, and local secrets are never exported. Use `publish --no-media` to omit cached media files.
+Git snapshots publish non-DM archive tables and cached non-DM attachment media by default. Cached media is written as gzip-compressed files under `media/` and restored to raw local cache files on import. Older snapshots that contain raw media files still import, and the next media-enabled `publish` rewrites the media tree into gzip form. DMs, desktop wiretap rows, DM media, and local secrets are never exported. Use `publish --no-media` to omit cached media files.
 Subscribers can use `subscribe --no-media` or `update --no-media` to import only SQLite rows and skip restoring cached files.
 
 Media backup is a two-step publisher workflow: first fetch bytes with `discrawl sync --with-media` or `discrawl attachments fetch`, then publish with `discrawl publish --push`. Scheduled publishers that should include media can set `sync.attachment_media = true` and leave `share.media = true`, which is the default. `publish` never downloads missing Discord files by itself; it exports only media already present in the local cache.
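
The README change above describes media being gzip-compressed on publish and restored to raw bytes on import. As a minimal in-memory sketch of that round trip (not the project's implementation; `gzipBytes` and `gunzipBytes` are hypothetical names, and the best-compression level is taken from the `internal/share/share.go` hunk in this commit):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes (hypothetical helper) compresses raw at the best-compression level.
func gzipBytes(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	// Close flushes pending data and writes the gzip trailer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipBytes (hypothetical helper) restores the original bytes.
func gunzipBytes(packed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(packed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	raw := bytes.Repeat([]byte("attachment bytes "), 64)
	packed, err := gzipBytes(raw)
	if err != nil {
		panic(err)
	}
	restored, err := gunzipBytes(packed)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip preserved bytes:", bytes.Equal(raw, restored))
	fmt.Println("sizes (raw, gzip):", len(raw), len(packed))
}
```

Already-compressed formats such as JPEG or MP4 usually shrink very little under gzip, so the saving depends on what the cache holds.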

docs/README.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ Mirror Discord guilds into local SQLite. Search server history without depending
 - discovers every guild a bot can access and syncs channels, threads, members, and message history into SQLite
 - maintains FTS5 indexes for fast literal search; optional embeddings for semantic and hybrid recall
 - imports classifiable Discord Desktop cache messages with `wiretap`, including proven DMs under `@me`
-- downloads attachment media into the local cache and includes cached non-DM media in Git snapshots
+- downloads attachment media into the local cache and includes cached non-DM media as gzip-compressed Git snapshot files
 - tails the Gateway for live updates with periodic repair sweeps
 - publishes the archive as sharded NDJSON snapshots in a private Git repo so readers can search offline with no Discord credentials
 - exposes read-only SQL, channel/member directories, mention queries, digests, and trend analytics

docs/commands/attachments.md

Lines changed: 2 additions & 2 deletions
@@ -38,8 +38,8 @@ discrawl --json attachments --missing --all
 - media bytes are stored under `cache_dir/media`, not in SQLite
 - SQLite stores attachment metadata, content hash, cached media path, fetch status, and errors
 - Discord CDN URLs can expire or be removed; those fetches are recorded as failed with their HTTP status, commonly `404`
-- `attachments fetch` only populates the local cache; run `publish --push` afterward to copy cached non-DM media into the Git snapshot repo
-- `publish` backs up cached non-DM media files by default; use `publish --no-media` to omit them
+- `attachments fetch` only populates the local cache; run `publish --push` afterward to copy cached non-DM media into the Git snapshot repo as gzip-compressed files
+- `publish` backs up cached non-DM media files by default and migrates older raw snapshot media to gzip form; use `publish --no-media` to omit them
 - `@me` DM media is local-only and is not published to Git snapshots
 
 ## See also

docs/commands/publish.md

Lines changed: 5 additions & 1 deletion
@@ -60,7 +60,7 @@ README files without Discrawl report markers are left alone.
 ## What is published
 
 - non-DM archive tables (DM `@me` rows are always excluded)
-- cached non-DM attachment media files under `media/` unless `--no-media` is used
+- cached non-DM attachment media files under `media/` as gzip-compressed files unless `--no-media` is used
 - when filters are enabled: only matching guilds, channels, messages, events,
 attachments, mentions, channel-scoped sync-state rows, member rows referenced
 by matching messages, and matching embedding rows
@@ -83,6 +83,10 @@ fetch` before publishing when the Git snapshot should include newly discovered
 media. Scheduled publishers can set `sync.attachment_media = true` and leave
 `share.media = true`, the default.
 
+Media snapshots are gzip-only on publish. Legacy snapshots that stored raw
+`media/...` files remain importable for backward compatibility, and the next
+media publish rewrites the media tree as `media/...gz`.
+
 ## See also
 
 - [Git snapshots guide](../guides/git-snapshots.html)

docs/commands/sync.md

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ discrawl sync --with-media
 
 - `--latest-only` is the default for untargeted `sync`. Use `--all-channels` to opt out without doing a full historical crawl.
 - `--with-media` records expired or removed Discord CDN URLs as failed fetches with the HTTP status, commonly `404`.
-- `--with-media` updates the local cache only; run `publish --push` afterward to include cached non-DM media in the Git backup.
+- `--with-media` updates the local cache only; run `publish --push` afterward to include cached non-DM media in the Git backup as gzip-compressed files.
 - `--since` does not mark older history as complete, so a later `sync --full` without `--since` can continue the backfill.
 - Long runs emit periodic progress logs to stderr.
 - Heartbeat logs (`message sync waiting`) name the oldest active channel and per-channel page activity if in-flight channels stop completing for a while.

docs/configuration.md

Lines changed: 1 addition & 1 deletion
@@ -122,7 +122,7 @@ Set `discord.token_source = "keyring"` if you want to require keyring lookup and
 - changing `[search.embeddings]` provider/model/input version retargets pending jobs and resets prior attempts; existing vectors for another identity remain in SQLite but are not used for semantic search
 - changing `db_path` does not migrate existing data; copy the file yourself if you want to keep history
 - `sync.attachment_media = true` makes `sync` behave like `sync --with-media`; media bytes are cached under `cache_dir/media`, and CDN `404`/other fetch failures are recorded on attachment rows
-- `share.media = false` makes publish/update/auto-update omit or skip restoring cached media; `subscribe --no-media` writes this for Git-only readers. With the default `share.media = true`, publish/update include cached non-DM media, but publish does not fetch missing Discord files by itself.
+- `share.media = false` makes publish/update/auto-update omit or skip restoring cached media; `subscribe --no-media` writes this for Git-only readers. With the default `share.media = true`, publish/update include cached non-DM media as gzip-compressed snapshot files, but publish does not fetch missing Discord files by itself.
 - `[share.filter]` narrows only `publish` output; sync can still keep a richer local archive
 - `share.filter.public_only` exports only channels visible to the guild
 `@everyone` role after category/channel permission overwrites; private

docs/guides/git-snapshots.md

Lines changed: 10 additions & 6 deletions
@@ -80,8 +80,8 @@ discrawl sync --full # historical backfill
 ## What is published
 
 - non-DM archive tables (DM `@me` rows are always excluded)
-- cached non-DM attachment media by default; use `publish --no-media` to omit
-files that are already in `cache_dir/media`
+- cached non-DM attachment media as gzip-compressed files by default; use
+`publish --no-media` to omit files that are already in `cache_dir/media`
 - with publish filters: only matching channel-scoped rows, matching embedding
 rows, and member rows referenced by matching messages
 - with publish filters: no share manifest state and no guild-level member
@@ -101,10 +101,14 @@ discrawl publish --push
 
 `sync --with-media` and `attachments fetch` download Discord attachment bytes
 into `cache_dir/media`. `publish --push` then exports cached non-DM media into
-the Git snapshot repo. `publish` does not fetch missing Discord files itself,
-so scheduled Git backups that should include media must fetch media before
-publishing. Set `sync.attachment_media = true` for scheduled sync jobs and leave
-`share.media = true` to include cached media in publish/update flows.
+the Git snapshot repo as gzip-compressed `media/...gz` files. Imports restore
+those files back into the raw local cache layout. Older snapshots that contain
+raw `media/...` files still import; the next media publish clears the legacy
+media tree and rewrites it in gzip form. `publish` does not fetch missing
+Discord files itself, so scheduled Git backups that should include media must
+fetch media before publishing. Set `sync.attachment_media = true` for scheduled
+sync jobs and leave `share.media = true` to include cached media in
+publish/update flows.
 
 Discord CDN URLs can expire or be removed. Those fetches are stored as failed
 with their HTTP status, commonly `404`; this does not block publishing files

docs/security.md

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ Attachment binaries are not stored in SQLite. Only attachment metadata, optional
 
 Set `sync.attachment_text = false` if you want to keep attachment metadata and filenames but disable attachment body fetches for text indexing.
 
-Git snapshots include cached non-DM media files by default. Use `publish --no-media` to omit them. `publish` exports only files already in the local cache; it does not fetch missing Discord media. DM media under `@me` stays local-only.
+Git snapshots include cached non-DM media files by default. Use `publish --no-media` to omit them. `publish` exports only files already in the local cache; it does not fetch missing Discord media. Published media is gzip-compressed under `media/`, while import still accepts older raw-media snapshots for backward compatibility. DM media under `@me` stays local-only.
 
 ## What is sent over the wire
 

internal/share/share.go

Lines changed: 163 additions & 13 deletions
@@ -883,8 +883,8 @@ func exportMedia(ctx context.Context, db *sql.DB, opts Options, filter *snapshot
 	if strings.TrimSpace(opts.CacheDir) == "" {
 		return nil, nil
 	}
-	if err := os.RemoveAll(filepath.Join(opts.RepoPath, "media")); err != nil {
-		return nil, fmt.Errorf("reset media dir: %w", err)
+	if err := resetCompressedMediaExport(opts.RepoPath); err != nil {
+		return nil, err
 	}
 	rows, err := db.QueryContext(ctx, `
 		select attachment_id, message_id, guild_id, channel_id, coalesce(media_path, ''), coalesce(content_sha256, '')
@@ -934,22 +934,30 @@ func exportMedia(ctx context.Context, db *sql.DB, opts Options, filter *snapshot
 		}
 		manifest.Attachments++
 		seen[mediaPath] = struct{}{}
-		rel := filepath.ToSlash(filepath.Join("media", mediaPath))
-		target, err := media.RepoPath(opts.RepoPath, mediaPath)
+		rel := compressedMediaManifestPath(mediaPath)
+		target, err := compressedMediaRepoPath(opts.RepoPath, mediaPath)
 		if err != nil {
 			return nil, err
 		}
-		if err := copyFile(target, source); err != nil {
-			return nil, fmt.Errorf("copy media %s: %w", mediaPath, err)
+		if err := copyGzipFile(target, source); err != nil {
+			return nil, fmt.Errorf("compress media %s: %w", mediaPath, err)
 		}
 		hash, err := fileSHA256(target)
 		if err != nil {
 			return nil, err
 		}
-		if expectedHash != "" && hash != expectedHash {
-			return nil, fmt.Errorf("media hash mismatch for %s: got %s want %s", mediaPath, hash, expectedHash)
+		sourceHash, err := fileSHA256(source)
+		if err != nil {
+			return nil, err
+		}
+		if expectedHash != "" && sourceHash != expectedHash {
+			return nil, fmt.Errorf("media hash mismatch for %s: got %s want %s", mediaPath, sourceHash, expectedHash)
 		}
-		manifest.Files = append(manifest.Files, snapshot.FileManifest{Path: rel, Size: info.Size(), SHA256: hash})
+		compressedInfo, err := os.Stat(target)
+		if err != nil {
+			return nil, fmt.Errorf("stat compressed media %s: %w", rel, err)
+		}
+		manifest.Files = append(manifest.Files, snapshot.FileManifest{Path: rel, Size: compressedInfo.Size(), SHA256: hash})
 		manifest.Bytes += info.Size()
 	}
 	if err := rows.Err(); err != nil {
@@ -961,6 +969,15 @@ func exportMedia(ctx context.Context, db *sql.DB, opts Options, filter *snapshot
 	return manifest, nil
 }
 
+func resetCompressedMediaExport(repoPath string) error {
+	// A publish rewrites media from the local cache. Clearing the tree here is
+	// the forward migration from legacy raw media files to gzip-only snapshots.
+	if err := os.RemoveAll(filepath.Join(repoPath, "media")); err != nil {
+		return fmt.Errorf("reset media dir: %w", err)
+	}
+	return nil
+}
+
 func validateMediaRoots(opts Options) error {
 	if strings.TrimSpace(opts.CacheDir) == "" {
 		return nil
@@ -1030,11 +1047,11 @@ func importMedia(ctx context.Context, opts Options, manifest *MediaManifest) (in
 		if err := ctx.Err(); err != nil {
 			return copied, err
 		}
-		mediaPath, ok := strings.CutPrefix(filepath.ToSlash(file.Path), "media/")
+		mediaPath, compressed, ok := mediaPathFromManifest(file.Path)
 		if !ok || strings.TrimSpace(mediaPath) == "" {
 			return copied, fmt.Errorf("invalid media manifest path %q", file.Path)
 		}
-		source, err := media.RepoPath(opts.RepoPath, mediaPath)
+		source, err := mediaSourcePath(opts.RepoPath, mediaPath, compressed)
 		if err != nil {
 			return copied, err
 		}
@@ -1059,17 +1076,55 @@ func importMedia(ctx context.Context, opts Options, manifest *MediaManifest) (in
 		if err != nil {
 			return copied, err
 		}
-		if sameFileHash(target, hash) {
+		targetHash := hash
+		if compressed {
+			targetHash, err = gzipFileSHA256(source)
+			if err != nil {
+				return copied, fmt.Errorf("hash compressed media %s: %w", file.Path, err)
+			}
+		}
+		if sameFileHash(target, targetHash) {
 			continue
 		}
-		if err := copyFile(target, source); err != nil {
+		if compressed {
+			err = restoreGzipFile(target, source)
+		} else {
+			err = copyFile(target, source)
+		}
+		if err != nil {
 			return copied, fmt.Errorf("restore media %s: %w", file.Path, err)
 		}
 		copied++
 	}
 	return copied, nil
 }
 
+func compressedMediaManifestPath(mediaPath string) string {
+	return filepath.ToSlash(filepath.Join("media", mediaPath+".gz"))
+}
+
+func mediaPathFromManifest(path string) (string, bool, bool) {
+	mediaPath, ok := strings.CutPrefix(filepath.ToSlash(path), "media/")
+	if !ok {
+		return "", false, false
+	}
+	if strings.HasSuffix(mediaPath, ".gz") {
+		return strings.TrimSuffix(mediaPath, ".gz"), true, true
+	}
+	return mediaPath, false, true
+}
+
+func compressedMediaRepoPath(repoPath, mediaPath string) (string, error) {
+	return media.RepoPath(repoPath, mediaPath+".gz")
+}
+
+func mediaSourcePath(repoPath, mediaPath string, compressed bool) (string, error) {
+	if compressed {
+		return compressedMediaRepoPath(repoPath, mediaPath)
+	}
+	return media.RepoPath(repoPath, mediaPath)
+}
+
 func regularMediaFile(root, path, label string) (os.FileInfo, error) {
 	root = filepath.Clean(root)
 	path = filepath.Clean(path)
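
The path helpers in this hunk map a cache-relative media path to `media/<path>.gz` on publish and classify manifest entries as compressed or legacy raw on import. A standalone sketch of that classification follows. `classifyManifestPath` is a hypothetical stand-in rather than the project's `mediaPathFromManifest`, and unlike the hunk it rejects an empty media path itself instead of leaving that check to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyManifestPath (hypothetical) splits a manifest path into the
// cache-relative media path and a flag saying whether the snapshot file is
// gzip-compressed. ok is false for paths outside media/ or with no file part.
func classifyManifestPath(p string) (mediaPath string, compressed, ok bool) {
	rest, found := strings.CutPrefix(p, "media/")
	if !found || rest == "" {
		return "", false, false
	}
	if trimmed, isGz := strings.CutSuffix(rest, ".gz"); isGz {
		if trimmed == "" {
			return "", false, false
		}
		return trimmed, true, true
	}
	// No .gz suffix: treat as a legacy raw media file.
	return rest, false, true
}

func main() {
	for _, p := range []string{
		"media/123/photo.png.gz", // gzip-era snapshot entry
		"media/123/photo.png",    // legacy raw entry
		"shards/messages.ndjson", // not a media path
	} {
		mediaPath, compressed, ok := classifyManifestPath(p)
		fmt.Printf("%q -> path=%q compressed=%v ok=%v\n", p, mediaPath, compressed, ok)
	}
}
```

One limitation of suffix-based detection in this sketch: a legacy raw file whose own name already ends in `.gz` is indistinguishable from a compressed entry.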
@@ -1145,6 +1200,83 @@ func copyFile(target, source string) error {
 	return nil
 }
 
+func copyGzipFile(target, source string) error {
+	if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
+		return err
+	}
+	src, err := os.Open(source) // #nosec G304 -- source is constrained by media path helpers.
+	if err != nil {
+		return err
+	}
+	defer func() { _ = src.Close() }()
+	tmp, err := os.CreateTemp(filepath.Dir(target), ".copy-*")
+	if err != nil {
+		return err
+	}
+	tmpPath := tmp.Name()
+	gz, err := gzip.NewWriterLevel(tmp, gzip.BestCompression)
+	if err != nil {
+		_ = tmp.Close()
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	if _, err := io.Copy(gz, src); err != nil {
+		_ = gz.Close()
+		_ = tmp.Close()
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	if err := gz.Close(); err != nil {
+		_ = tmp.Close()
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	if err := tmp.Close(); err != nil {
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	if err := os.Rename(tmpPath, target); err != nil {
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	return nil
+}
+
+func restoreGzipFile(target, source string) error {
+	if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
+		return err
+	}
+	src, err := os.Open(source) // #nosec G304 -- source is constrained by media path helpers.
+	if err != nil {
+		return err
+	}
+	defer func() { _ = src.Close() }()
+	gz, err := gzip.NewReader(src)
+	if err != nil {
+		return err
+	}
+	defer func() { _ = gz.Close() }()
+	tmp, err := os.CreateTemp(filepath.Dir(target), ".copy-*")
+	if err != nil {
+		return err
+	}
+	tmpPath := tmp.Name()
+	if _, err := io.Copy(tmp, gz); err != nil {
+		_ = tmp.Close()
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	if err := tmp.Close(); err != nil {
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	if err := os.Rename(tmpPath, target); err != nil {
+		_ = os.Remove(tmpPath)
+		return err
+	}
+	return nil
+}
+
 func fileSHA256(path string) (string, error) {
 	file, err := os.Open(path) // #nosec G304 -- callers pass confined repo/cache paths.
 	if err != nil {
@@ -1158,6 +1290,24 @@ func fileSHA256(path string) (string, error) {
 	return hex.EncodeToString(hasher.Sum(nil)), nil
 }
 
+func gzipFileSHA256(path string) (string, error) {
+	file, err := os.Open(path) // #nosec G304 -- callers pass confined repo/cache paths.
+	if err != nil {
+		return "", err
+	}
+	defer func() { _ = file.Close() }()
+	gz, err := gzip.NewReader(file)
+	if err != nil {
+		return "", err
+	}
+	defer func() { _ = gz.Close() }()
+	hasher := sha256.New()
+	if _, err := io.Copy(hasher, gz); err != nil {
+		return "", err
+	}
+	return hex.EncodeToString(hasher.Sum(nil)), nil
+}
+
 func sameFileHash(path, hash string) bool {
 	current, err := fileSHA256(path)
 	return err == nil && current == hash
