Summary
Google Chat (and likely Voice) attachments don't resolve when the exported filename (export_name in messages.json) is longer than ~51 characters. Google Takeout stores the full filename in messages.json but truncates the name of the file it actually writes to disk, so our exact-name lookup (dir.join(export_name)) misses and the attachment is recorded as missing.
Surfaced while fixing #62; visible in the manual-e2e golden as chat_attachments.last_error = "attachment file missing on disk".
Evidence (real export)
In Google Chat/Groups/DM 2ZwEI8AAAAE/:
messages.json export_name : File-84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_all_148157.jpeg (59 chars)
messages.json original_name: 84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_all_148157.jpeg
actual file on disk : File-84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_al.jpeg (51 chars)
The on-disk name is the JSON name truncated to 51 chars total, preserving the .jpeg extension (46-char stem + 5-char ext). The disk stem is a strict prefix of the JSON stem.
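The length arithmetic above can be checked mechanically. This is only a sketch: `name_lengths` is a hypothetical helper, not something in the codebase, and it counts bytes (which equal chars for this all-ASCII sample).

```rust
use std::path::Path;

/// Hypothetical helper: returns (total length, stem length, extension length
/// including the leading dot), all measured in bytes.
fn name_lengths(name: &str) -> (usize, usize, usize) {
    let path = Path::new(name);
    let stem = path.file_stem().map_or(0, |s| s.len());
    // `extension()` excludes the dot, so add 1 to match the "5-char ext" count.
    let ext = path.extension().map_or(0, |e| e.len() + 1);
    (name.len(), stem, ext)
}
```

For the two names in the evidence, this gives (59, 54, 5) for the JSON export_name and (51, 46, 5) for the file on disk.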
Where it breaks
frankweiler/backend/etl/providers/google_takeout/src/extract/google_chat.rs — let attach_path = dir.join(&export_name); std::fs::read(&attach_path). Google Voice has the analogous voice_attachment_missing path.
It's fixable — moderately fiddly, not hard
The truncation is deterministic-ish (a length cap that preserves the extension), so a fallback resolver is workable:
- Try the exact dir.join(export_name) (fast path, unchanged).
- On miss, scan the message's dir for a file F where F.extension == export_name.extension and export_stem.starts_with(F.stem) (i.e. the on-disk truncated stem is a prefix of the full JSON stem).
- If exactly one candidate → use it. If zero → record missing as today. If >1 → log ambiguity and skip (don't guess).
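The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the actual google_chat.rs code: `Resolved`, `is_truncation_of`, and `resolve_attachment` are hypothetical names, and `OsStr::as_encoded_bytes` requires Rust 1.74 or newer.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Outcome of resolving one attachment (hypothetical type, not from the codebase).
#[derive(Debug, PartialEq)]
pub enum Resolved {
    Found(PathBuf),
    Missing,
    Ambiguous(Vec<PathBuf>),
}

/// True if `disk` could be a truncated form of `export`: same extension, and
/// the on-disk stem is a non-empty byte prefix of the exported stem.
fn is_truncation_of(disk: &Path, export: &Path) -> bool {
    let (Some(disk_stem), Some(export_stem)) = (disk.file_stem(), export.file_stem()) else {
        return false;
    };
    let disk_stem = disk_stem.as_encoded_bytes();
    disk.extension() == export.extension()
        && !disk_stem.is_empty()
        && export_stem.as_encoded_bytes().starts_with(disk_stem)
}

/// Exact lookup first, then a prefix scan of `dir`; never guesses on ambiguity.
pub fn resolve_attachment(dir: &Path, export_name: &str) -> io::Result<Resolved> {
    // Fast path: the name from messages.json exists as-is.
    let exact = dir.join(export_name);
    if exact.is_file() {
        return Ok(Resolved::Found(exact));
    }

    // Fallback: collect every file whose stem is a prefix of the JSON stem.
    let export = Path::new(export_name);
    let mut candidates: Vec<PathBuf> = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && is_truncation_of(&path, export) {
            candidates.push(path);
        }
    }
    candidates.sort(); // deterministic order for logging

    Ok(match candidates.len() {
        0 => Resolved::Missing,
        1 => Resolved::Found(candidates.remove(0)),
        _ => Resolved::Ambiguous(candidates),
    })
}
```

A caller would read the file on `Found`, keep today's "attachment file missing on disk" error on `Missing`, and log the candidate list on `Ambiguous`. Comparing encoded bytes rather than chars means the check does not depend on whether the cap is a byte or a char budget.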
Notes / sharp edges:
- Don't hardcode 51. The cap may differ by export/OS/zip tool; prefix-matching avoids depending on the exact number.
- Collisions: two attachments in one dir whose names share a prefix longer than the truncated stem would be ambiguous. Rare (few attachments per message), and the unique-match guard handles it safely.
- Unicode/byte vs char length: the cap may be a byte budget rather than a character count (the all-ASCII sample above can't distinguish the two), so non-ASCII names need byte-aware prefix logic.
- Could also cross-check against original_name (disk == ("File-" + original_name) truncated), giving a second matching signal.
- Needs a test fixture with a deliberately long+truncated attachment name (the current TNG fixture uses a short name that resolves exactly).
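The original_name cross-check from the notes above might look like the sketch below. It assumes the "File-" prefix seen in the one sample export holds in general, which is unverified, and `matches_original_name` is a hypothetical helper. It compares encoded bytes (`OsStr::as_encoded_bytes`, Rust 1.74+) so a byte-based cap is handled too.

```rust
use std::path::Path;

/// Hypothetical second signal: does `disk_name` look like a truncation of
/// "File-" + `original_name`? Assumes the "File-" prefix seen in one sample.
fn matches_original_name(disk_name: &str, original_name: &str) -> bool {
    let expected_name = format!("File-{original_name}");
    let disk = Path::new(disk_name);
    let expected = Path::new(&expected_name);
    match (disk.file_stem(), expected.file_stem()) {
        (Some(disk_stem), Some(expected_stem)) => {
            // Same extension, and the disk stem is a byte prefix of the expected stem.
            disk.extension() == expected.extension()
                && expected_stem
                    .as_encoded_bytes()
                    .starts_with(disk_stem.as_encoded_bytes())
        }
        _ => false,
    }
}
```

A candidate that passes both the export_name prefix match and this check is a stronger match than one that passes only the first.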
Scope
- Google Chat attachments (confirmed).
- Audit Google Voice (google_voice/mod.rs) for the same truncation; very likely affected.
Not urgent — attachments degrade to "missing" rather than corrupting data — but it silently drops real media for any longish filename.