Google Takeout attachments with long filenames don't resolve (Takeout truncates the on-disk name) #64

Description

@thad-imbue

Summary

Google Chat (and likely Voice) attachments don't resolve when the filename recorded in messages.json is longer than ~51 characters. Google Takeout stores the full filename in messages.json but truncates the name of the file it actually writes to disk, so our exact-name lookup (dir.join(export_name)) misses and the attachment is recorded as missing.

Surfaced while fixing #62; visible in the manual-e2e golden as chat_attachments.last_error = "attachment file missing on disk".

Evidence (real export)

In Google Chat/Groups/DM 2ZwEI8AAAAE/:

messages.json export_name : File-84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_all_148157.jpeg   (59 chars)
messages.json original_name:     84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_all_148157.jpeg
actual file on disk        : File-84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_al.jpeg          (51 chars)

The on-disk name is the JSON name truncated to 51 chars total, preserving the .jpeg extension (46-char stem + 5-char ext). The disk stem is a strict prefix of the JSON stem.
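As a sanity check on the arithmetic above, here is a minimal sketch that models the observed truncation. The function name truncate_keep_ext and the cap parameter are hypothetical. It reproduces this one sample and is not a claim about Takeout's actual algorithm.

```rust
/// Hypothetical model of the observed truncation: cap the whole name at
/// `cap` bytes while keeping the extension (including its dot) intact.
/// This only mirrors the single sample above; Takeout's real rule is unknown.
fn truncate_keep_ext(name: &str, cap: usize) -> String {
    if name.len() <= cap {
        return name.to_string();
    }
    // Split at the last dot; a name with no dot (or only a leading dot)
    // is treated as having an empty extension.
    let (stem, ext) = match name.rfind('.') {
        Some(i) if i > 0 => name.split_at(i),
        _ => (name, ""),
    };
    // Byte budget left for the stem once the extension is reserved.
    let mut keep = cap.saturating_sub(ext.len()).min(stem.len());
    // Never cut inside a multi-byte UTF-8 character.
    while !stem.is_char_boundary(keep) {
        keep -= 1;
    }
    format!("{}{}", &stem[..keep], ext)
}
```

With cap = 51, the 59-char export_name from the sample yields the 51-char on-disk name (46-byte stem plus ".jpeg"). Names already within the cap pass through unchanged.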

Where it breaks

In frankweiler/backend/etl/providers/google_takeout/src/extract/google_chat.rs: let attach_path = dir.join(&export_name); std::fs::read(&attach_path). Google Voice has the analogous voice_attachment_missing path.

It's fixable — moderately fiddly, not hard

The truncation is deterministic-ish (a length cap that preserves the extension), so a fallback resolver is workable:

  1. Try the exact dir.join(export_name) (fast path, unchanged).
  2. On miss, scan the message's dir for a file F where F.extension == export_name.extension and export_stem.starts_with(F.stem) (i.e. the on-disk truncated stem is a prefix of the full JSON stem).
  3. If exactly one candidate → use it. If zero → record missing as today. If >1 → log ambiguity and skip (don't guess).
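The three steps above can be sketched roughly as follows. The Resolved enum and the resolve_attachment function are hypothetical names used for illustration, not existing code in google_chat.rs. The sketch assumes the attachment's directory is readable and that names are valid UTF-8.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Outcome of a lookup (hypothetical type, not taken from the codebase).
#[derive(Debug, PartialEq)]
enum Resolved {
    Found(PathBuf),
    Missing,
    Ambiguous(Vec<PathBuf>),
}

/// Resolve `export_name` inside `dir`, tolerating a truncated on-disk stem.
fn resolve_attachment(dir: &Path, export_name: &str) -> io::Result<Resolved> {
    // Step 1: exact-name fast path, same lookup as today.
    let exact = dir.join(export_name);
    if exact.is_file() {
        return Ok(Resolved::Found(exact));
    }

    // Step 2: scan for files with the same extension whose stem is a
    // prefix of the full stem recorded in messages.json.
    let export = Path::new(export_name);
    let export_ext = export.extension();
    let export_stem = match export.file_stem().and_then(|s| s.to_str()) {
        Some(s) => s,
        None => return Ok(Resolved::Missing),
    };
    let mut candidates = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if !path.is_file() || path.extension() != export_ext {
            continue;
        }
        if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) {
            if !stem.is_empty() && export_stem.starts_with(stem) {
                candidates.push(path);
            }
        }
    }

    // Step 3: accept only a unique match; never guess between several.
    Ok(match candidates.len() {
        0 => Resolved::Missing,
        1 => Resolved::Found(candidates.remove(0)),
        _ => {
            candidates.sort();
            Resolved::Ambiguous(candidates)
        }
    })
}
```

A caller would map Missing to the existing "attachment file missing on disk" error and log the candidate list for Ambiguous. An unreadable directory surfaces here as an io::Error, which the caller could also treat as missing.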

Notes / sharp edges:

  • Don't hardcode 51. The cap may differ by export/OS/zip tool; prefix-matching avoids depending on the exact number.
  • Collisions: two attachments in one dir whose full names share a prefix longer than the truncated stem would be ambiguous. Rare (few attachments per message), but the unique-match guard handles it safely.
  • Unicode/byte vs char length: the cap looks like a byte budget; non-ASCII names need byte-aware prefix logic.
  • Could also cross-check against original_name (disk == ("File-" + original_name) truncated), giving a second matching signal.
  • Needs a test fixture with a deliberately long+truncated attachment name (the current TNG fixture uses a short name that resolves exactly).
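The original_name cross-check from the notes above might look like the sketch below. The helper names split_ext and consistent_with_original are hypothetical, and the "File-" prefix is assumed from the single sample in this issue; other attachment types may use a different prefix.

```rust
/// Split "stem.ext" at the last dot; returns (stem, ext-without-dot).
/// A name with no dot, or only a leading dot, has an empty extension.
fn split_ext(name: &str) -> (&str, &str) {
    match name.rfind('.') {
        Some(i) if i > 0 => (&name[..i], &name[i + 1..]),
        _ => (name, ""),
    }
}

/// Second matching signal (assumption based on one sample): the on-disk
/// name should be a truncation of "File-" + original_name, i.e. same
/// extension and a stem that is a prefix of the expected stem.
fn consistent_with_original(disk_name: &str, original_name: &str) -> bool {
    let expected = format!("File-{}", original_name);
    let (disk_stem, disk_ext) = split_ext(disk_name);
    let (expected_stem, expected_ext) = split_ext(&expected);
    // str::starts_with compares UTF-8 bytes, so this is a byte-wise prefix test.
    disk_ext == expected_ext && !disk_stem.is_empty() && expected_stem.starts_with(disk_stem)
}
```

This could be used to confirm a candidate found by the directory scan before accepting it, or to break a tie when the scan alone is ambiguous.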

Scope

  • Google Chat attachments (confirmed).
  • Audit Google Voice (google_voice/mod.rs) for the same truncation; very likely affected.

Not urgent — attachments degrade to "missing" rather than corrupting data — but it silently drops real media for any longish filename.
