Google Takeout attachments with long filenames don't resolve (Takeout truncates the on-disk name)

## Summary

Google Chat (and likely Voice) attachments don't resolve when the filename recorded in `messages.json` is **longer than ~51 characters**. Google Takeout stores the *full* filename in `messages.json` but **truncates the name of the file it actually writes to disk**, so our exact-name lookup (`dir.join(export_name)`) misses and the attachment is recorded as missing.

Surfaced while fixing #62; visible in the manual-e2e golden as `chat_attachments.last_error = "attachment file missing on disk"`.

## Evidence (real export)

In `Google Chat/Groups/DM 2ZwEI8AAAAE/`:

```
messages.json export_name : File-84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_all_148157.jpeg   (59 chars)
messages.json original_name:     84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_all_148157.jpeg
actual file on disk        : File-84f97831-a19d-4d1b-89f0-dfc9cbd19ca4-1_al.jpeg          (51 chars)
```

The on-disk name is the JSON name **truncated to 51 chars total, preserving the `.jpeg` extension** (46-char stem + 5-char ext). The disk stem is a strict prefix of the JSON stem.
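The length arithmetic above can be checked mechanically. A minimal sketch, assuming a hypothetical helper `split_name` (not part of the codebase) that splits at the last dot and keeps the dot with the extension, matching the way the counts above treat `.jpeg` as 5 characters:

```rust
/// Hypothetical helper: split a file name into (stem, extension including the dot).
/// A name with no dot, or only a leading dot, is treated as all stem.
pub fn split_name(name: &str) -> (&str, &str) {
    match name.rfind('.') {
        Some(i) if i > 0 => name.split_at(i),
        _ => (name, ""),
    }
}

/// True when `disk_name` has the same extension as `export_name` and its stem
/// is a strictly shorter prefix of the export stem.
pub fn is_strict_truncation(export_name: &str, disk_name: &str) -> bool {
    let (export_stem, export_ext) = split_name(export_name);
    let (disk_stem, disk_ext) = split_name(disk_name);
    export_ext == disk_ext
        && disk_stem.len() < export_stem.len()
        && export_stem.starts_with(disk_stem)
}
```

For the two names in the evidence block, `split_name` yields a 46-byte stem plus `.jpeg` for the on-disk name, and `is_strict_truncation` holds. The lengths are bytes, which equal characters here only because the sample is pure ASCII.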

## Where it breaks

`frankweiler/backend/etl/providers/google_takeout/src/extract/google_chat.rs` — `let attach_path = dir.join(&export_name); std::fs::read(&attach_path)`. Google Voice has the analogous `voice_attachment_missing` path.

## It's fixable — moderately fiddly, not hard

The truncation is deterministic-ish (a length cap that preserves the extension), so a fallback resolver is workable:

1. Try the exact `dir.join(export_name)` (fast path, unchanged).
2. On miss, scan the message's dir for a file `F` where `F.extension == export_name.extension` **and** `export_stem.starts_with(F.stem)` (i.e. the on-disk truncated stem is a prefix of the full JSON stem).
3. If exactly one candidate → use it. If zero → record missing as today. If >1 → log ambiguity and skip (don't guess).
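
The three steps above could be sketched roughly as follows. This is a standalone illustration using only `std`, not the actual extractor code: the `Resolved` enum and the `resolve_attachment` name are hypothetical, and the sketch assumes on-disk names are valid UTF-8 (names that are not are skipped, which sidesteps but does not solve the byte-vs-char question below).

```rust
use std::ffi::OsStr;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical result type for one attachment lookup.
#[derive(Debug, PartialEq)]
pub enum Resolved {
    /// The exact `dir.join(export_name)` path exists.
    Exact(PathBuf),
    /// Exactly one on-disk file looks like a truncated form of the name.
    Truncated(PathBuf),
    /// No candidate; record as missing, as today.
    Missing,
    /// More than one candidate; log and skip rather than guess.
    Ambiguous(Vec<PathBuf>),
}

pub fn resolve_attachment(dir: &Path, export_name: &str) -> io::Result<Resolved> {
    // Step 1: fast path, unchanged behaviour.
    let exact = dir.join(export_name);
    if exact.is_file() {
        return Ok(Resolved::Exact(exact));
    }

    let export = Path::new(export_name);
    let export_ext = export.extension();
    let Some(export_stem) = export.file_stem().and_then(OsStr::to_str) else {
        return Ok(Resolved::Missing);
    };

    // Step 2: scan the directory for files with the same extension whose
    // stem is a prefix of the full stem from messages.json.
    let mut candidates = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if !path.is_file() || path.extension() != export_ext {
            continue;
        }
        // Non-UTF-8 names are skipped in this sketch.
        let Some(stem) = path.file_stem().and_then(OsStr::to_str) else {
            continue;
        };
        if !stem.is_empty() && export_stem.starts_with(stem) {
            candidates.push(path);
        }
    }
    candidates.sort(); // deterministic order for logging

    // Step 3: accept only a unique match.
    Ok(match candidates.len() {
        0 => Resolved::Missing,
        1 => Resolved::Truncated(candidates.remove(0)),
        _ => Resolved::Ambiguous(candidates),
    })
}
```

Returning an enum rather than `Option<PathBuf>` keeps the "ambiguous" case distinguishable from "missing", so the caller can log it separately. Note the sketch never mentions 51; it only relies on the prefix relationship.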

Notes / sharp edges:
- **Don't hardcode 51.** The cap may differ by export/OS/zip tool; prefix-matching avoids depending on the exact number.
- **Collisions**: two attachments in one dir whose names share a prefix at least as long as the truncated stem would be ambiguous. Rare (few attachments per message), and the unique-match guard means we skip rather than pick the wrong file.
- **Unicode/byte vs char length**: the cap looks like a byte budget; non-ASCII names need byte-aware prefix logic.
- Could also cross-check against `original_name` (`disk == ("File-" + original_name)` truncated), giving a second matching signal.
- Needs a test fixture with a deliberately long+truncated attachment name (the current TNG fixture uses a short name that resolves exactly).
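
The `original_name` cross-check from the notes could look something like the sketch below. The function name is hypothetical, and the `"File-"` prefix is an assumption taken from the single sample in the evidence section; other exports may prefix differently.

```rust
use std::path::Path;

/// Hypothetical second signal: does `disk_name` look like a truncated form of
/// `"File-" + original_name`? Same extension, and the disk stem is a prefix
/// of the expected stem. The "File-" prefix is assumed from one sample only.
pub fn matches_original_name(disk_name: &str, original_name: &str) -> bool {
    let expected_name = format!("File-{original_name}");
    let expected = Path::new(&expected_name);
    let disk = Path::new(disk_name);

    if disk.extension() != expected.extension() {
        return false;
    }
    let disk_stem = disk.file_stem().and_then(|s| s.to_str());
    let expected_stem = expected.file_stem().and_then(|s| s.to_str());
    match (disk_stem, expected_stem) {
        (Some(d), Some(e)) => !d.is_empty() && e.starts_with(d),
        _ => false,
    }
}
```

A candidate that passes both the `export_name` prefix match and this check is a stronger match than one that passes only the first, which may help break ties in the ambiguous case.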

## Scope
- Google Chat attachments (confirmed).
- Audit Google Voice (`google_voice/mod.rs`) for the same truncation; very likely affected.

Not urgent — attachments degrade to "missing" rather than corrupting data — but it silently drops real media for any longish filename.

