After the 5-scout fan-out (BBL, AOT, CPALL, SCC, SCB), several edge cases were observed in real Thai SEC filings that the v0 parser/schema doesn't handle. Capturing here as a v0.2 punch list. Full evidence in `notes/scout-*.md`.
Parser-side
Currency-scale detection (highest data-corruption risk). Quarterly filings often use `(พันบาท)` (thousands of baht) while annual filings use `(บาท)` — 1000× scale difference within the same symbol. Anchor: unit marker appears in the row immediately under the column-headers row of each XLS sheet. Without per-filing detection, CPALL Q1 figures are 3 orders of magnitude off vs annual. Should auto-detect `(พันบาท)` / `(ล้านบาท)` / `(บาท)` and multiply numeric cells before emit. Persist detected unit in `filings.parquet` as a new `currency_scale` column. Source: scout-CPALL.md.
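A minimal sketch of the per-sheet unit detection described above, assuming the sheet has already been read into a list of rows and the index of the column-headers row is known. The function names, the `search_window` parameter, and the optional `หน่วย :` prefix are illustrative assumptions, not part of the v0 parser.

```python
import re

# Unit markers named in this issue, mapped to a multiplier into baht.
UNIT_SCALES = {"ล้านบาท": 1_000_000, "พันบาท": 1_000, "บาท": 1}

# Matches "(พันบาท)" and, as an assumed variant, "(หน่วย : พันบาท)".
_UNIT_RE = re.compile(r"\(\s*(?:หน่วย\s*:\s*)?(ล้านบาท|พันบาท|บาท)\s*\)")


def detect_currency_scale(rows, header_row_idx, search_window=3):
    """Look for a unit marker in the rows just under the header row.

    Returns (unit, multiplier), or None when no marker is found so the
    caller can flag the filing instead of guessing a scale.
    """
    start = header_row_idx + 1
    for row in rows[start:start + search_window]:
        for cell in row:
            if isinstance(cell, str):
                match = _UNIT_RE.search(cell)
                if match:
                    unit = match.group(1)
                    return unit, UNIT_SCALES[unit]
    return None


def scale_value(value, multiplier):
    """Multiply numeric cells only; labels and blanks pass through."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value * multiplier
    return value
```

Returning `None` rather than defaulting to baht keeps an undetected unit visible downstream, which matters given the 1000× risk.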
Pre-2003 / pre-2009 legacy filing formats. Two distinct legacy patterns observed:
Combined, filings from roughly 1997-2007 for long-listed names (PTT/BBL/BAY/AOT/SCC/CPALL) need either custom handling or explicit `parse_status='unsupported_legacy_'` markers in `filings.parquet` so the gap is visible.
Continuation-row sheet pattern. SCC mid-era filings combine multiple statements in one sheet — 2010 sheet `งบดุล-งบกำไรขาดทุน` has BS rows 1-89 and IS rows 91+ in the same sheet. Sheet-name → statement classifier needs to handle 'one sheet → multiple statements' or fall back to row-pattern detection. Source: scout-SCC.md.
OCI appended to IS sheet (TAS 1 / IAS 1 presentation pattern). BBL post-2013 filings put OCI directly underneath the IS in the same sheet. Split at `กำไร (ขาดทุน) เบ็ดเสร็จอื่น`. Source: scout-BBL.md.
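The split at the OCI marker could look roughly like this, assuming rows arrive as `(label, values)` pairs in sheet order. `split_is_and_oci` is a hypothetical helper name, and the whitespace tolerance in the pattern is an assumption about how labels vary between filings.

```python
import re

# Marker from scout-BBL.md; whitespace between tokens is allowed to vary.
OCI_MARKER = re.compile(r"กำไร\s*\(ขาดทุน\)\s*เบ็ดเสร็จอื่น")


def split_is_and_oci(labelled_rows):
    """Split one sheet's rows into (income_statement_rows, oci_rows).

    The first row whose label matches the OCI marker starts the OCI
    block. If no marker is found, everything is treated as IS rows.
    """
    for i, (label, _values) in enumerate(labelled_rows):
        if isinstance(label, str) and OCI_MARKER.search(label):
            return labelled_rows[:i], labelled_rows[i:]
    return labelled_rows, []
```

The same shape would probably work for the SCC continuation-row case if a reliable marker for the start of the IS block can be identified.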
Hidden / artifact sheet filtering. CPALL 2026Q1 has `DS_INTERNAL_*` (DataSnipper artifact) sheets; AOT 2020+ has `com.sap.ip.bi.xl.hiddensheet` (SAP BI export artifact); FY2025 sheet names contain literal `EY` strings. Skip any sheet whose name starts with `` or `DS_INTERNAL`. Source: scout-CPALL.md, scout-AOT.md.
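A small sketch of the sheet filter. The first prefix in the rule above is blank in this issue, so the SAP artifact is matched here by substring on the name quoted above; that substring match is an assumption, as are the helper names. Sheets with `EY` in the name are deliberately not filtered.

```python
# Prefix taken from the CPALL observation.
ARTIFACT_PREFIXES = ("DS_INTERNAL",)
# Assumption: match the SAP BI hidden sheet anywhere in the name,
# because the exact prefix is not recoverable from the issue text.
ARTIFACT_SUBSTRINGS = ("com.sap.ip.bi.xl.hiddensheet",)


def is_artifact_sheet(name: str) -> bool:
    """True for tool-generated sheets that carry no statement data."""
    stripped = name.strip()
    if stripped.startswith(ARTIFACT_PREFIXES):
        return True
    return any(part in stripped for part in ARTIFACT_SUBSTRINGS)


def keep_statement_sheets(sheet_names):
    """Drop artifact sheets while preserving the original order."""
    return [n for n in sheet_names if not is_artifact_sheet(n)]
```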
Multiple-period quarterly P&L sheets. SCB Q3 quarterlies have TWO P&L sheets (`PL (3M)` and `PL (9M)`) — emit-both vs canonical-3M is a design call. Source: scout-SCB.md.
Sheet-name evolution patterns for the BS sheet across decades:
Non-canonical zip filenames. 2002-2008 era zips contain `T1.doc / T2.xls / T3.doc` instead of `AUDITOR_REPORT.DOC / FINANCIAL_STATEMENTS.XLS / NOTES.DOC`. Observed in both AOT 2004 and BBL 2005-era filings. Dispatch by extension and content, not filename. Source: scout-BBL.md, scout-AOT.md.
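One way to sketch extension-plus-content dispatch with the standard library: classify each zip member by its suffix, then confirm the leading bytes look like a real Office container. The function names and the `"unknown"` fallback are assumptions. Legacy `.doc` and `.xls` share the same OLE2 container signature, so the signature only confirms the file is a genuine Office binary; telling the auditor report apart from the notes (`T1.doc` vs `T3.doc`) would still need a look at the document text.

```python
import io
import zipfile
from pathlib import PurePosixPath

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc / .xls container
ZIP_MAGIC = b"PK\x03\x04"                        # .docx / .xlsx container


def classify_member(name, head):
    """Classify by extension, then sanity-check the leading bytes."""
    ext = PurePosixPath(name).suffix.lower()
    if ext in (".xls", ".xlsx"):
        kind = "spreadsheet"
    elif ext in (".doc", ".docx"):
        kind = "document"
    else:
        return "unknown"
    if head.startswith(OLE2_MAGIC) or head.startswith(ZIP_MAGIC):
        return kind
    return "unknown"


def classify_zip(zip_bytes):
    """Map each file in the archive to 'spreadsheet', 'document' or 'unknown'."""
    kinds = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            with zf.open(info) as fh:
                kinds[info.filename] = classify_member(info.filename, fh.read(8))
    return kinds
```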
Bank XLS column layout. Section headers are in column A, leaf labels in column C — parser must walk both columns when extracting line items. Source: scout-BBL.md.
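A rough sketch of walking both columns, assuming zero-based indices 0 and 2 for columns A and C and rows given as plain lists. `extract_bank_line_items` is a hypothetical name. Rows that carry values only against a column A label (for example section totals) are not emitted here and would need their own rule.

```python
def extract_bank_line_items(rows, section_col=0, leaf_col=2):
    """Pair each leaf label (column C) with the latest section header (column A).

    Returns a list of (section, leaf_label, values), where values are the
    cells to the right of the leaf label column.
    """
    items = []
    section = None
    for row in rows:
        header = row[section_col] if len(row) > section_col else None
        leaf = row[leaf_col] if len(row) > leaf_col else None
        if isinstance(header, str) and header.strip():
            section = header.strip()  # remember the current section
        if isinstance(leaf, str) and leaf.strip():
            items.append((section, leaf.strip(), row[leaf_col + 1:]))
    return items
```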
Variable BS date columns. CPALL FY2014/FY2015 BS has a 3rd date column (`1 มกราคม 2557`) for TFRS opening-balance restatement. Parser must handle variable column count, not hardcode 4 period columns. Source: scout-CPALL.md.
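Instead of a fixed column count, the parser could discover period columns from the header row itself. The sketch below assumes headers contain dates written as day, Thai month name, and Buddhist Era year (as in `1 มกราคม 2557`), and converts to the Gregorian year by subtracting 543. Function names are illustrative.

```python
import re
from datetime import date

THAI_MONTHS = {
    "มกราคม": 1, "กุมภาพันธ์": 2, "มีนาคม": 3, "เมษายน": 4,
    "พฤษภาคม": 5, "มิถุนายน": 6, "กรกฎาคม": 7, "สิงหาคม": 8,
    "กันยายน": 9, "ตุลาคม": 10, "พฤศจิกายน": 11, "ธันวาคม": 12,
}
_DATE_RE = re.compile(r"(\d{1,2})\s+(" + "|".join(THAI_MONTHS) + r")\s+(\d{4})")


def parse_thai_date(text):
    """Parse 'D <Thai month> BBBB' (Buddhist Era year) into a date, else None."""
    match = _DATE_RE.search(text)
    if not match:
        return None
    day = int(match.group(1))
    month = THAI_MONTHS[match.group(2)]
    year = int(match.group(3)) - 543  # Buddhist Era to Gregorian
    return date(year, month, day)


def find_period_columns(header_row):
    """Return [(column_index, date)] for every header cell holding a date."""
    columns = []
    for idx, cell in enumerate(header_row):
        if isinstance(cell, str):
            parsed = parse_thai_date(cell)
            if parsed is not None:
                columns.append((idx, parsed))
    return columns
```

With this, a restated opening-balance column simply shows up as one more entry in the result rather than shifting the expected layout.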
Schema additions
`emphasis_of_matter` boolean on `auditor_reports.parquet` — distinct from `going_concern_emphasis`. AOT FY2025 has an EOM on the King Power duty-free renegotiation; SCB FY2022 has an EOM on the SCBX restructure. Source: scout-AOT.md, scout-SCB.md.
`auditor_signatures` table for joint audits and partner rotation. BBL had 5 partners under Deloitte over 30 years; SCC had 4 partners under KPMG. Per-filing signing partner is currently lost. Source: scout-BBL.md, scout-SCC.md.
Ticker restructure tracking. SCB ticker silently spans two legal entities (Siam Commercial Bank → SCB X holding co. in 2022). Symbol is the same but the entity name in the BS title row changes. Surface entity-name in `filings.parquet` so consumers can detect mid-ticker re-domiciliation. Source: scout-SCB.md.
`parse_status` column on `filings.parquet` to mark unsupported-legacy filings rather than silently producing 0 rows. Suggested values: `ok` / `unsupported_legacy_text` / `unsupported_legacy_doc_only` / `unsupported_unknown`. Source: scout-SCC.md, scout-CPALL.md.
`is_government_filing` per filing (not per firm). AOT auditor rotated SAO → EY → KPMG within 12 months; `is_government=true` on the firm doesn't cover symbol-level state-enterprise status. Need separate symbol-level `is_state_enterprise` flag. Source: scout-AOT.md.
Coverage limitations to surface to users
fs-norm endpoint coverage limit. SCB only exposes 17 filings (all 2022Q1+) despite SCB being listed since 1976. The endpoint may not return pre-restructure history for some symbols. Discovery layer should log per-symbol `earliest_period_observed` and surface this in the dataset README so consumers know the coverage isn't always 'IPO-to-present'. Cross-check: BBL returns 234 filings going back to 1997, so the limit isn't universal. Source: scout-SCB.md.
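Logging `earliest_period_observed` is a simple per-symbol minimum. The sketch assumes the discovery layer can yield `(symbol, year, quarter)` tuples with integer years and quarters; that tuple shape is an assumption, not the current discovery output.

```python
def earliest_period_by_symbol(filings):
    """Return {symbol: (year, quarter)} for the earliest filing seen per symbol."""
    earliest = {}
    for symbol, year, quarter in filings:
        period = (year, quarter)
        if symbol not in earliest or period < earliest[symbol]:
            earliest[symbol] = period
    return earliest
```

Comparing the result against each symbol's listing date would make gaps like the SCB one stand out in the README table.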
Per-industry coverage gate (already tracked in #19, "[v2 follow-up] Per-industry coverage gate (banks/REITs would silently tank the global gate)"). Banks have zero `revenue`/`cogs` coverage by design (single-step income statement), so an aggregate coverage gate would spuriously fail.
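To illustrate why the gate has to be evaluated per industry, here is a sketch in which each industry has its own set of required concepts. The dict layout, the `"default"` key, the 0.9 threshold, and the `net_interest_income` concept name are all hypothetical.

```python
from collections import defaultdict


def coverage_gate(filings, required_by_industry, threshold=0.9):
    """Compute per-industry coverage rates and an overall pass/fail.

    filings: iterable of dicts with 'industry' and 'concepts' (a set).
    required_by_industry: {industry: set_of_required_concepts}, with a
    'default' entry used for industries that have no specific rule.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for filing in filings:
        industry = filing["industry"]
        required = required_by_industry.get(industry, required_by_industry["default"])
        totals[industry] += 1
        if required <= filing["concepts"]:  # every required concept is present
            hits[industry] += 1
    rates = {industry: hits[industry] / totals[industry] for industry in totals}
    return rates, all(rate >= threshold for rate in rates.values())
```

Under a single global rule requiring `revenue` and `cogs`, the bank filing in a mixed sample would count as a miss even though nothing is wrong with it.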
Observations to inform the consolidator merge
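When the eventual coordinator-merger consolidates the 5 staging `data/concepts..csv` and `data/auditors..csv` files into the canonical CSVs:
- ~150 alias-append proposals across the 5 scouts (existing concepts with new Thai phrasings observed)
- ~110 new concept proposals (with industry tags), heavily skewed bank+industrial-conglomerate
- 5 distinct auditor firm legal entities (KPMG Phoomchai for SCC+SCB+CPALL, which confirms KPMG dominance for industrials/retail; Deloitte Touche Tohmatsu Jaiyos for BBL; SAO + EY + KPMG for AOT). KPMG covers 3 of the 5 scout symbols, so the auditor diversity hypothesis was only partially validated (3 of 4 Big-4 + SAO).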
Open: should the consolidator be a script (`scripts/merge_scout_findings.py`) that auto-applies non-conflicting alias appends and proposes new concepts as a separate review-required PR? Or a manual editorial pass?
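If the script route is chosen, the core of the auto-apply step might look like the sketch below. It assumes the canonical concepts are loaded as `{concept_id: set_of_aliases}` and scout proposals as `(concept_id, alias)` pairs. The function name, the data shapes, and the reason strings are hypothetical and do not reflect the existing CSV schema.

```python
def merge_alias_proposals(canonical, proposals):
    """Apply non-conflicting alias appends; route everything else to review.

    Returns (merged, needs_review), where needs_review holds
    (concept_id, alias, reason) tuples for the review-required PR.
    """
    merged = {cid: set(aliases) for cid, aliases in canonical.items()}
    owner = {alias: cid for cid, aliases in merged.items() for alias in aliases}

    # An alias proposed for more than one concept is a conflict for all of them.
    claims = {}
    for cid, alias in proposals:
        claims.setdefault(alias, set()).add(cid)

    needs_review = []
    for cid, alias in proposals:
        if cid not in merged:
            needs_review.append((cid, alias, "new_concept"))
        elif len(claims[alias]) > 1 or owner.get(alias, cid) != cid:
            needs_review.append((cid, alias, "alias_conflict"))
        else:
            merged[cid].add(alias)
    return merged, needs_review
```

The input is never mutated, so the script could print a diff of `merged` against `canonical` as a dry run before writing anything.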