[v0.2 follow-ups] Parser robustness gaps surfaced by 5-scout fan-out (BBL/AOT/CPALL/SCC/SCB) #20

@ninyawee

After the 5-scout fan-out (BBL, AOT, CPALL, SCC, SCB), several edge cases were observed in real Thai SEC filings that the v0 parser/schema doesn't handle. Capturing here as a v0.2 punch list. Full evidence in `notes/scout-*.md`.

Parser-side

  • Currency-scale detection (highest data-corruption risk). Quarterly filings often use `(พันบาท)` (thousands of baht) while annual filings use `(บาท)` — 1000× scale difference within the same symbol. Anchor: unit marker appears in the row immediately under the column-headers row of each XLS sheet. Without per-filing detection, CPALL Q1 figures are 3 orders of magnitude off vs annual. Should auto-detect `(พันบาท)` / `(ล้านบาท)` / `(บาท)` and multiply numeric cells before emit. Persist detected unit in `filings.parquet` as a new `currency_scale` column. Source: scout-CPALL.md.

  • Pre-2003 / pre-2009 legacy filing formats. Two distinct legacy patterns observed:

    • TIS-620 plain text: SCC 1997 filings are single `.t97/.t98` files (no zip, no XLS, no DOC). `zipfile.ZipFile` raises `BadZipFile`. Source: scout-SCC.md.
    • DOC-only with Word tables: CPALL FY2003 zip has no XLS files; financial statements are tables inside `t2.doc`. Pipeline assumes XLSX input → silently produces 0 rows. Source: scout-CPALL.md.
      Combined, filings from ~1997-2007 for long-listed names (PTT/BBL/BAY/AOT/SCC/CPALL) need either custom handling or explicit `parse_status='unsupported_legacy_'` markers in `filings.parquet` to make the gap visible.
  • Continuation-row sheet pattern. SCC mid-era filings combine multiple statements in one sheet — 2010 sheet `งบดุล-งบกำไรขาดทุน` has BS rows 1-89 and IS rows 91+ in the same sheet. Sheet-name → statement classifier needs to handle 'one sheet → multiple statements' or fall back to row-pattern detection. Source: scout-SCC.md.

  • OCI appended to IS sheet (TFRS 1 / IFRS 1 disclosure pattern). BBL post-2013 filings put OCI directly underneath the IS in the same sheet — split at `กำไร (ขาดทุน) เบ็ดเสร็จอื่น`. Source: scout-BBL.md.

  • Hidden / artifact sheet filtering. CPALL 2026Q1 has `DS_INTERNAL_*` (DataSnipper artifact) sheets; AOT 2020+ has `com.sap.ip.bi.xl.hiddensheet` (SAP BI export artifact); FY2025 sheet names contain literal `EY` strings. Skip any sheet whose name starts with `` or `DS_INTERNAL`. Source: scout-CPALL.md, scout-AOT.md.

  • Multiple-period quarterly P&L sheets. SCB Q3 quarterlies have TWO P&L sheets (`PL (3M)` and `PL (9M)`) — emit-both vs canonical-3M is a design call. Source: scout-SCB.md.

  • Sheet-name evolution patterns for the BS sheet across decades:

    • `งบดุล` (pre-2008) → `งบแสดงฐานะการเงิน` (post-TFRS 1) → `งบฐานะการเงิน` (recent shorthand).
    • Use regex `r"^(งบ)?(ดุล|(แสดง)?ฐานะการเงิน)"` rather than exact match. Source: scout-BBL.md.
  • Non-canonical zip filenames. 2002-2008 era zips contain `T1.doc / T2.xls / T3.doc` instead of `AUDITOR_REPORT.DOC / FINANCIAL_STATEMENTS.XLS / NOTES.DOC`. AOT 2004 and BBL 2005-era both observed. Dispatch by extension+content not filename. Source: scout-BBL.md, scout-AOT.md.

  • Bank XLS column layout. Section headers are in column A, leaf labels in column C — parser must walk both columns when extracting line items. Source: scout-BBL.md.

  • Variable BS date columns. CPALL FY2014/FY2015 BS has a 3rd date column (`1 มกราคม 2557`) for TFRS opening-balance restatement. Parser must handle variable column count, not hardcode 4 period columns. Source: scout-CPALL.md.
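For the currency-scale item at the top of this list, a minimal sketch of per-sheet detection might look like the following. It assumes the sheet has already been read into a list of rows (lists of cell values); `detect_currency_scale` and `scale_value` are hypothetical helper names, not existing pipeline functions, and the `max_rows` scan window is an arbitrary choice.

```python
import re

# Unit markers named in this issue, mapped to the multiplier that converts a cell to baht.
SCALE_BY_UNIT = {"ล้านบาท": 1_000_000, "พันบาท": 1_000, "บาท": 1}

# Only parenthesised markers count, so "(บาท)" can never match inside "(พันบาท)".
UNIT_RE = re.compile(r"\((ล้านบาท|พันบาท|บาท)\)")


def detect_currency_scale(rows, max_rows=15):
    """Return the baht multiplier for the first unit marker in the top rows, else None."""
    for row in rows[:max_rows]:
        for cell in row:
            if not isinstance(cell, str):
                continue
            # Tolerate stray spaces inside the marker, e.g. "( พันบาท )".
            match = UNIT_RE.search(re.sub(r"\s+", "", cell))
            if match:
                return SCALE_BY_UNIT[match.group(1)]
    return None  # Caller decides; silently assuming baht would hide the problem.


def scale_value(value, scale):
    """Multiply numeric cells by the detected scale; leave labels and blanks untouched."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return value
    return value * scale
```

Returning `None` when no marker is found, rather than defaulting to 1, keeps an undetected unit visible in the proposed `currency_scale` column instead of emitting figures at an unknown scale.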

Schema additions

  • `emphasis_of_matter` boolean on `auditor_reports.parquet` — distinct from `going_concern_emphasis`. AOT FY2025 has an EOM on the King Power duty-free renegotiation; SCB FY2022 has an EOM on the SCBX restructure. Source: scout-AOT.md, scout-SCB.md.

  • `auditor_signatures` table for joint audits and partner rotation. BBL had 5 partners under Deloitte over 30 years; SCC had 4 partners under KPMG. Per-filing signing partner is currently lost. Source: scout-BBL.md, scout-SCC.md.

  • Ticker restructure tracking. SCB ticker silently spans two legal entities (Siam Commercial Bank → SCB X holding co. in 2022). Symbol is the same but the entity name in the BS title row changes. Surface entity-name in `filings.parquet` so consumers can detect mid-ticker re-domiciliation. Source: scout-SCB.md.

  • `parse_status` column on `filings.parquet` to mark unsupported-legacy filings rather than silently producing 0 rows. Suggested values: `ok` / `unsupported_legacy_text` / `unsupported_legacy_doc_only` / `unsupported_unknown`. Source: scout-SCC.md, scout-CPALL.md.

  • `is_government_filing` per filing (not per firm). AOT auditor rotated SAO → EY → KPMG within 12 months; `is_government=true` on the firm doesn't cover symbol-level state-enterprise status. Need separate symbol-level `is_state_enterprise` flag. Source: scout-AOT.md.
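The `parse_status` values suggested above could be assigned by a content-based check before any parsing is attempted. The sketch below is one possible shape, not the pipeline's actual dispatch: `classify_filing` is a hypothetical name, the `.tNN` suffix rule is inferred only from the SCC `.t97/.t98` observation, and `ok` here means no more than "the archive contains a spreadsheet worth trying".

```python
import re
import zipfile
from pathlib import Path


def classify_filing(path):
    """Map a downloaded filing to one of the parse_status values proposed in this issue."""
    path = Path(path)
    if not zipfile.is_zipfile(path):
        # SCC 1997-style filings are single TIS-620 text files such as .t97 / .t98.
        if re.fullmatch(r"\.t\d{2}", path.suffix.lower()):
            return "unsupported_legacy_text"
        return "unsupported_unknown"
    with zipfile.ZipFile(path) as archive:
        # Dispatch on member extensions, not filenames (T2.xls vs FINANCIAL_STATEMENTS.XLS).
        suffixes = {Path(name).suffix.lower() for name in archive.namelist()}
    if suffixes & {".xls", ".xlsx"}:
        return "ok"
    if ".doc" in suffixes:
        # CPALL FY2003-style archive: statements live in Word tables only.
        return "unsupported_legacy_doc_only"
    return "unsupported_unknown"
```

Checking with `zipfile.is_zipfile` first avoids the `BadZipFile` exception noted for the SCC text files, and every non-parseable input ends up with an explicit status instead of 0 rows.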

Coverage limitations to surface to users

  • fs-norm endpoint coverage limit. SCB only exposes 17 filings (all 2022Q1+) despite SCB being listed since 1976. The endpoint may not return pre-restructure history for some symbols. Discovery layer should log per-symbol `earliest_period_observed` and surface this in the dataset README so consumers know the coverage isn't always 'IPO-to-present'. Cross-check: BBL returns 234 filings going back to 1997, so the limit isn't universal. Source: scout-SCB.md.

  • Per-industry coverage gate (already tracked in #19, "[v2 follow-up] Per-industry coverage gate (banks/REITs would silently tank the global gate)"). Banks have zero `revenue`/`cogs` coverage by design (single-step income statement), so an aggregate coverage gate would fail spuriously.
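The per-symbol `earliest_period_observed` logging suggested for the discovery layer could be as small as the sketch below. It assumes discovery yields `(symbol, period_end)` pairs with `period_end` as a `datetime.date`; that record shape and the `coverage_summary` name are illustrative assumptions, since the issue does not specify the endpoint's output.

```python
from datetime import date


def coverage_summary(filings):
    """Summarise, per symbol, the earliest period seen and how many filings were listed."""
    summary = {}
    for symbol, period_end in filings:
        entry = summary.setdefault(
            symbol, {"earliest_period_observed": period_end, "filing_count": 0}
        )
        entry["earliest_period_observed"] = min(
            entry["earliest_period_observed"], period_end
        )
        entry["filing_count"] += 1
    return summary


# Toy input for illustration only; the dates are not taken from the real endpoint.
toy = [
    ("SCB", date(2022, 6, 30)),
    ("SCB", date(2022, 3, 31)),
    ("BBL", date(1997, 12, 31)),
]
for symbol, info in sorted(coverage_summary(toy).items()):
    print(symbol, info["earliest_period_observed"].isoformat(), info["filing_count"])
```

Writing this summary into the dataset README would let consumers compare `earliest_period_observed` against a symbol's listing date and spot truncated history such as the SCB case.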

Observations to inform the consolidator merge

When the eventual coordinator-merger consolidates the 5 staging `data/concepts..csv` and `data/auditors..csv` files into the canonical CSVs:

  • ~150 alias-append proposals across the 5 scouts (existing concepts with new Thai phrasings observed)
  • ~110 new concept proposals (with industry tags), heavily skewed bank+industrial-conglomerate
  • 5 distinct auditor firm legal entities (KPMG Phoomchai for SCC+SCB+CPALL — confirms KPMG dominance for industrials/retail; Deloitte Touche Tohmatsu Jaiyos for BBL; SAO + EY + KPMG for AOT). KPMG covers 3 of the 5 scout symbols — auditor diversity hypothesis was partially validated (3 of 4 Big-4 + SAO).

Open: should the consolidator be a script (`scripts/merge_scout_findings.py`) that auto-applies non-conflicting alias appends and proposes new concepts as a separate review-required PR? Or a manual editorial pass?
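To make the script option concrete, here is a rough sketch of the auto-apply step. It assumes the canonical data can be loaded as a mapping from concept to alias list and that each scout proposal reduces to a `(concept, alias)` pair; those shapes and the `merge_alias_proposals` name are assumptions for discussion, not the current CSV schema.

```python
def merge_alias_proposals(canonical, proposals):
    """Apply non-conflicting alias appends; return (merged, needs_review).

    canonical: dict of concept -> list of aliases (left unmodified).
    proposals: iterable of (concept, alias) pairs from the scout staging files.
    """
    merged = {concept: list(aliases) for concept, aliases in canonical.items()}
    # Track which concept already owns each alias so clashes can be detected.
    owner = {alias: concept for concept, aliases in merged.items() for alias in aliases}
    needs_review = []
    for concept, alias in proposals:
        if concept not in merged:
            # New concepts are never auto-applied; they go to the review-required PR.
            needs_review.append((concept, alias, "new_concept"))
        elif owner.get(alias, concept) != concept:
            # The same Thai phrasing is already claimed by a different concept.
            needs_review.append((concept, alias, "alias_conflict"))
        elif alias not in merged[concept]:
            merged[concept].append(alias)
            owner[alias] = concept
    return merged, needs_review
```

Under this split, only appends that cannot change the meaning of an existing alias are applied automatically, and everything else (new concepts and alias clashes) is left for the manual editorial pass.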
