PRD: CI/CD pipeline to publish Thai SEC financial statements as a HuggingFace dataset, exposed via thaifin v2

> Generated from a `/grill-with-docs` session. Locked design decisions captured in [`notes/cicd-design.md`](../tree/master/notes/cicd-design.md), domain language in [`CONTEXT.md`](../tree/master/CONTEXT.md), schema rationale in [`docs/adr/0001-tagged-long-format.md`](../tree/master/docs/adr/0001-tagged-long-format.md).

## Problem Statement

Today, every call to `Stock("PTT").quarter_dataframe` reaches out to Finnomena's live HTTP API. That API exposes ~35 aggregated metrics (revenue, NPM, total `InvestingActivities`) but no balance-sheet, P&L, or cash-flow **line items**. Researchers and quants who want **CapEx**, **PP&E movements**, **detailed cash flow breakdowns**, **auditor opinions**, or **the notes to financial statements** — i.e., the data needed for any serious fundamental analysis of Thai listed companies — have no path forward through `thaifin` as it stands.

Worse, the library is fully coupled to live third-party endpoints. If Finnomena rate-limits, breaks an endpoint, or sunsets the service, every consumer breaks. There is no canonical, citable, version-pinnable corpus a researcher can reference in a paper or notebook.

Thai SEC publishes the full financial filings (XLS + DOC + auditor report, all four primary statements, going back to each company's IPO) for free at `market.sec.or.th/public/idisc/`. The data is there. There is no tool that processes it into a clean, queryable, redistributable form.

## Solution

Build a CI/CD pipeline that scrapes Thai SEC IDISC **monthly**, parses every available filing for every SET-listed company, and publishes a versioned **research-grade dataset** to HuggingFace at `hf.co/datasets/thaifin/financials`.

The dataset is shaped as **Tagged long-format**: one row per `(symbol, period, statement, concept, raw_label_th, value, audit_basis, consolidation, filing_id)`, with a sidecar **concept dictionary** that maps short curated IDs (`capex`, `revenue`, `ebitda`) to the long-tail of original Thai labels — and carries optional XBRL refs for forward compatibility with global IFRS datasets.
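
To make the row shape concrete, here is a minimal sketch of the long format and of the pivot a wide view would need. The `FinancialLine` type, the `pivot_wide` helper, the `2025Q4` period encoding, and all values are assumptions for illustration; the authoritative column specs live in `notes/cicd-design.md`.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class FinancialLine(NamedTuple):
    symbol: str
    period: str
    statement: str           # "BS", "IS", "CF" or "EQ"
    concept: Optional[str]   # None when the Thai label is not in the dictionary
    raw_label_th: str
    value: float
    audit_basis: str
    consolidation: str
    filing_id: str


def pivot_wide(rows):
    """Pivot mapped rows to {period: {concept: value}}; unmapped rows are skipped."""
    wide = defaultdict(dict)
    for row in rows:
        if row.concept is not None:
            wide[row.period][row.concept] = row.value
    return dict(wide)


# Illustrative values only, not taken from any filing.
rows = [
    FinancialLine("XYZ", "2025Q4", "CF", "capex", "เงินสดจ่ายเพื่อซื้อที่ดิน อาคารและอุปกรณ์",
                  -120.0, "audited", "consolidated", "f-001"),
    FinancialLine("XYZ", "2025Q4", "IS", "revenue", "รายได้จากการขาย",
                  900.0, "audited", "consolidated", "f-001"),
    FinancialLine("XYZ", "2025Q4", "IS", None, "รายได้อื่น",
                  15.0, "audited", "consolidated", "f-001"),
]
print(pivot_wide(rows))
```

The unmapped third row stays in the long table with its raw Thai label, which is what keeps long-tail lines auditable even when the dictionary does not cover them.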

The `thaifin` library v2 grows a `source` parameter (`"dataset"` default, `"live"` opt-in). With `source="dataset"`, all data reads go through **DuckDB streaming over HTTP** against the HuggingFace-hosted parquet — no local cache by default, with session-level memoization to avoid foot-guns and an opt-in `thaifin.download_dataset()` for offline use. New convenience properties (`.capex`, `.cash_flow_statement`, `.auditor_report`, `.notes`) ride on top.
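
A rough sketch of what "DuckDB streaming over HTTP" could look like from the library's side. It only builds the URL and the SQL text, so it runs without network access. The file layout (one `<table>.parquet` at the repo root) and the helper names `parquet_url` and `capex_query` are assumptions; the `resolve/<revision>/` URL pattern is the standard HuggingFace file URL.

```python
HF_BASE = "https://huggingface.co/datasets"


def parquet_url(table, revision, repo="thaifin/financials"):
    """Compose the file URL for one table at a pinned revision (layout assumed)."""
    return f"{HF_BASE}/{repo}/resolve/{revision}/{table}.parquet"


def capex_query(symbol, revision):
    """Return (sql, params) for a consolidated CapEx series of one symbol."""
    url = parquet_url("financial_lines", revision)
    sql = (
        f"SELECT period, value FROM read_parquet('{url}') "
        "WHERE symbol = ? AND concept = 'capex' "
        "AND consolidation = 'consolidated' "
        "ORDER BY period"
    )
    return sql, [symbol]


sql, params = capex_query("PTT", "2026.05")
print(sql)

# With DuckDB installed, the query could then be run along these lines:
#   import duckdb
#   con = duckdb.connect()
#   con.execute("INSTALL httpfs")
#   con.execute("LOAD httpfs")
#   df = con.execute(sql, params).df()
```

Filtering on `symbol` and `concept` in the `WHERE` clause is what lets DuckDB skip parquet row groups rather than download the whole file.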

End state: researchers in Python, R, Julia, or any DuckDB-capable tool can pull a pinned revision of the dataset; `thaifin` users get richer data and a non-breaking migration; the library no longer fails when a third-party API hiccups.

## User Stories

1. As a quant researcher, I want to retrieve CapEx for any SET-listed company over the past 10+ years, so that I can compute free cash flow without subscribing to Bloomberg or SETSMART.

2. As a Python user, I want to call `Stock("PTT").capex` and get a pandas Series indexed by period, so that I don't have to learn the underlying schema before getting useful output.

3. As a Python user, I want `Stock("PTT").quarter_dataframe` to keep working when I upgrade to v2, so that my existing scripts don't break.

4. As a Python user, I want to opt back into live Finnomena data with `Stock("PTT", source="live")`, so that I can get fresher-than-monthly figures when I need them.

5. As a researcher writing a paper, I want to cite a specific dataset revision (e.g., `thaifin/financials @ 2026.05`), so that my analysis is reproducible.

6. As a non-Python researcher (R / Julia / DuckDB CLI / Observable), I want to query the dataset directly from HuggingFace URLs, so that I can use it without installing `thaifin`.

7. As a financial analyst, I want to filter rows by `consolidation = 'consolidated'` vs `'company'`, so that I can compare parent-only and group performance.

8. As a financial analyst, I want to filter by `audit_basis = 'audited'`, so that I can exclude reviewed-only quarterly numbers when rigor matters.

9. As a credit analyst, I want to read the **auditor's opinion type** for any filing, so that I can flag companies with qualified opinions or going-concern emphases of matter.

10. As a credit analyst, I want a boolean `going_concern_emphasis` column on `auditor_reports`, so that I can screen the whole market for distress signals in one query.

11. As a researcher, I want full-text **notes** (NOTES.DOC) preserved as markdown alongside the numeric data, so that I can search related-party transactions, segment reporting, or accounting policy changes.

12. As a researcher comparing Thai companies to global peers, I want each curated concept to optionally carry an IFRS XBRL ref, so that I can cross-walk to international datasets.

13. As a contributor, I want to propose new concepts or fix label mappings via a PR against `data/concepts.csv`, so that the dictionary evolves without becoming a maintainer bottleneck.

14. As a contributor, I want CI to fail when concept coverage drops vs the previous build, so that mapping regressions are caught before they ship.

15. As a maintainer, I want the pipeline to incrementally fetch only new filings (diffed against a committed `state.json`), so that monthly builds don't re-download 20 GB every time.

16. As a maintainer, I want the pipeline to be triggerable manually via `workflow_dispatch`, so that I can rebuild on demand for off-cycle filings or hotfixes.

17. As a maintainer, I want failed builds to leave the previous HuggingFace revision intact, so that a parser crash never silently corrupts the public dataset.

18. As a maintainer, I want each filing's `source_url` and `source_sha256` recorded in `filings.parquet`, so that any numeric anomaly can be traced back to the exact regulatory document.

19. As an offline user (CI sandbox / spotty WiFi / corporate firewall), I want `thaifin.download_dataset()` to bulk-fetch the parquet bundle once, so that I can work without network for the rest of the session.

20. As an interactive notebook user, I want repeated calls like `Stock("PTT").capex` and `Stock("PTT").revenue` to hit the network only on the first call within a session, so that exploratory analysis stays fast.

21. As a library user, I want to pin a specific dataset revision via `thaifin.set_data_revision("2026.05")`, so that my notebook produces the same numbers every time I rerun it.

22. As an investor scanning across companies, I want to query e.g. "all SET-listed companies with CapEx > 10B THB in FY2025" in a single DuckDB query against the parquet, so that I don't have to write 800 individual API calls.

23. As a library user, I want raw Thai labels (`raw_label_th`) preserved on every row, so that I can audit a concept mapping or dive into long-tail lines the curated dictionary doesn't cover.

24. As a contributor curating the concept dictionary, I want each concept to be assignable to a specific `statement` (BS/IS/CF/EQ), so that `revenue` never accidentally matches a balance-sheet row.

25. As a library user, I want `Stock("PTT").income_statement`, `.balance_sheet`, `.cash_flow_statement` properties that return pivoted (wide) DataFrames, so that I don't have to learn the long-format schema for basic use.

26. As a maintainer parsing decade-old filings, I want a single parser code path regardless of whether the source filing is legacy `.XLS/.DOC` or modern `.XLSX/.DOCX`, so that I don't maintain two divergent parsers as edge cases accumulate.

27. As a library user pinning to v2, I want a clear CHANGELOG note explaining the column-name realignment between v1.x Finnomena output and the new dataset schema, so that I can migrate my code in one sitting.
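
The market-wide screen in story 22 can be sketched as a single SQL statement. To keep the example self-contained it uses the standard-library `sqlite3` module as a stand-in for DuckDB; the table contents, the `FY2025` period encoding, and the symbols are made up for illustration, and CapEx is assumed to be stored as a negative cash outflow.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE financial_lines ("
    "symbol TEXT, period TEXT, statement TEXT, concept TEXT, "
    "value REAL, audit_basis TEXT, consolidation TEXT)"
)
con.executemany(
    "INSERT INTO financial_lines VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        # Illustrative rows only; values are in THB.
        ("AAA", "FY2025", "CF", "capex", -12e9, "audited", "consolidated"),
        ("BBB", "FY2025", "CF", "capex", -3e9, "audited", "consolidated"),
        ("CCC", "FY2025", "CF", "capex", -25e9, "audited", "company"),
        ("AAA", "FY2024", "CF", "capex", -15e9, "audited", "consolidated"),
    ],
)

SCREEN_SQL = """
SELECT symbol, ABS(value) AS capex_thb
FROM financial_lines
WHERE concept = 'capex'
  AND period = 'FY2025'
  AND consolidation = 'consolidated'
  AND audit_basis = 'audited'
  AND ABS(value) > 10e9
ORDER BY capex_thb DESC
"""

result = con.execute(SCREEN_SQL).fetchall()
print(result)
```

Against the published dataset the same statement would read from `read_parquet(...)` instead of a local table; the `consolidation` and `audit_basis` filters are the ones stories 7 and 8 ask for.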

## Implementation Decisions

### Pipeline modules (CI side)

The pipeline is sliced into deep, independently testable modules. Each has a narrow interface and encapsulates significant logic behind it.

- **FilingDiscovery** — Scrapes SEC IDISC. Two-stage HTTP crawl: (1) walk `/public/idisc/th/company/listed/{0-9,A-Z}` for the symbol universe; (2) per symbol, fetch `/public/idisc/th/Viewmore/fs-norm?searchSymbol=<S>` for filing zip URLs. Emits a manifest of `(symbol, filing_id, source_url, period, audit_basis, consolidation, filed_at)`. The `uniqueIDReference` parameter is *not* required — `searchSymbol` alone is sufficient (verified). Pure function over HTTP; testable with cached HTML fixtures.

- **FilingFetcher** — Incremental zip downloader. Inputs: manifest + `state.json` (filing_id → sha256). Outputs: paths to newly-fetched zips + updated state. Polite rate-limiting; resumable.

- **FormatNormalizer** — Wraps `libreoffice --headless --convert-to xlsx` for spreadsheets and `--convert-to docx` for documents. Inputs: legacy `.XLS/.DOC`. Outputs: modern `.XLSX/.DOCX`. Downstream parsers never see legacy formats. Decision rationale: SEC is mid-transition (PTT Feb-2025 was `.XLS`, COM7 May-2025 was `.XLSX`); normalizing at ingest collapses two parser paths into one.

- **FinancialStatementParser** — Deep module. Inputs: path to `FINANCIAL_STATEMENTS.XLSX`. Outputs: list of `(statement, raw_label_th, value, consolidation)` rows. Encapsulates: sheet → statement classification (BS-Asset/Liability/Equity, PL_Accum, OCI_Accum, Cash flow), multi-period column extraction (each workbook contains four period columns: consolidated current+prior, company current+prior), value parsing (handles Thai numeric formatting, sign conventions, blanks), and per-statement quirks (e.g., negative cash outflows in CF section).

- **AuditorReportParser** — Deep module. Inputs: path to `AUDITOR_REPORT.DOCX`. Outputs: structured row `(auditor_firm, opinion_type, signing_date, signing_partner, going_concern_emphasis, raw_text_md)`. Encapsulates Thai-language heuristics for the four opinion types (`unqualified` / `qualified` / `adverse` / `disclaimer`) and going-concern paragraph detection.

- **NotesParser** — Shallow. Inputs: path to `NOTES.DOCX`. Outputs: markdown text only (no per-section structural extraction in v1).

- **ConceptMapper** — Deep, pure module. Inputs: list of `(statement, raw_label_th, ...)` rows + `concepts.parquet`. Outputs: rows enriched with `concept` column (nullable for unmapped). Encapsulates alias resolution and Thai-label normalization (whitespace, parentheticals).

- **DatasetBuilder** — Composition module. Joins parser outputs into the 5 published parquet tables (`financial_lines`, `concepts`, `auditor_reports`, `notes_text`, `filings`). Shallow; mostly orchestration.

- **HFPublisher** — Wraps `huggingface_hub` upload + git-tag revision. Shallow.
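
As an illustration of the value-parsing step inside FinancialStatementParser, here is a minimal sketch. The specific conventions it handles (parentheses for negatives, a lone dash for blank, thousands separators, Thai digits) are assumptions about what the filings contain, not verified behavior, and `parse_value` is a hypothetical helper name.

```python
from typing import Optional

# Map Thai digits to ASCII digits (assumed to appear in some filings).
THAI_DIGITS = str.maketrans("๐๑๒๓๔๕๖๗๘๙", "0123456789")
BLANK_MARKERS = ("", "-", "–")


def parse_value(cell) -> Optional[float]:
    """Convert a spreadsheet cell to a float, or None for blank cells."""
    if cell is None:
        return None
    if isinstance(cell, (int, float)):
        return float(cell)
    text = str(cell).translate(THAI_DIGITS).strip()
    text = text.replace(",", "").replace(" ", "")
    if text in BLANK_MARKERS:
        return None
    # Accounting style: "(1,234)" means -1234.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    number = float(text)
    return -number if negative else number


print(parse_value("(159,512,958,954)"))
print(parse_value("-"))
```

Cells that are already numeric pass straight through, so the string handling only matters for workbooks where figures were typed as text.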

### Library modules

- **DatasetClient** — Deep, pure-interface module on the library side. Encapsulates: HF URL composition from `(revision, table)`, DuckDB connection management, parquet predicate-pushdown queries, session-level `lru_cache` memoization. Used by all dataset-backed `Stock` methods.

- **StockSourceRouter** — Shallow dispatch layer. `Stock.__init__` accepts `source="dataset"|"live"`; methods like `quarter_dataframe` delegate to `DatasetClient` or to the existing `Finnomena`/`ThaiSecuritiesData` services accordingly. New methods (`.capex`, `.notes`, `.auditor_report`, `.income_statement`, etc.) are dataset-only and raise a clear error if `source="live"`.
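
The dispatch rule for StockSourceRouter can be sketched as follows. This is a toy `Stock` class, not the library's implementation: `DatasetOnlyError`, `_require_dataset`, and the placeholder return values are hypothetical, and the real methods would call `DatasetClient` or the existing live services.

```python
class DatasetOnlyError(RuntimeError):
    """Raised when a dataset-only accessor is used with source='live'."""


class Stock:
    def __init__(self, symbol, source="dataset"):
        if source not in ("dataset", "live"):
            raise ValueError(f"source must be 'dataset' or 'live', got {source!r}")
        self.symbol = symbol
        self.source = source

    def _require_dataset(self, name):
        if self.source != "dataset":
            raise DatasetOnlyError(
                f"Stock.{name} is only available with source='dataset' "
                f"(got source={self.source!r})"
            )

    @property
    def quarter_dataframe(self):
        # Available from both backends; route on the chosen source.
        if self.source == "live":
            return ("live", self.symbol)  # placeholder for the live service
        return ("dataset", self.symbol, "quarter_dataframe")

    @property
    def capex(self):
        # New in v2 and dataset-only.
        self._require_dataset("capex")
        return ("dataset", self.symbol, "capex")


print(Stock("PTT").capex)
print(Stock("PTT", source="live").quarter_dataframe)
```

Failing loudly on `Stock("PTT", source="live").capex` avoids silently returning data from a different backend than the caller asked for.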

### Schema

Five published parquet tables. Detailed column specs live in `notes/cicd-design.md`; the load-bearing summary:

- **`financial_lines.parquet`** — long-format facts. PK is composite `(symbol, period, statement, concept, consolidation, audit_basis, filing_id)`. `concept` is nullable.
- **`concepts.parquet`** — ~120 curated concepts × `(statement, label_en, label_th, aliases_th, xbrl_ref)`. The editorial product. Sourced from a human-maintained `data/concepts.csv` in the repo; converted to parquet at build time.
- **`auditor_reports.parquet`** — one row per filing × structured fields + `raw_text_md`.
- **`notes_text.parquet`** — one row per filing × `raw_text_md` only.
- **`filings.parquet`** — provenance and incremental state: `filing_id`, source URL, sha256, fetch timestamp.
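
The incremental state behind `filings.parquet` and `state.json` (filing_id → sha256) can be sketched with two small functions. The function names and the manifest row shape are assumptions; only the filing_id-to-hash mapping comes from the design above.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def pending_filings(manifest, state):
    """Manifest entries whose filing_id has no recorded hash yet."""
    return [entry for entry in manifest if entry["filing_id"] not in state]


def record_fetch(state, filing_id, payload: bytes):
    """Return (new_state, changed).

    `changed` is True when a hash was already recorded for this filing and the
    freshly downloaded bytes differ, i.e. an upstream re-upload.
    """
    digest = sha256_bytes(payload)
    changed = filing_id in state and state[filing_id] != digest
    new_state = dict(state)
    new_state[filing_id] = digest
    return new_state, changed


state = {"f-001": sha256_bytes(b"original zip bytes")}
manifest = [{"filing_id": "f-001"}, {"filing_id": "f-002"}]

print(pending_filings(manifest, state))
state, changed = record_fetch(state, "f-002", b"new zip bytes")
print(json.dumps(state, indent=2, sort_keys=True))
```

Returning a new dict instead of mutating in place makes it easy to write `state.json` only after a build succeeds, which fits the requirement that failed builds leave prior state intact.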

### Distribution

- Dataset host: `hf.co/datasets/thaifin/financials`.
- Revisions tagged per build: `2026.05`, `2026.06`, … (year.month).
- Cadence: `cron: '0 2 1 * *'` (1st of each month, 02:00 UTC). `workflow_dispatch` always available.
- The library pins a default revision per library release; users override with `thaifin.set_data_revision(...)`.
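
A small sketch of the `year.month` tagging scheme. The helper names are hypothetical; the format itself (`2026.05`, `2026.06`) is taken from the list above.

```python
import re
from datetime import date

REVISION_RE = re.compile(r"^\d{4}\.(0[1-9]|1[0-2])$")


def revision_tag(build_date: date) -> str:
    """Format a build date as a zero-padded year.month tag."""
    return f"{build_date.year}.{build_date.month:02d}"


def is_valid_revision(tag: str) -> bool:
    return bool(REVISION_RE.match(tag))


print(revision_tag(date(2026, 5, 1)))   # 2026.05
print(is_valid_revision("2026.13"))     # False
```

Zero-padding the month means tags sort correctly as plain strings (`"2026.05" < "2026.10"`), which keeps "latest revision" lookups simple.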

### Library API contract

```python
# Default — streams from the published dataset
Stock("PTT").capex                                # pd.Series indexed by period
Stock("PTT").quarter_dataframe                    # wide-pivoted; column names follow the new schema
Stock("PTT").cash_flow_statement                  # wide DataFrame
Stock("PTT").auditor_report                       # DataFrame, one row per period
Stock("PTT").notes                                # DataFrame with raw_text_md per period

# Opt back to live (legacy v1.x behavior)
Stock("PTT", source="live").quarter_dataframe     # hits Finnomena

# Pin a specific dataset version
Stock("PTT", revision="2026.05").capex
thaifin.set_data_revision("2026.05")              # module-wide default

# Offline / sandboxed
thaifin.download_dataset()                        # one-time bulk fetch into ~/.cache/thaifin/<revision>/

# Power-user query
thaifin.financial_lines()                         # pl.LazyFrame against the active revision
```
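
One way to read the session-memoization requirement (story 20) is to cache per `(revision, symbol)` and filter concepts locally, so that `.capex` and `.revenue` share one fetch. The sketch below takes that interpretation; the function names and the stand-in fetch are hypothetical, and the real `DatasetClient` may cache at a different granularity.

```python
from functools import lru_cache

FETCH_LOG = []  # records each simulated network round trip


def _fetch_symbol_rows(revision, symbol):
    """Stand-in for the remote query; returns (concept, period, value) tuples."""
    FETCH_LOG.append((revision, symbol))
    return (
        ("capex", "2025Q4", -120.0),    # illustrative values only
        ("revenue", "2025Q4", 900.0),
    )


@lru_cache(maxsize=128)
def symbol_rows(revision, symbol):
    # The revision is part of the cache key, so re-pinning never serves stale rows.
    return _fetch_symbol_rows(revision, symbol)


def series(revision, symbol, concept):
    """Filter the cached rows locally; no extra round trip per concept."""
    return {p: v for c, p, v in symbol_rows(revision, symbol) if c == concept}


series("2026.05", "PTT", "capex")
series("2026.05", "PTT", "revenue")
series("2026.05", "PTT", "capex")
print(len(FETCH_LOG))  # 1
```

Returning an immutable tuple from the cached function guards against one caller mutating data that a later caller will receive from the cache.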

### Schema migration note

The new `quarter_dataframe` column names align with the tagged schema, not v1.x Finnomena names. This is a breaking change at the column level even though the API shape is preserved. The CHANGELOG will provide a one-page rename map.

### Dependencies

Add: `duckdb`, `huggingface_hub`, `python-docx`, `openpyxl` (already dev-dep, promote to runtime), `polars`. System-side in CI only: `libreoffice-core`, `libreoffice-calc`, `libreoffice-writer`. Existing `httpx`, `pydantic`, `cachetools`, `tenacity`, `pandas` stay.

### CI runtime concern (known unknown)

First-build estimate is ~17h (40K filings × ~1.5s normalize+parse). GitHub-hosted Actions runners cap a single job at 6h. Mitigation: shard by symbol-letter (36 parallel jobs), or do a one-time backfill on a beefier machine and let monthly CI handle incremental builds from then on. Decision deferred to first build; the current plan is to try sharding first.
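
The arithmetic behind the estimate, plus the symbol-letter sharding key, as a quick sketch. The helper names are hypothetical, and the per-shard figure assumes filings are spread evenly across the 36 buckets, which real symbol distributions will not match.

```python
import string

# One shard per leading character: digits 0-9 plus letters A-Z.
SHARD_KEYS = list(string.digits + string.ascii_uppercase)


def shard_for(symbol: str) -> str:
    """Pick the shard from the symbol's first character."""
    first = symbol[0].upper()
    if first not in SHARD_KEYS:
        raise ValueError(f"unexpected leading character in {symbol!r}")
    return first


def estimate_hours(filings, seconds_each, parallel_jobs=1):
    """Wall-clock hours, assuming work divides evenly across jobs."""
    return filings * seconds_each / 3600 / parallel_jobs


print(len(SHARD_KEYS))                              # 36
print(round(estimate_hours(40_000, 1.5), 1))        # 16.7
print(round(estimate_hours(40_000, 1.5, 36), 2))    # 0.46
```

Even if the busiest letter held several times its even share of filings, a shard would still sit well under the 6h cap, which is the case for trying sharding first.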

## Testing Decisions

A good test for this codebase verifies **external behavior** — what users observe — not implementation details. Tests assert on parsed data values, query outputs, and structured fields; they do not assert on internal helper function names, intermediate DataFrame shapes, or the specific regex used inside a parser. Tests use real fixture filings (committed in `tests/sample_data/*.zip`) rather than mocks, so that parser changes are validated against actual SEC documents.

### Modules to unit-test (deep, isolated, fixture-driven)

- **FinancialStatementParser** — Fixture: `tests/sample_data/PTT_2025Q4.zip`. Assert: known CapEx value (`-159,512,958,954` for FY2025 consolidated), known revenue, known total assets. Both legacy `.XLS` and modern `.XLSX` fixtures covered.
- **AuditorReportParser** — Fixtures: a clean (unqualified) opinion + one qualified opinion. Assert: `opinion_type`, `auditor_firm`, `going_concern_emphasis` correctly extracted.
- **ConceptMapper** — Pure function. Inputs: synthetic rows + minimal dictionary. Assert: aliases resolve, statement scoping prevents cross-statement false matches, unmapped rows survive with `concept=None`.
- **FilingDiscovery** — Fixture: cached HTML of one letter-page + one fs-norm response. Assert: yields expected `(symbol, source_url)` tuples; handles pagination.
- **DatasetClient** — Mock HTTP layer + DuckDB. Assert: revision resolution → correct URL, `lru_cache` deduplicates queries within a session, query results return expected schema.
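
The three ConceptMapper assertions listed above (aliases resolve, statement scoping, unmapped rows survive) could be exercised against a sketch like this one. The normalization rules shown (drop parentheticals, collapse whitespace) follow the module description; the function names, the exact-match lookup, and the sample labels are assumptions.

```python
import re


def normalize_label(label: str) -> str:
    """Drop parenthetical fragments and collapse runs of whitespace."""
    without_parens = re.sub(r"\([^)]*\)", " ", label)
    return re.sub(r"\s+", " ", without_parens).strip()


def build_index(dictionary):
    """Index (statement, normalized label) -> concept, including aliases."""
    index = {}
    for row in dictionary:
        labels = [row["label_th"]] + [a for a in row["aliases_th"].split("|") if a]
        for label in labels:
            index[(row["statement"], normalize_label(label))] = row["concept"]
    return index


def map_concepts(rows, index):
    """Add a `concept` key to each row; None when nothing matches."""
    return [
        {**r, "concept": index.get((r["statement"], normalize_label(r["raw_label_th"])))}
        for r in rows
    ]


dictionary = [{
    "concept": "revenue", "statement": "IS",
    "label_th": "รายได้จากการขาย",
    "aliases_th": "รายได้จากการขายและบริการ",
}]
rows = [
    {"statement": "IS", "raw_label_th": "รายได้จากการขาย  (หมายเหตุ 5)"},
    {"statement": "IS", "raw_label_th": "รายได้จากการขายและบริการ"},
    {"statement": "BS", "raw_label_th": "รายได้จากการขาย"},
    {"statement": "IS", "raw_label_th": "รายได้อื่น"},
]
mapped = map_concepts(rows, build_index(dictionary))
print([m["concept"] for m in mapped])
```

Keying the index on `(statement, label)` rather than on the label alone is what prevents the balance-sheet row from matching `revenue`.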

### Modules covered by integration tests only

- **FormatNormalizer** (subprocess to libreoffice — slow, environment-dependent)
- **HFPublisher** (network IO)
- **FilingFetcher** (mostly IO + state file manipulation; covered by a single end-to-end smoke test)

### End-to-end smoke test

A single test that, given a fixture zip URL, runs the full pipeline (discover → fetch → normalize → parse → apply concepts → build parquets) and asserts that a query against the resulting parquet returns the expected CapEx value. Slow; runs only on push to main and in nightly CI.

### Prior art

The codebase's existing test pattern (`tests/`) is `pytest` with `unittest.mock.patch` for service-level mocks and a `tests/public_internet_tests/` directory for live-network tests gated by markers. New tests follow that convention. Fixture files live under `tests/sample_data/` alongside the existing `get_financial_sheet.json`.

## Out of Scope

- **Industry-specific concept variants** (banks, REITs, insurers, brokerages). v1 of the concept dictionary aims for the cross-industry intersection plus general-industrial-company line items. Industry-specialized concepts (e.g., `interest_income` for banks, `property_income` for REITs) deferred to v2.1 once the dictionary has been stressed against real filings.
- **Restatement handling**. When a company refiles a prior period, v1 keeps the most recent filing's values and treats the superseded filing as inactive in `filings.parquet`. A formal restatement model (with `superseded_by` chains and time-of-knowledge queries) is deferred.
- **LLM-based auditor opinion classification**. v1 uses Thai regex/keyword heuristics for `opinion_type` extraction. LLM-assisted extraction (with manual override list) is a v2.1 candidate.
- **Notes structural extraction**. v1 stores notes as raw markdown only. Parsing specific note sections (segment reporting tables, related-party transaction tables, lease maturity schedules) into structured tables is deferred.
- **56-1 / 56-2 annual report ingestion**. Those are separate document types in IDISC and contain narrative + governance + risk disclosures. Out of scope for v1; the dataset is financial-statements-only.
- **Real-time / intra-monthly freshness**. The dataset updates monthly. Users who need filings the day they're posted should use `source="live"` (Finnomena) or build their own scraper.
- **Removal of the live Finnomena and ThaiSecuritiesData source modules**. They remain in tree under `thaifin/sources/` for `source="live"`. Removal is a v3.0 candidate after measuring `source="live"` usage in the wild.
- **Authentication / private datasets / commercial tier**. The dataset is public and free.

## Further Notes

- **Concept dictionary as PR-driven artifact**: `data/concepts.csv` (CSV for easy human review) is the source of truth. CI converts it to `concepts.parquet` at build time. Each row carries `concept`, `statement`, `label_en`, `label_th`, `aliases_th` (pipe-separated), `xbrl_ref`. Contributors propose new concepts or alias additions via PR against this CSV.
- **CI coverage gate**: build fails if the % of `financial_lines` rows with `concept IS NOT NULL` drops by more than a configurable threshold (default 1pp) versus the previous published revision. Prevents concept regression.
- **Provenance traceability**: every value can be traced back to a specific filing via `filing_id` → `filings.source_url` → the exact SEC zip. The zip's sha256 is recorded so we can detect upstream re-uploads.
- **License**: Dataset published under CC-BY-4.0 (typical for derivative datasets of public regulatory disclosures). Library remains ISC. The dataset card (README) on HuggingFace cites Thai SEC IDISC as the primary source and links back to the original filing URLs.
- **Migration path**: v1.x users who leave `source` unset will get a `DeprecationWarning` on first call for the remainder of the v1.x cycle; from v2.0 the library defaults to `source="dataset"` without a warning.
- **Open question on dataset namespace**: published as `thaifin/financials` (requires creating the `thaifin` org on HF) or as `ninyawee/thaifin-financials` (user namespace). Org gives a cleaner researcher-facing brand but requires a one-time HF org setup. Defaulting to `thaifin/financials` in this PRD; reversible at first publish.
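
The CI coverage gate described above reduces to a comparison of two fractions. A minimal sketch, assuming coverage is computed over all `financial_lines` rows and that the previous revision's figure is available to the build; the function names are hypothetical.

```python
def coverage(rows):
    """Fraction of rows whose concept is not null (0.0 for an empty table)."""
    if not rows:
        return 0.0
    mapped = sum(1 for r in rows if r["concept"] is not None)
    return mapped / len(rows)


def coverage_gate(previous, current, max_drop_pp=1.0):
    """Return True when the build may proceed.

    `previous` and `current` are fractions in [0, 1]; the drop is measured
    in percentage points. A tiny tolerance absorbs float rounding at the
    exact threshold.
    """
    drop_pp = (previous - current) * 100
    return drop_pp <= max_drop_pp + 1e-9


rows = [{"concept": "capex"}, {"concept": None}, {"concept": "revenue"}, {"concept": "revenue"}]
print(coverage(rows))                 # 0.75
print(coverage_gate(0.80, 0.795))     # True: 0.5pp drop
print(coverage_gate(0.80, 0.78))      # False: 2pp drop
```

An increase in coverage yields a negative drop and always passes, so the gate only ever blocks regressions.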

