Generated from a /grill-with-docs session. Locked design decisions captured in notes/cicd-design.md, domain language in CONTEXT.md, schema rationale in docs/adr/0001-tagged-long-format.md.
Problem Statement
Today, every call to Stock("PTT").quarter_dataframe reaches out to Finnomena's live HTTP API. That API exposes ~35 aggregated metrics (revenue, NPM, total InvestingActivities) but no balance-sheet, P&L, or cash-flow line items. Researchers and quants who want CapEx, PP&E movements, detailed cash flow breakdowns, auditor opinions, or the notes to financial statements — i.e., the data needed for any serious fundamental analysis of Thai listed companies — have no path forward through thaifin as it stands.
Worse, the library is fully coupled to live third-party endpoints. If Finnomena rate-limits, breaks an endpoint, or sunsets the service, every consumer breaks. There is no canonical, citable, version-pinnable corpus a researcher can reference in a paper or notebook.
Thai SEC publishes the full financial filings (XLS + DOC + auditor report, all four primary statements, going back to each company's IPO) for free at market.sec.or.th/public/idisc/. The data is there. There is no tool that processes it into a clean, queryable, redistributable form.
Solution
Build a CI/CD pipeline that scrapes Thai SEC IDISC monthly, parses every available filing for every SET-listed company, and publishes a versioned, research-grade dataset to HuggingFace at hf.co/datasets/thaifin/financials.
The dataset is shaped as Tagged long-format: one row per (symbol, period, statement, concept, raw_label_th, value, audit_basis, consolidation, filing_id), with a sidecar concept dictionary that maps short curated IDs (capex, revenue, ebitda) to the long-tail of original Thai labels — and carries optional XBRL refs for forward compatibility with global IFRS datasets.
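To make the tagged long-format shape concrete, here is a minimal sketch in plain Python. The field names follow the row tuple above; the sample symbol, the period strings, the Thai labels, the values, and the pivot_wide helper are all illustrative assumptions, not real filing data or part of the thaifin API.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class FinancialLine(NamedTuple):
    """One tagged long-format fact; field names follow the PRD row tuple."""
    symbol: str
    period: str
    statement: str
    concept: Optional[str]  # None when the raw label has no curated mapping
    raw_label_th: str
    value: float
    audit_basis: str
    consolidation: str
    filing_id: str


def pivot_wide(rows, consolidation="consolidated"):
    """Pivot mapped rows into {period: {concept: value}} (hypothetical helper)."""
    wide = defaultdict(dict)
    for row in rows:
        if row.concept is None or row.consolidation != consolidation:
            continue  # unmapped rows stay reachable in long format only
        wide[row.period][row.concept] = row.value
    return dict(wide)


# Made-up sample rows, for illustration only.
rows = [
    FinancialLine("XYZ", "2025Q4", "IS", "revenue", "รายได้รวม", 1000.0,
                  "audited", "consolidated", "f1"),
    FinancialLine("XYZ", "2025Q4", "CF", "capex",
                  "เงินสดจ่ายเพื่อซื้อที่ดิน อาคาร และอุปกรณ์", -200.0,
                  "audited", "consolidated", "f1"),
    FinancialLine("XYZ", "2025Q4", "CF", None, "รายการอื่น", 5.0,
                  "audited", "consolidated", "f1"),
    FinancialLine("XYZ", "2025Q4", "IS", "revenue", "รายได้รวม", 800.0,
                  "audited", "company", "f1"),
]

wide = pivot_wide(rows)
print(wide)
```

The unmapped row and the company-level row are left out of the wide view but remain in the long table, which is the property the raw_label_th column is meant to preserve.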
The thaifin library v2 grows a source parameter ("dataset" default, "live" opt-in). With source="dataset", all data reads go through DuckDB streaming over HTTP against the HuggingFace-hosted parquet — no local cache by default, with session-level memoization to avoid foot-guns and an opt-in thaifin.download_dataset() for offline use. New convenience properties (.capex, .cash_flow_statement, .auditor_report, .notes) ride on top.
End state: researchers in Python, R, Julia, or any DuckDB-capable tool can pull a pinned revision of the dataset; thaifin users get richer data and a non-breaking migration; the library no longer fails when a third-party API hiccups.
User Stories
- As a quant researcher, I want to retrieve CapEx for any SET-listed company over the past 10+ years, so that I can compute free cash flow without subscribing to Bloomberg or SETSMART.
- As a Python user, I want to call Stock("PTT").capex and get a pandas Series indexed by period, so that I don't have to learn the underlying schema before getting useful output.
- As a Python user, I want Stock("PTT").quarter_dataframe to keep working when I upgrade to v2, so that my existing scripts don't break.
- As a Python user, I want to opt back into live Finnomena data with Stock("PTT", source="live"), so that I can get fresher-than-monthly figures when I need them.
- As a researcher writing a paper, I want to cite a specific dataset revision (e.g., thaifin/financials @ 2026.05), so that my analysis is reproducible.
- As a non-Python researcher (R / Julia / DuckDB CLI / Observable), I want to query the dataset directly from HuggingFace URLs, so that I can use it without installing thaifin.
- As a financial analyst, I want to filter rows by consolidation = 'consolidated' vs 'company', so that I can compare parent-only and group performance.
- As a financial analyst, I want to filter by audit_basis = 'audited', so that I can exclude reviewed-only quarterly numbers when rigor matters.
- As a credit analyst, I want to read the auditor's opinion type for any filing, so that I can flag companies with qualified opinions or going-concern emphases of matter.
- As a credit analyst, I want a boolean going_concern_emphasis column on auditor_reports, so that I can screen the whole market for distress signals in one query.
- As a researcher, I want full-text notes (NOTES.DOC) preserved as markdown alongside the numeric data, so that I can search related-party transactions, segment reporting, or accounting policy changes.
- As a researcher comparing Thai companies to global peers, I want each curated concept to optionally carry an IFRS XBRL ref, so that I can cross-walk to international datasets.
- As a contributor, I want to propose new concepts or fix label mappings via a PR against data/concepts.csv, so that the dictionary evolves without becoming a maintainer bottleneck.
- As a contributor, I want CI to fail when concept coverage drops vs the previous build, so that mapping regressions are caught before they ship.
- As a maintainer, I want the pipeline to incrementally fetch only new filings (diffed against a committed state.json), so that monthly builds don't re-download 20 GB every time.
- As a maintainer, I want the pipeline to be triggerable manually via workflow_dispatch, so that I can rebuild on demand for off-cycle filings or hotfixes.
- As a maintainer, I want failed builds to leave the previous HuggingFace revision intact, so that a parser crash never silently corrupts the public dataset.
- As a maintainer, I want each filing's source_url and source_sha256 recorded in filings.parquet, so that any numeric anomaly can be traced back to the exact regulatory document.
- As an offline user (CI sandbox / spotty WiFi / corporate firewall), I want thaifin.download_dataset() to bulk-fetch the parquet bundle once, so that I can work without network for the rest of the session.
- As an interactive notebook user, I want repeated calls like Stock("PTT").capex and Stock("PTT").revenue to hit the network only on the first call within a session, so that exploratory analysis stays fast.
- As a library user, I want to pin a specific dataset revision via thaifin.set_data_revision("2026.05"), so that my notebook produces the same numbers every time I rerun it.
- As an investor scanning across companies, I want to query e.g. "all SET-listed companies with CapEx > 10B THB in FY2025" in a single DuckDB query against the parquet, so that I don't have to write 800 individual API calls.
- As a library user, I want raw Thai labels (raw_label_th) preserved on every row, so that I can audit a concept mapping or dive into long-tail lines the curated dictionary doesn't cover.
- As a contributor curating the concept dictionary, I want each concept to be assignable to a specific statement (BS/IS/CF/EQ), so that revenue never accidentally matches a balance-sheet row.
- As a library user, I want Stock("PTT").income_statement, .balance_sheet, .cash_flow_statement properties that return pivoted (wide) DataFrames, so that I don't have to learn the long-format schema for basic use.
- As a maintainer parsing decade-old filings, I want a single parser code path regardless of whether the source filing is legacy .XLS/.DOC or modern .XLSX/.DOCX, so that I don't maintain two divergent parsers as edge cases accumulate.
- As a library user pinning to v2, I want a clear CHANGELOG note explaining the column-name realignment between v1.x Finnomena output and the new dataset schema, so that I can migrate my code in one sitting.
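The cross-company screening story ("all SET-listed companies with CapEx > 10B THB in FY2025") can be sketched as a single SQL query. The example below uses sqlite3 from the standard library as a stand-in for DuckDB so it runs without extra dependencies; the table contents, the symbols, and the 'FY2025' period encoding are assumptions made for illustration. CapEx is compared by absolute value because the PRD's own fixture value for CapEx is negative (a cash outflow).

```python
import sqlite3

# In-memory stand-in for financial_lines; all values are made up.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE financial_lines (
        symbol TEXT, period TEXT, statement TEXT, concept TEXT,
        value REAL, audit_basis TEXT, consolidation TEXT
    )
    """
)
con.executemany(
    "INSERT INTO financial_lines VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("AAA", "FY2025", "CF", "capex", -15e9, "audited", "consolidated"),
        ("BBB", "FY2025", "CF", "capex", -2e9, "audited", "consolidated"),
        ("CCC", "FY2025", "CF", "capex", -40e9, "audited", "consolidated"),
        ("CCC", "FY2025", "CF", "capex", -35e9, "audited", "company"),
        ("AAA", "FY2024", "CF", "capex", -20e9, "audited", "consolidated"),
    ],
)

# One query replaces one API call per symbol.
query = """
    SELECT symbol, ABS(value) AS capex_thb
    FROM financial_lines
    WHERE concept = 'capex'
      AND period = 'FY2025'
      AND consolidation = 'consolidated'
      AND audit_basis = 'audited'
      AND ABS(value) > 10e9
    ORDER BY capex_thb DESC
"""
big_spenders = con.execute(query).fetchall()
print(big_spenders)
```

Against the published dataset, the same WHERE clause would presumably run in DuckDB with the table name swapped for a read_parquet(...) call over the hosted file; the exact file path is not fixed by this PRD.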
Implementation Decisions
Pipeline modules (CI side)
The pipeline is sliced into deep, independently testable modules. Each has a narrow interface and encapsulates significant logic behind it.
- FilingDiscovery: Scrapes SEC IDISC. Two-stage HTTP crawl: (1) walk /public/idisc/th/company/listed/{0-9,A-Z} for the symbol universe; (2) per symbol, fetch /public/idisc/th/Viewmore/fs-norm?searchSymbol=<S> for filing zip URLs. Emits a manifest of (symbol, filing_id, source_url, period, audit_basis, consolidation, filed_at). The uniqueIDReference parameter is not required; searchSymbol alone is sufficient (verified). Pure function over HTTP; testable with cached HTML fixtures.
- FilingFetcher: Incremental zip downloader. Inputs: manifest + state.json (filing_id → sha256). Outputs: paths to newly-fetched zips + updated state. Polite rate-limiting; resumable.
- FormatNormalizer: Wraps libreoffice --headless --convert-to, targeting xlsx for spreadsheets and docx for documents (one target format per invocation). Inputs: legacy .XLS/.DOC. Outputs: modern .XLSX/.DOCX. Downstream parsers never see legacy formats. Decision rationale: SEC is mid-transition (PTT Feb-2025 was .XLS, COM7 May-2025 was .XLSX); normalizing at ingest collapses two parser paths into one.
- FinancialStatementParser: Deep module. Inputs: path to FINANCIAL_STATEMENTS.XLSX. Outputs: list of (statement, raw_label_th, value, consolidation) rows. Encapsulates: sheet → statement classification (BS-Asset/Liability/Equity, PL_Accum, OCI_Accum, Cash flow), multi-period column extraction (each workbook contains 4 period columns: consolidated current+prior, company current+prior), value parsing (handles Thai numeric formatting, sign conventions, blanks), and per-statement quirks (e.g., negative cash outflows in CF section).
- AuditorReportParser: Deep module. Inputs: path to AUDITOR_REPORT.DOCX. Outputs: structured row (auditor_firm, opinion_type, signing_date, signing_partner, going_concern_emphasis, raw_text_md). Encapsulates Thai-language heuristics for the four opinion types (unqualified / qualified / adverse / disclaimer) and going-concern paragraph detection.
- NotesParser: Shallow. Inputs: path to NOTES.DOCX. Outputs: markdown text only (no per-section structural extraction in v1).
- ConceptMapper: Deep, pure module. Inputs: list of (statement, raw_label_th, ...) rows + concepts.parquet. Outputs: rows enriched with concept column (nullable for unmapped). Encapsulates alias resolution and Thai-label normalization (whitespace, parentheticals).
- DatasetBuilder: Composition module. Joins parser outputs into the 5 published parquet tables (financial_lines, concepts, auditor_reports, notes_text, filings). Shallow; mostly orchestration.
- HFPublisher: Wraps huggingface_hub upload + git-tag revision. Shallow.
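The incremental behaviour described for FilingFetcher (manifest in, state.json diff, updated state out) could look roughly like the sketch below. The function names, the manifest dictionaries, and the example URLs are hypothetical; only the filing_id → sha256 shape of the state comes from the module description above.

```python
import hashlib
import json


def select_new_filings(manifest, state):
    """Return manifest entries whose filing_id is not yet recorded in state."""
    return [entry for entry in manifest if entry["filing_id"] not in state]


def record_fetched(state, filing_id, zip_bytes):
    """Return a copy of state with the sha256 of the fetched zip recorded."""
    updated = dict(state)
    updated[filing_id] = hashlib.sha256(zip_bytes).hexdigest()
    return updated


# Hypothetical manifest rows and state.json content.
manifest = [
    {"filing_id": "F-001", "source_url": "https://example.invalid/F-001.zip"},
    {"filing_id": "F-002", "source_url": "https://example.invalid/F-002.zip"},
]
state = json.loads('{"F-001": "' + "0" * 64 + '"}')

todo = select_new_filings(manifest, state)             # only F-002 is new
state = record_fetched(state, "F-002", b"fake zip bytes")
remaining = select_new_filings(manifest, state)         # nothing left to fetch
print([entry["filing_id"] for entry in todo], remaining)
```

Keeping the diff a pure function of (manifest, state) means it can be unit-tested without network access, with the download itself left to the integration smoke test.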
Library modules
- DatasetClient: Deep, pure-interface module on the library side. Encapsulates: HF URL composition from (revision, table), DuckDB connection management, parquet predicate-pushdown queries, session-level lru_cache memoization. Used by all dataset-backed Stock methods.
- StockSourceRouter: Shallow dispatch layer. Stock.__init__ accepts source="dataset"|"live"; methods like quarter_dataframe delegate to DatasetClient or to the existing Finnomena/ThaiSecuritiesData services accordingly. New methods (.capex, .notes, .auditor_report, .income_statement, etc.) are dataset-only and raise a clear error if source="live".
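Two of the DatasetClient responsibilities, URL composition from (revision, table) and session-level lru_cache memoization, are sketched below. The resolve-style URL follows the Hugging Face Hub's file URL pattern, but the flat "<table>.parquet" layout, the function names, and the sample rows are assumptions. The network read is simulated with a list so the caching effect is observable.

```python
from functools import lru_cache

HF_REPO = "thaifin/financials"  # dataset namespace named in this PRD
fetch_log = []                  # records every simulated network read


def parquet_url(revision: str, table: str) -> str:
    """Compose a Hub file URL; the '<table>.parquet' layout is an assumption."""
    return (
        f"https://huggingface.co/datasets/{HF_REPO}"
        f"/resolve/{revision}/{table}.parquet"
    )


@lru_cache(maxsize=128)
def load_symbol_lines(revision: str, symbol: str) -> tuple:
    """Stand-in for one DuckDB query per (revision, symbol); returns a tuple
    so the cached value cannot be mutated by callers."""
    fetch_log.append(parquet_url(revision, "financial_lines"))
    # Made-up rows: (concept, period, value).
    return (("capex", "2025Q4", -200.0), ("revenue", "2025Q4", 1000.0))


def concept_series(revision: str, symbol: str, concept: str) -> dict:
    """Filter the cached rows of one symbol down to a single concept."""
    lines = load_symbol_lines(revision, symbol)
    return {period: value for c, period, value in lines if c == concept}


capex = concept_series("2026.05", "XYZ", "capex")
revenue = concept_series("2026.05", "XYZ", "revenue")  # served from the cache
print(len(fetch_log), load_symbol_lines.cache_info().hits)
```

Caching at the (revision, symbol) level rather than per concept is one way to satisfy the notebook story where .capex followed by .revenue only touches the network once; including the revision in the cache key keeps pinned revisions from leaking into each other.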
Schema
Five published parquet tables. Detailed column specs live in notes/cicd-design.md; the load-bearing summary:
financial_lines.parquet — long-format facts. PK is composite (symbol, period, statement, concept, consolidation, audit_basis, filing_id). concept is nullable.
concepts.parquet — ~120 curated concepts × (statement, label_en, label_th, aliases_th, xbrl_ref). The editorial product. Sourced from a human-maintained data/concepts.csv in the repo; converted to parquet at build time.
auditor_reports.parquet — one row per filing × structured fields + raw_text_md.
notes_text.parquet — one row per filing × raw_text_md only.
filings.parquet — provenance and incremental state: filing_id, source URL, sha256, fetch timestamp.
Distribution
- Dataset host: hf.co/datasets/thaifin/financials.
- Revisions tagged per build: 2026.05, 2026.06, … (year.month).
- Cadence: cron: '0 2 1 * *' (1st of each month, 02:00 UTC). workflow_dispatch always available.
- The library pins a default revision per library release; users override with thaifin.set_data_revision(...).
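The year.month revision scheme can be derived from the build date. The helper below is a hypothetical sketch of that formatting; the function name is not part of the planned API.

```python
from datetime import date


def revision_tag(build_date: date) -> str:
    """Format a build date as a zero-padded year.month revision tag."""
    return f"{build_date.year}.{build_date.month:02d}"


tags = [revision_tag(date(2026, month, 1)) for month in (5, 6, 12)]
print(tags)
```

Zero-padding the month keeps tags in chronological order under plain string sorting, which is convenient when resolving "the previous published revision".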
Library API contract
import thaifin
from thaifin import Stock

# Default: streams from the published dataset
Stock("PTT").capex # pd.Series indexed by period
Stock("PTT").quarter_dataframe # wide-pivoted; column names follow the new schema
Stock("PTT").cash_flow_statement # wide DataFrame
Stock("PTT").auditor_report # DataFrame, one row per period
Stock("PTT").notes # DataFrame with raw_text_md per period
# Opt back to live (legacy v1.x behavior)
Stock("PTT", source="live").quarter_dataframe # hits Finnomena
# Pin a specific dataset version
Stock("PTT", revision="2026.05").capex
thaifin.set_data_revision("2026.05") # module-wide default
# Offline / sandboxed
thaifin.download_dataset() # one-time bulk fetch into ~/.cache/thaifin/<revision>/
# Power-user query
thaifin.financial_lines() # pl.LazyFrame against the active revision
Schema migration note
The new quarter_dataframe column names align with the tagged schema, not v1.x Finnomena names. This is a breaking change at the column level even though the API shape is preserved. The CHANGELOG will provide a one-page rename map.
Dependencies
Add: duckdb, huggingface_hub, python-docx, openpyxl (already dev-dep, promote to runtime), polars. System-side in CI only: libreoffice-core, libreoffice-calc, libreoffice-writer. Existing httpx, pydantic, cachetools, tenacity, pandas stay.
CI runtime concern (known unknown)
First-build estimate is ~17h (40K filings × ~1.5s normalize+parse). GitHub Actions free tier caps single jobs at 6h. Mitigation: shard by symbol-letter (36 parallel jobs), or do a one-time backfill on a beefier machine and let monthly CI handle incremental from then on. Decision deferred to first build — current plan is "try sharding first."
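The runtime estimate and the sharding mitigation can be checked with a few lines of arithmetic. The filing count and per-filing cost come from the paragraph above; the even-split figure is an idealisation, since symbols are not uniformly distributed across first characters, and shard_for is a hypothetical helper.

```python
import string

FILINGS = 40_000          # first-build estimate from the text
SECONDS_PER_FILING = 1.5  # normalize + parse, from the text
SHARDS = list(string.digits + string.ascii_uppercase)  # 0-9 and A-Z

total_hours = FILINGS * SECONDS_PER_FILING / 3600
even_shard_minutes = total_hours * 60 / len(SHARDS)


def shard_for(symbol: str) -> str:
    """Assign a symbol to the shard named after its first character."""
    return symbol[0].upper()


print(f"{total_hours:.1f} h serial")
print(f"{even_shard_minutes:.1f} min per shard if evenly split over {len(SHARDS)} jobs")
print(shard_for("PTT"))
```

Even if the busiest letter carried several times its even share of filings, a shard would still sit well under the 6h job cap, which supports trying sharding before a one-time backfill.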
Testing Decisions
A good test for this codebase verifies external behavior — what users observe — not implementation details. Tests assert on parsed data values, query outputs, and structured fields; they do not assert on internal helper function names, intermediate DataFrame shapes, or the specific regex used inside a parser. Tests use real fixture filings (committed in tests/sample_data/*.zip) rather than mocks, so that parser changes are validated against actual SEC documents.
Modules to unit-test (deep, isolated, fixture-driven)
- FinancialStatementParser. Fixture: tests/sample_data/PTT_2025Q4.zip. Assert: known CapEx value (-159,512,958,954 for FY2025 consolidated), known revenue, known total assets. Both legacy .XLS and modern .XLSX fixtures covered.
- AuditorReportParser. Fixtures: a clean (unqualified) opinion + one qualified opinion. Assert: opinion_type, auditor_firm, going_concern_emphasis correctly extracted.
- ConceptMapper. Pure function. Inputs: synthetic rows + minimal dictionary. Assert: aliases resolve, statement scoping prevents cross-statement false matches, unmapped rows survive with concept=None.
- FilingDiscovery. Fixture: cached HTML of one letter-page + one fs-norm response. Assert: yields expected (symbol, source_url) tuples; handles pagination.
- DatasetClient. Mock HTTP layer + DuckDB. Assert: revision resolution → correct URL, lru_cache deduplicates queries within a session, query results return expected schema.
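As an illustration of the ConceptMapper test described in the list above, here is a self-contained sketch with a toy mapper and the three assertions (alias resolution, statement scoping, unmapped rows surviving). The normalization rules, the function names, and the English placeholder labels are assumptions; the real mapper works on Thai labels and concepts.parquet.

```python
import re


def normalize_label(label: str) -> str:
    """Drop parenthetical suffixes and collapse whitespace (assumed rules)."""
    label = re.sub(r"\([^)]*\)", "", label)
    return re.sub(r"\s+", " ", label).strip()


def map_concepts(rows, dictionary):
    """Attach a concept to each (statement, raw_label) row, or None if unmapped."""
    lookup = {}
    for entry in dictionary:
        for alias in entry["aliases"]:
            key = (entry["statement"], normalize_label(alias))
            lookup[key] = entry["concept"]
    return [
        (statement, raw_label, lookup.get((statement, normalize_label(raw_label))))
        for statement, raw_label in rows
    ]


# Minimal dictionary with placeholder English labels instead of real Thai ones.
dictionary = [
    {"concept": "revenue", "statement": "IS",
     "aliases": ["Total revenue", "Revenues"]},
]
rows = [
    ("IS", "Total  revenue (restated)"),  # alias match after normalization
    ("BS", "Total revenue"),              # same label, wrong statement
    ("IS", "Some long-tail line"),        # no mapping in the dictionary
]

mapped = map_concepts(rows, dictionary)
assert [concept for _, _, concept in mapped] == ["revenue", None, None]
assert len(mapped) == len(rows)  # unmapped rows survive with concept=None
print(mapped)
```

Keying the lookup on (statement, normalized label) rather than the label alone is what enforces the statement scoping in this sketch.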
Modules covered by integration tests only
- FormatNormalizer (subprocess to libreoffice — slow, environment-dependent)
- HFPublisher (network IO)
- FilingFetcher (mostly IO + state file manipulation; covered by a single end-to-end smoke test)
End-to-end smoke test
A single test that, given a fixture zip URL, runs the full pipeline (discover → fetch → normalize → parse → apply concepts → build parquets) and asserts that a query against the resulting parquet returns the expected CapEx value. Slow; runs only on push to main and in nightly CI.
Prior art
The codebase's existing test pattern (tests/) is pytest with unittest.mock.patch for service-level mocks and a tests/public_internet_tests/ directory for live-network tests gated by markers. New tests follow that convention. Fixture files live under tests/sample_data/ alongside the existing get_financial_sheet.json.
Out of Scope
- Industry-specific concept variants (banks, REITs, insurers, brokerages). v1 of the concept dictionary aims for the cross-industry intersection plus general-industrial-company line items. Industry-specialized concepts (e.g., interest_income for banks, property_income for REITs) deferred to v2.1 once the dictionary has been stressed against real filings.
- Restatement handling. When a company refiles a prior period, v1 keeps the most recent filing's values and treats the superseded filing as inactive in filings.parquet. A formal restatement model (with superseded_by chains and time-of-knowledge queries) is deferred.
- LLM-based auditor opinion classification. v1 uses Thai regex/keyword heuristics for opinion_type extraction. LLM-assisted extraction (with manual override list) is a v2.1 candidate.
- Notes structural extraction. v1 stores notes as raw markdown only. Parsing specific note sections (segment reporting tables, related-party transaction tables, lease maturity schedules) into structured tables is deferred.
- 56-1 / 56-2 annual report ingestion. Those are separate document types in IDISC and contain narrative + governance + risk disclosures. Out of scope for v1; the dataset is financial-statements-only.
- Real-time / intra-monthly freshness. The dataset updates monthly. Users who need filings the day they're posted should use source="live" (Finnomena) or build their own scraper.
- Removal of the live Finnomena and ThaiSecuritiesData source modules. They remain in tree under thaifin/sources/ for source="live". Removal is a v3.0 candidate after measuring source="live" usage in the wild.
- Authentication / private datasets / commercial tier. The dataset is public and free.
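For the restatement item in the list above, the v1 rule (keep the most recent filing, treat superseded ones as inactive) could be expressed as the sketch below. Grouping by (symbol, period), the active flag name, and ISO-formatted filed_at strings are assumptions; the PRD does not fix the column used to mark inactive filings.

```python
def mark_active(filings):
    """Flag the latest filing per (symbol, period) as active, the rest inactive."""
    latest = {}
    for filing in filings:
        key = (filing["symbol"], filing["period"])
        # ISO date strings compare correctly as plain strings.
        if key not in latest or filing["filed_at"] > latest[key]["filed_at"]:
            latest[key] = filing
    return [
        dict(filing, active=latest[(filing["symbol"], filing["period"])] is filing)
        for filing in filings
    ]


# Made-up filings: F-2 refiles the same period later than F-1.
filings = [
    {"filing_id": "F-1", "symbol": "XYZ", "period": "2024Q4", "filed_at": "2025-02-20"},
    {"filing_id": "F-2", "symbol": "XYZ", "period": "2024Q4", "filed_at": "2025-06-01"},
    {"filing_id": "F-3", "symbol": "XYZ", "period": "2025Q1", "filed_at": "2025-05-15"},
]

flagged = mark_active(filings)
print({f["filing_id"]: f["active"] for f in flagged})
```

This keeps superseded filings in the table for provenance while letting queries filter on the flag, which leaves room for the deferred superseded_by model later.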
Further Notes
- Concept dictionary as PR-driven artifact: data/concepts.csv (CSV for easy human review) is the source of truth. CI converts it to concepts.parquet at build time. Each row carries concept, statement, label_en, label_th, aliases_th (pipe-separated), xbrl_ref. Contributors propose new concepts or alias additions via PR against this CSV.
- CI coverage gate: build fails if the % of financial_lines rows with concept IS NOT NULL drops by more than a configurable threshold (default 1pp) versus the previous published revision. Prevents concept regression.
- Provenance traceability: every value can be traced back to a specific filing via filing_id → filings.source_url → the exact SEC zip. The zip's sha256 is recorded so we can detect upstream re-uploads.
- License: Dataset published under CC-BY-4.0 (typical for derivative datasets of public regulatory disclosures). Library remains ISC. Each parquet's README on HuggingFace cites Thai SEC IDISC as the primary source and links back to the original filing URLs.
- Migration path: v1.x users with source unset will get a DeprecationWarning on first call for one major version cycle, then v2.0 silently defaults to source="dataset".
- Open question on dataset namespace: published as thaifin/financials (requires creating the thaifin org on HF) or as ninyawee/thaifin-financials (user namespace). Org gives a cleaner researcher-facing brand but requires a one-time HF org setup. Defaulting to thaifin/financials in this PRD; reversible at first publish.
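The CI coverage gate from the notes above reduces to a small comparison in percentage points. The sketch below assumes rows are dictionaries with a concept key and that the previous revision's coverage is available as a number; both function names are hypothetical.

```python
def coverage(rows) -> float:
    """Share of rows with a non-null concept, in percent."""
    if not rows:
        return 0.0
    mapped = sum(1 for row in rows if row["concept"] is not None)
    return 100.0 * mapped / len(rows)


def coverage_gate(previous_pct: float, current_pct: float,
                  max_drop_pp: float = 1.0) -> bool:
    """Return True if the build may publish, False if coverage regressed
    by more than max_drop_pp percentage points."""
    return (previous_pct - current_pct) <= max_drop_pp


# Made-up rows: three of four carry a concept.
rows = [
    {"concept": "capex"},
    {"concept": None},
    {"concept": "revenue"},
    {"concept": "ebitda"},
]
current = coverage(rows)
small_drop_ok = coverage_gate(previous_pct=75.5, current_pct=current)
large_drop_ok = coverage_gate(previous_pct=80.0, current_pct=current)
print(current, small_drop_ok, large_drop_ok)
```

The gate only looks at drops, so a build that improves coverage always passes; the threshold is a parameter to match the "configurable" wording above.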