
Extract SEC IDISC download as a standalone module + CLI (consolidate operational fixes) #21

@ninyawee


The SEC IDISC download responsibility is currently spread across 4 modules + a script:

The operational gotchas surfaced during the 9-stock SET100 seed backfill (commit `2205a9d`) live only in the script:

  1. IPv6 timeout — `market.sec.or.th` resolves to IPv6 from some networks; the AAAA endpoint times out. Worked around by monkey-patching `socket.getaddrinfo` to force IPv4 in `build_seed10.py`.
  2. WAF UA-banning — SEC IDISC rejects `python-httpx/*` and `thaifin/2.0` user agents after ~3-4 successful symbols, returning an F5/Imperva JS challenge page with HTTP 200 (5-45KB body). Worked around by patching `httpx.Client` to send a Chrome UA + `Accept-Language` headers.
  3. Polite delay required — `SEC_REQUEST_DELAY` env var, default 0.5s.

These belong in the library, not in a one-off script. They will hit any user who runs the existing `FilingFetcher` against more than ~3 symbols in a row.
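The IPv4 workaround described in item 1 can be sketched roughly as follows. This is an illustration of the general monkey-patching technique, not the code from `build_seed10.py`; the helper names `make_ipv4_only` and `force_ipv4` are hypothetical.

```python
import socket


def make_ipv4_only(original_getaddrinfo):
    """Wrap a getaddrinfo-compatible callable so it only asks for IPv4 results."""

    def ipv4_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore whatever address family the caller requested and force AF_INET,
        # so no AAAA (IPv6) address is ever returned.
        return original_getaddrinfo(host, port, socket.AF_INET, type, proto, flags)

    return ipv4_getaddrinfo


def force_ipv4():
    """Patch socket.getaddrinfo process-wide; returns a callable that undoes it."""
    original = socket.getaddrinfo
    socket.getaddrinfo = make_ipv4_only(original)

    def restore():
        socket.getaddrinfo = original

    return restore
```

Because the patch is process-wide, it affects every library that resolves hostnames through `socket.getaddrinfo`, which is one reason to move it behind an explicit client option rather than leave it in a script.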

What to build

1. Consolidate into a coherent download module

Refactor `thaifin/sources/sec_idisc/` so that all four current files cooperate as a single "SEC IDISC client" rather than four loosely-coupled helpers. Suggested public surface:

```python
from thaifin.sources.sec_idisc import IDISCClient

client = IDISCClient() # operational defaults baked in
symbols = client.list_symbols() # walks /company/listed/{0-9,A-Z}
manifest = client.discover_filings('PTT') # per-symbol fs-norm page → manifest
result = client.fetch_filing(manifest_entry) # incremental zip download
oox_path = client.normalize(result.zip_path) # libreoffice conversion if legacy
```

Keep the existing function-style API as a thin wrapper around the client for back-compat, so that the slice #14-#18 tests still pass.

2. Move operational gotchas into the client

  • `IDISCClient(force_ipv4=True)` — default ON; uses `socket.AF_INET` filter or DNS override
  • `IDISCClient(user_agent=...)` — default to a current Chrome UA + `Accept-Language: th,en;q=0.9`
  • `IDISCClient(min_delay_s=0.5)` — configurable polite delay between requests (overridable via `SEC_REQUEST_DELAY` env var)
  • `IDISCClient()` — detects WAF challenge responses (HTTP 200 with `/_Incapsula_Resource` or `/IFRAME-Anti-Bot` signatures in body) and raises a typed exception (`WAFChallengeError`) so callers can decide whether to back off, rotate UAs, or fail fast.
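A minimal sketch of the last two bullets, assuming the two body signatures listed above are sufficient to identify a challenge page. The names `raise_if_waf_challenge`, `resolve_min_delay`, and `Throttle` are hypothetical, and letting the env var override the default is one reading of "overridable".

```python
import os
import time

# Signatures taken from the bullet above; the challenge arrives as HTTP 200.
WAF_SIGNATURES = ("/_Incapsula_Resource", "/IFRAME-Anti-Bot")


class WAFChallengeError(RuntimeError):
    """Raised when a response looks like a WAF challenge page, not real content."""


def raise_if_waf_challenge(status_code: int, body: str) -> None:
    # The status code alone cannot be trusted, so inspect the body as well.
    if status_code == 200 and any(sig in body for sig in WAF_SIGNATURES):
        raise WAFChallengeError("response body matches a WAF challenge signature")


def resolve_min_delay(default: float = 0.5) -> float:
    """Use SEC_REQUEST_DELAY when it is set and parses as a number, else default."""
    raw = os.environ.get("SEC_REQUEST_DELAY")
    if raw is None:
        return default
    try:
        return max(float(raw), 0.0)
    except ValueError:
        return default


class Throttle:
    """Ensures at least min_delay_s elapses between consecutive wait() calls."""

    def __init__(self, min_delay_s, clock=time.monotonic, sleep=time.sleep):
        self._min_delay_s = min_delay_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_delay_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would invoke `Throttle.wait()` before each request and `raise_if_waf_challenge(...)` on each response, then decide in an `except WAFChallengeError` branch whether to back off, rotate UAs, or fail fast.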

3. CLI entry point

```
uv run python -m thaifin.sec_idisc symbols --out data/symbols.csv
uv run python -m thaifin.sec_idisc discover --symbol PTT --out /tmp/ptt/manifest.json
uv run python -m thaifin.sec_idisc fetch --manifest /tmp/ptt/manifest.json --out /tmp/ptt/zips/ --state-file data/state.json
uv run python -m thaifin.sec_idisc normalize --in /tmp/ptt/zips/ --out /tmp/ptt/normalized/
uv run python -m thaifin.sec_idisc bulk --letter A --out /tmp/letter-a/ --cleanup-zips
```

Replaces the ad-hoc scripts (`scripts/build_seed10.py`, parts of `scripts/backfill.py`) with a unified surface. The `bulk` subcommand is what the monthly CI workflow (`.github/workflows/data-build.yml`) calls.
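The subcommand layout above could be wired with `argparse` along these lines. This is only a sketch of the argument surface: which flags are required is an assumption, `build_parser` is a hypothetical name, and no subcommand behaviour is implemented.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m thaifin.sec_idisc")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("symbols", help="list symbols")
    p.add_argument("--out", required=True)

    p = sub.add_parser("discover", help="build a per-symbol filing manifest")
    p.add_argument("--symbol", required=True)
    p.add_argument("--out", required=True)

    p = sub.add_parser("fetch", help="download zips listed in a manifest")
    p.add_argument("--manifest", required=True)
    p.add_argument("--out", required=True)
    p.add_argument("--state-file")

    p = sub.add_parser("normalize", help="convert downloaded filings")
    # "in" is a Python keyword, so store the value under a different attribute.
    p.add_argument("--in", dest="in_dir", required=True)
    p.add_argument("--out", required=True)

    p = sub.add_parser("bulk", help="run the whole download for one letter")
    p.add_argument("--letter", required=True)
    p.add_argument("--out", required=True)
    p.add_argument("--cleanup-zips", action="store_true")

    return parser
```

A `thaifin/sec_idisc/__main__.py` (path assumed from the `python -m` invocations above) would call `build_parser().parse_args()` and dispatch on `args.command`.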

4. Decouple download from parsing

After this refactor, the pipeline becomes:

```
discover → fetch → normalize → [hand off to parser] → tag → publish
        (this issue)               (slice #15)          (cycles)
```

A rebuild from cached zips skips the discover/fetch steps entirely, which makes dictionary cycles much faster. We already do this at the parquet level via re-tagging; this extends the same pattern to the upstream layer.
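The cache-skip idea can be sketched as a small helper that only calls the downloader when no usable zip is on disk. The name `ensure_zip` and the "non-empty file counts as cached" rule are assumptions for illustration; the real fetcher may track this through its state file instead.

```python
from pathlib import Path
from typing import Callable


def ensure_zip(
    zip_dir: Path, filename: str, download: Callable[[Path], None]
) -> tuple[Path, bool]:
    """Return (path, downloaded); download is called only on a cache miss."""
    target = zip_dir / filename
    # Treat a non-empty file as a valid cached zip (an assumption of this sketch).
    if target.exists() and target.stat().st_size > 0:
        return target, False
    zip_dir.mkdir(parents=True, exist_ok=True)
    download(target)
    return target, True
```

With this shape, a rebuild over an already populated zip directory makes no network calls, so only the normalize and later stages run.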

Acceptance criteria

Out of scope

Origin

Surfaced empirically during the 9-stock SET100 seed backfill (agent run on 2026-05-16). The agent had to discover the IPv6 and WAF gotchas mid-run and patch them in the build script. Documenting and consolidating now so the next backfill (or anyone using `FilingFetcher` directly) doesn't re-hit them.

Refs #11, #17, #20.

Labels: enhancement