The SEC IDISC download responsibility is currently spread across 4 modules + a script:
`scripts/build_seed10.py` (the seed-10 backfill driver, which has the operational fixes the modules don't)
The operational gotchas surfaced during the 9-stock SET100 seed backfill (commit `2205a9d`) live only in the script:
IPv6 timeout — `market.sec.or.th` resolves to IPv6 from some networks; the AAAA endpoint times out. Worked around by monkey-patching `socket.getaddrinfo` to force IPv4 in `build_seed10.py`.
WAF UA-banning — SEC IDISC rejects `python-httpx/*` and `thaifin/2.0` user agents after ~3-4 successful symbols, returning an F5/Imperva JS challenge page with HTTP 200 (5-45KB body). Worked around by patching `httpx.Client` to send a Chrome UA + `Accept-Language` headers.
Polite delay required — `SEC_REQUEST_DELAY` env var, default 0.5s.
These belong in the library, not in a one-off script. They will hit any user who runs the existing `FilingFetcher` against more than ~3 symbols in a row.
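The IPv4 workaround can be sketched roughly as follows. This is a minimal illustration, not the code from `build_seed10.py`: the `force_ipv4` context manager is a hypothetical name, and it assumes the HTTP client resolves hosts through `socket.getaddrinfo`, as standard-library socket code does.

```python
import socket
from contextlib import contextmanager


@contextmanager
def force_ipv4():
    """Temporarily make socket.getaddrinfo return IPv4 (A record) results only.

    Hypothetical helper: patches the process-wide resolver hook, so it
    affects every connection opened while the context is active.
    """
    original = socket.getaddrinfo

    def ipv4_only(host, port, family=0, type=0, proto=0, flags=0):
        # Ignore the requested family and always ask for AF_INET.
        return original(host, port, socket.AF_INET, type, proto, flags)

    socket.getaddrinfo = ipv4_only
    try:
        yield
    finally:
        # Always restore the real resolver, even if the body raised.
        socket.getaddrinfo = original
```

Scoping the patch to a context manager, rather than patching at import time, keeps the override from leaking into unrelated code; whether the library should do it this way or via a custom transport is an open design choice.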
What to build
1. Consolidate into a coherent download module
Refactor `thaifin/sources/sec_idisc/` so that all four current files cooperate as a single "SEC IDISC client" rather than four loosely-coupled helpers. Suggested public surface:
```python
from thaifin.sources.sec_idisc import IDISCClient
client = IDISCClient() # operational defaults baked in
symbols = client.list_symbols() # walks /company/listed/{0-9,A-Z}
manifest = client.discover_filings('PTT') # per-symbol fs-norm page → manifest
result = client.fetch_filing(manifest_entry) # incremental zip download
oox_path = client.normalize(result.zip_path) # libreoffice conversion if legacy
```
Keep the existing function-style API as a thin wrapper around the client for back-compat, so that the slice #14-#18 tests still pass.
2. Move operational gotchas into the client
`IDISCClient(force_ipv4=True)` — default ON; uses `socket.AF_INET` filter or DNS override
`IDISCClient(user_agent=...)` — default to a current Chrome UA + `Accept-Language: th,en;q=0.9`
`IDISCClient(min_delay_s=0.5)` — configurable polite delay between requests (overridable via `SEC_REQUEST_DELAY` env var)
`IDISCClient()` — detects WAF challenge responses (HTTP 200 with `/_Incapsula_Resource` or `/IFRAME-Anti-Bot` signatures in body) and raises a typed exception (`WAFChallengeError`) so callers can decide whether to back off, rotate UAs, or fail fast.
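The challenge detection and polite delay from the list above might be sketched like this. It is a rough outline under stated assumptions: `check_waf` and `Throttle` are hypothetical names, the two signature strings are the ones quoted above, and the precedence (explicit argument first, then `SEC_REQUEST_DELAY`, then 0.5s) is a guess at the intended behaviour.

```python
import os
import time

# Body markers named in this issue as signs of a WAF challenge page.
WAF_SIGNATURES = ("/_Incapsula_Resource", "/IFRAME-Anti-Bot")


class WAFChallengeError(RuntimeError):
    """Raised when a 200 response is really a WAF challenge page."""


def check_waf(status_code: int, body: str, url: str = "") -> None:
    """Raise WAFChallengeError if the response body looks like a challenge."""
    if status_code == 200 and any(sig in body for sig in WAF_SIGNATURES):
        raise WAFChallengeError(f"WAF challenge page returned for {url!r}")


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_s=None, clock=time.monotonic, sleep=time.sleep):
        if min_delay_s is None:
            # Assumed precedence: env var, then the 0.5s default.
            min_delay_s = float(os.environ.get("SEC_REQUEST_DELAY", "0.5"))
        self.min_delay_s = min_delay_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        """Sleep just long enough to keep min_delay_s between calls."""
        if self._last is not None:
            remaining = self.min_delay_s - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

A client would call `Throttle.wait()` before each request and `check_waf(...)` on each response, leaving the back-off or UA-rotation decision to the caller that catches `WAFChallengeError`.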
3. CLI entry point
```
uv run python -m thaifin.sec_idisc symbols --out data/symbols.csv
uv run python -m thaifin.sec_idisc discover --symbol PTT --out /tmp/ptt/manifest.json
uv run python -m thaifin.sec_idisc fetch --manifest /tmp/ptt/manifest.json --out /tmp/ptt/zips/ --state-file data/state.json
uv run python -m thaifin.sec_idisc normalize --in /tmp/ptt/zips/ --out /tmp/ptt/normalized/
uv run python -m thaifin.sec_idisc bulk --letter A --out /tmp/letter-a/ --cleanup-zips
```
This replaces the ad-hoc scripts (`scripts/build_seed10.py`, parts of `scripts/backfill.py`) with a unified surface. The `bulk` subcommand is what the monthly CI workflow (`.github/workflows/data-build.yml`) calls.
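The subcommands above could be wired up with `argparse` along these lines. This is only a sketch of the argument surface: `build_parser` is a hypothetical name, and which flags are required versus optional (for example `--state-file`) is an assumption, not something this issue specifies.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed subcommands and flags."""
    parser = argparse.ArgumentParser(prog="python -m thaifin.sec_idisc")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("symbols", help="list listed symbols")
    p.add_argument("--out", required=True)

    p = sub.add_parser("discover", help="build a filing manifest for a symbol")
    p.add_argument("--symbol", required=True)
    p.add_argument("--out", required=True)

    p = sub.add_parser("fetch", help="download zips named in a manifest")
    p.add_argument("--manifest", required=True)
    p.add_argument("--out", required=True)
    p.add_argument("--state-file")  # assumed optional

    p = sub.add_parser("normalize", help="convert legacy files")
    # "in" is a Python keyword, so store the value under a different name.
    p.add_argument("--in", dest="in_dir", required=True)
    p.add_argument("--out", required=True)

    p = sub.add_parser("bulk", help="run the whole pipeline for one letter")
    p.add_argument("--letter", required=True)
    p.add_argument("--out", required=True)
    p.add_argument("--cleanup-zips", action="store_true")

    return parser
```

A `__main__.py` under the package would then call `build_parser().parse_args()` and dispatch on `args.command`.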
4. Decouple download from parsing
After this refactor, the pipeline becomes:
```
discover → fetch → normalize → [hand off to parser] → tag → publish
        (this issue)               (slice #15)          (cycles)
```
A rebuild from cached zips skips the discover/fetch steps entirely, which makes dictionary cycles much faster. We already do this at the parquet level via re-tagging; this extends the same pattern to the upstream layer.
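The cached-rebuild idea can be illustrated with a small stage planner. This is a sketch under assumptions: `plan_stages` and the stage label `parse` are hypothetical, and "cached" is taken to mean that at least one `.zip` file already exists in the given directory.

```python
from pathlib import Path

# Stage names follow the pipeline diagram; "parse" is the parser hand-off.
STAGES = ["discover", "fetch", "normalize", "parse", "tag", "publish"]


def plan_stages(zip_dir: Path) -> list[str]:
    """Return the stages to run, skipping discover/fetch when zips are cached."""
    has_cache = zip_dir.is_dir() and any(zip_dir.glob("*.zip"))
    # With cached zips, start directly at normalize.
    return STAGES[2:] if has_cache else list(STAGES)
```

A real implementation would probably check the cache against the manifest per filing rather than per directory, so that a partially filled cache only downloads what is missing.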
Out of scope
Dictionary expansion: that's the iteration loop, not download.
The Thai SEC SET100 official-list endpoint: currently blocked by Incapsula, and there's no public CSV; punt for now.
Origin
Surfaced empirically during the 9-stock SET100 seed backfill (agent run on 2026-05-16). The agent had to discover both gotchas mid-run and patch them in the build script. Documenting and consolidating now so the next backfill (or anyone using `FilingFetcher` directly) doesn't re-hit them.
Refs #11, #17, #20.