Summary
The texbizct (Texas Business Court) scraper has been failing intermittently since 2026-05-01 and has ingested no opinions since 2026-05-18.
Root cause
_get_approximate_date issues a HEAD request per opinion link and parses the Last-Modified header with no status-code check and no None guard:
resp = await self.request["session"].head(url, follow_redirects=True, timeout=30)
lm = resp.headers.get("Last-Modified")
dt = parser.parse(lm) # ← lm is None → TypeError
www.txcourts.gov is behind an Azure Front Door WAF. Its 403 block page ("The request is blocked.", x-azure-ref header) carries no Last-Modified header. In prod, the GET of
the listing page still succeeds, but the HEAD requests to the PDFs are being blocked or served without the header. Because the resulting TypeError is raised inside
_process_html and propagates up through parse(), the whole scrape dies on the first affected link and nothing is ingested for the run.
The blocking appears to be escalating: from a residential IP, even the listing GET now returns 403 (with curl and with full browser headers).
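One possible shape for the missing guards is sketched below. It is not the scraper's actual code: the helper names approximate_date and is_waf_block are hypothetical, and the sketch swaps the dateutil parser.parse call for the standard library's email.utils.parsedate_to_datetime so that it runs without third-party packages. It takes a status code and a header mapping rather than a live response, so how it would plug into _get_approximate_date is left as an assumption.

```python
from datetime import date
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def _get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Case-insensitive header lookup for a plain mapping."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


def is_waf_block(status_code: int, headers: Mapping[str, str]) -> bool:
    """Heuristic (assumption): a 403 that carries an x-azure-ref header
    looks like the Azure Front Door block page described above."""
    return status_code == 403 and _get_header(headers, "x-azure-ref") is not None


def approximate_date(status_code: int, headers: Mapping[str, str]) -> Optional[date]:
    """Return the Last-Modified date, or None when the response cannot supply one.

    Returning None instead of raising lets the caller decide whether to skip
    the link or fall back to another date, rather than aborting the whole run.
    """
    # Status-code check: a blocked or failed HEAD has no usable date.
    if not 200 <= status_code < 300:
        return None

    # None guard: the header may be absent even on a successful response.
    last_modified = _get_header(headers, "Last-Modified")
    if not last_modified:
        return None

    # Malformed values raise ValueError (TypeError on older Pythons).
    try:
        return parsedate_to_datetime(last_modified).date()
    except (TypeError, ValueError):
        return None
```

With helpers like these, a caller could log a distinct warning when is_waf_block is true and skip only the affected link, so one blocked HEAD request would no longer discard the rest of the listing.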
Impact
- Intermittent failures May 1–18 (some runs survived: clusters created May 4, 8, 11, 13, 14, 15, 18).
- Zero opinions since 2026-05-18 (~2.5 weeks dark). At the May cadence (~2–3 opinions/week), roughly 5–8 opinions are likely missing.
- tex and texapp use the same domain but don't do HEAD requests, so they're not hit by this bug — though the WAF escalation could affect them next.
┌──────────────┬────────────┬──────────────────────────────────────────────────┐
│ date_created │ date_filed │ case_name                                        │
├──────────────┼────────────┼──────────────────────────────────────────────────┤
│ 2026-05-18   │ 2026-05-16 │ Plains Pipeline v. Arrowhead Gulf Coast Holdings │
└──────────────┴────────────┴──────────────────────────────────────────────────┘
Sentry Issue: COURTLISTENER-CY2
TypeError: Parser must be a string or character stream, not NoneType
(7 additional frame(s) were not displayed)
...
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 514, in handle
    async_to_sync(self.parse_and_scrape_site)(mod, options)
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 477, in parse_and_scrape_site
    site = await mod.Site(save_response_fn=save_response).parse()