Improve link checking: stop suppressing status 0 (dead links) while keeping bot-block/rate-limit noise out #2110

Description

@DaveSkender

Problem

Our broken-link check (Website URLs workflow + docs/.vitepress/test-links.sh) runs HTML-Proofer with a blanket --ignore-status-codes list that includes 0. Status 0 is HTML-Proofer's catch-all for "couldn't get a real HTTP response" — DNS resolution failure, TLS handshake failure, connection refused, and timeouts. Suppressing it means we silently pass genuinely dead links, which defeats the entire purpose of the check.

Concrete examples found while triaging PR #2106:

  • https://v2.dotnet.stockindicators.dev : no DNS record (returns status 0). Currently on every page via the "Legacy docs (v2)" nav item and silently passing.
  • https://school.stockcharts.com/doku.php?id=technical_indicators:kaufman_s_adaptive_moving_average : dead (status 0). Was passing only because 0 is exempted.

Current config (restored in PR #2106 to get CI green short-term):

--ignore-status-codes "0,302,402,403,406,408,415,429,502,503,999"
--ignore-urls "/fonts.gstatic.com/,/github\.com\/DaveSkender\/Stock\.Indicators\/(blob|tree)\//"

Goal

Catch legitimately dead links again while not flapping on the noise that forced the blanket exemptions in the first place:

  • Keep suppressing genuine bot-blocking / rate-limiting that we can't fix: 401/402/403/406/429/999, and the GitHub 429s on auto-generated edit/blob/tree/discussions links. (We originally exempted the GitHub Discussion links specifically due to 429 rate-limiting.)
  • Stop blanket-suppressing 0 so DNS/dead-host failures surface again.

Tension to solve

0 is overloaded: it covers both "this host is dead" (want to fail) and "the request timed out / TLS hiccup / transient network" (don't want to flake). Any solution needs to separate these.
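
To make the split concrete, here is a minimal Python sketch of how a checker could tell the two kinds of status 0 apart by looking at the underlying network error. It is not how HTML-Proofer works internally; the function name `classify_failure` and the "dead"/"transient" labels are hypothetical, and treating a refused connection as "dead" is an assumption we may want to revisit.

```python
import socket
import ssl


def classify_failure(exc: BaseException) -> str:
    """Return "dead" for errors that suggest a permanently broken link,
    or "transient" for errors that are worth retrying."""
    if isinstance(exc, socket.gaierror):
        # EAI_AGAIN means the resolver had a temporary problem,
        # not that the name is missing.
        if exc.errno == getattr(socket, "EAI_AGAIN", None):
            return "transient"
        return "dead"  # e.g. no DNS record for the host
    if isinstance(exc, ConnectionRefusedError):
        return "dead"  # assumption: nothing is listening on that host
    if isinstance(exc, (TimeoutError, socket.timeout, ssl.SSLError)):
        return "transient"  # slow host or TLS hiccup
    # Unknown errors are treated as transient so they get retried first.
    return "transient"
```

Only the "dead" bucket would fail the check immediately; the "transient" bucket would go through a retry before being reported.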

Candidate approaches

  1. Per-host/per-URL allowlist instead of blanket status codes. Drop 0 from --ignore-status-codes; explicitly --ignore-urls the known bot-blocking domains (investopedia, medium, coinbase, codeburst, jstor, forex-station, etc.) and the GitHub auto-generated paths. Pros: dead hosts surface. Cons: allowlist needs occasional maintenance; timeouts on otherwise-good hosts can still flake.
  2. Add retry/backoff + timeout tuning (HTML-Proofer --hydra/typhoeus options, or wrap with retries) so transient 0s self-heal, then fail on persistent 0. Reduces flake from genuine-but-slow hosts.
  3. Switch tools to lychee. Native retry, response caching, per-status --accept, regex excludes, and a maintained excludes set for bot-blockers. Generally better signal-to-noise than HTML-Proofer for external links.
  4. Split internal vs external checks. Keep internal-link + curated-external checking blocking on PRs (cheap, deterministic); move full external sweep to a scheduled (cron) job that opens/updates an issue on dead links. PRs stop flaking on third-party outages while dead links still get reported.
  5. Suppress search-path referral URLs by pattern (e.g. *google.com/search*). We intentionally use Google-search referrals where a stable canonical source doesn't exist (kama.md, getting-started.md); a small regex keeps those out of scope rather than relying on status codes.
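
As a rough illustration of approach 2, the retry-then-fail idea could look like the Python sketch below. It is tool-agnostic: `check` stands in for whatever actually performs the request, and the names `check_with_retry`, `attempts`, and `base_delay` are hypothetical, not options of HTML-Proofer or lychee.

```python
import time
from typing import Callable


def check_with_retry(
    check: Callable[[str], int],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check(url) up to `attempts` times.

    Returns the first non-zero status, or 0 if every attempt returned 0.
    """
    status = 0
    for attempt in range(attempts):
        status = check(url)
        if status != 0:
            return status  # got a real HTTP response; stop retrying
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return status  # still 0 after all attempts: treat as persistent
```

With this shape, a transient status 0 heals on a later attempt, while a host with no DNS record still returns 0 after the last attempt and can be failed. The `sleep` parameter is injectable so the backoff can be tested without waiting.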

Suggested direction

Likely a combination: approach 1 plus approach 3 (or 2) for signal, plus approach 4 to keep PRs non-flaky, plus approach 5 for the deliberate search-referral links. Whatever we pick, the acceptance test is: re-introducing a known-dead URL must fail the check.

Acceptance criteria

  • A genuinely dead link (DNS failure / dead page → status 0) fails the check.
  • Known bot-blocking 4xx (403/402/406) and GitHub 429 rate-limiting do not fail the check.
  • Transient timeouts do not cause flaky failures (retry or scheduled-job model).
  • Deliberate search-path referral links are handled by pattern, not by status-code suppression.
  • Config kept in sync between .github/workflows/test-website-links.yml and docs/.vitepress/test-links.sh.
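
The criteria above can be read as a small decision policy. The Python sketch below is one possible encoding, useful mainly as an executable statement of what "pass" and "fail" should mean. The status set comes from the Goal section and the GitHub pattern from the current `--ignore-urls` value; the Google-search regex and the names `TOLERATED_STATUSES`, `IGNORED_URL_PATTERNS`, and `link_fails` are assumptions for illustration. It also assumes status 0 has already been retried before reaching this function.

```python
import re

# Bot-blocking / rate-limiting codes we cannot fix (from the Goal section).
TOLERATED_STATUSES = {401, 402, 403, 406, 429, 999}

# URLs kept out of scope by pattern rather than by status code.
IGNORED_URL_PATTERNS = [
    # Assumed regex for the deliberate Google-search referral links.
    re.compile(r"^https?://(www\.)?google\.com/search"),
    # GitHub auto-generated blob/tree links (mirrors the current --ignore-urls).
    re.compile(r"^https?://github\.com/DaveSkender/Stock\.Indicators/(blob|tree)/"),
]


def link_fails(url: str, status: int) -> bool:
    """Return True if this (url, status) result should fail the check."""
    if any(p.search(url) for p in IGNORED_URL_PATTERNS):
        return False
    if status in TOLERATED_STATUSES:
        return False
    # Status 0 (no HTTP response) and any other 4xx/5xx count as broken.
    return status == 0 or status >= 400
```

For example, `link_fails("https://v2.dotnet.stockindicators.dev", 0)` is true, while a 403 from a bot-blocking domain or any result for a Google-search referral is not a failure.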

Follow-up from PR #2106, which restored the blanket exemptions as a stopgap to unblock CI.

Labels: chore (Administrative chore, usually docs or build related); code quality (Test or general code quality related); documentation (Improvements or additions to documentation)