Draft v0.0. Verification report for papers/coding-philosophy-sources.md §3 Tier C.
Verified 2026-05-11 via WebFetch + WebSearch. Token estimates are
rough (assume ~4 chars/tok, English prose).
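The ~4 chars/tok heuristic above amounts to a one-line calculation. The sketch below is only an illustration: the function name `estimate_tokens` and the sample input are hypothetical, and the divisor is the report's stated assumption rather than a tokenizer measurement.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count, assuming ~4 characters per token for English prose."""
    return round(len(text) / chars_per_token)


# Hypothetical sample: 2,000 characters of prose.
sample = "x" * 2000
print(estimate_tokens(sample))  # prints 500
```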
~60+ long-form articles (Operations Excellence, Architecture, Software Delivery, Mechanical Sympathy categories). The index loads dynamically, so it could not be enumerated through the WebFetch shim. Est. ~400-700 k tok.
Quotability
excerpt + cite — full ingest is legally borderline. Treat as fair-use research-only; flag for commercial-use risk.
PII risk
LOW — articles are pattern/principle-shaped ("Caching challenges and strategies", "Static stability using AZs"). Authored by senior engineers but content is not blame-attributing.
Status
VERIFIED
Top-shelf operational philosophy — load-shedding, jitter/backoff, control-plane vs data-plane, "constant work" pattern. Highest signal-per-token in Tier C. Recommend ingest with attribution and a vendor-fair-use canon tag.
excerpt only, for license reasons. The NC clause is the blocker: training a commercial-grade model arguably violates it, while training a research artifact is a gray area. The ND clause means tokenization may itself count as a "derivative".
PII risk
LOW — Google deliberately de-personalizes their canonical "postmortem-as-learning" framing. Some named authors but no blame attribution.
Status
VERIFIED — license is the binding constraint
The canonical text for blameless-postmortem culture. Note: the Site Reliability Workbook (https://sre.google/workbook/) and classroom materials are CC BY 4.0 (commercial-OK) per Google's site signal — prefer the Workbook for training. Re-check per-page footer before ingest.
~20+ posts under post-mortem tag, range Oct-2023 → May-2026. With pagination, est. 40-60 posts. Each post is long-form (~3-8 k tok). Total est. ~200-400 k tok.
Quotability
full ingest is safe — explicit ai-train signal is the strongest license clarity in Tier C.
PII risk
LOW — Cloudflare post-mortems attribute by author byline but do not blame individuals; root causes are framed systemically.
No explicit license on danluu.com (robots.txt only contains a sitemap). The repo has no LICENSE file either. Source list called this "CC-BY" — that claim is UNVERIFIED.
Volume
Essay: ~10 k tok. Repo: 12.1 k stars, ~389 commits, and hundreds of linked post-mortems, but the repo itself is mostly links, not text. PR backlog flagged ("Sorry for the delay merging PRs").
Quotability
Essay: excerpt-only until license confirmed. Repo: useful as a link-spine for crawler seeds, not as text.
PII risk
LOW for the essay. The linked third-party post-mortems vary.
Status
PARTIAL — license claim in source list is wrong. Treat as research-fair-use until danluu clarifies.
CC BY 4.0 per Google's SRE materials signal (verify per-page) — commercial-OK, unlike the SRE Book which is BY-NC-ND.
Volume
~25 chapters, ~300-400 k tok.
Quotability
full ingest if CC BY 4.0 confirmed.
PII risk
LOW — case studies from Evernote, Home Depot, NYT; no blame.
Status
★ STRONGLY RECOMMENDED — preferred over SRE Book for license clarity.
This is the most underrated source. The license is permissive enough for bulk ingest, and the content is hands-on post-mortem and incident-management material. It should sit at the top of the ingestion order.
Low. Stripe rarely publishes formal post-mortems; the 2019-07-10 narrative (David Singleton) is the canonical one. Most Stripe incident lore lives in Increment magazine + USENIX talks. Est. ~10-20 k tok.
4 long-form posts + USENIX SRECon talk (Laura de Vesine). Est. ~25-40 k tok.
Quotability
excerpt.
PII risk
LOW — systemic and remarkably honest. Names a few engineers by role (Incident Commander) but not in a blame context.
Why valuable
One of the best multi-region cascade-failure post-mortems in print. The Cilium-CNI / systemd-networkd interaction story is canon-grade material on blast-radius reasoning.
1 canonical post (~3-5 k tok). Other Fastly post-mortems are sparse.
Quotability
excerpt.
PII risk
LOW.
Why
Textbook example of "valid customer config triggers latent bug → 85% network blast radius in 49 min". Pairs perfectly with the Cloudflare November 2025 post-mortem (config-induced cascade).
PyPI is part of PSF — likely PSF terms. Posts authored by Trail of Bits engineers carry implicit research-fair-use.
Volume
~5-10 named incident reports (Organization Team Privileges Apr-2025, LiteLLM/Telnyx supply-chain Apr-2026, several token-exfil events). Est. ~15-25 k tok.
Quotability
excerpt.
PII risk
MEDIUM — does name external reporters and (sometimes) attackers. Strip names.
Why
Distinct genre: supply-chain / registry post-mortems, not infra-outage. Important for teaching defense-in-depth philosophy.
Repo: not explicitly stated; community-contributed.
Volume
~70+ curated failure stories, each linking to external talk/blog/video. Est. ~100-200 k tok if you follow links and ingest the texts (not the index itself).
Quotability
as a seed list for crawling; not for direct text ingest.
hundreds of posts on incident analysis through a resilience-engineering lens. Est. ~200-400 k tok.
Quotability
excerpt only.
PII risk
LOW — academic-style analysis, no blame.
Why
One of the few sources that meta-analyzes post-mortems through Woods/Cook/Dekker frameworks. Distinct signal vs. raw vendor RCAs — teaches how to read a post-mortem.
Use sparingly — BY-NC-ND is the harshest license here. Excerpt for canon quotes (blameless-postmortem ch.15) only.
| Rank | Source | Notes |
|---|---|---|
| 11 | ★ Heroku post-mortems | Small but classic "unsafe auto-update" failure-mode stories. |
| 12 | ★ Discord engineering | Voice/session post-mortems; small but vivid. |
| 13 | ★ Increment archive | Dormant magazine; ingest once, mine for operational essays. |
| 14 | ★ Fastly Jun-2021 | Single canonical post; pair with Cloudflare config-cascade posts. |
| 15 | ★ Stripe 2019-07-10 | Sparse but the one extant Stripe RCA is genre-defining. |
| 16 | ★ PyPI incident reports | Different genre (supply-chain); essential for that axis. |
| 17 | ★ GitLab Handbook / 2017 db1 | Process + one classic post-mortem; CC-BY-SA. |
| 18 | k8s/community postmortems | Apache-2 but tiny (1-2 files). Include for completeness. |
| 19 | danluu/post-mortems repo + essay | Use as link-spine for crawl seeds; license unverified. |
| 20 | k8s.af + awesome-postmortem lists | Link-spines only. |
§5 PII / blame filter design
Tier C content fans out across genres with different PII profiles. Recommended pipeline (cheap → expensive):
1. Regex / NER first pass (covers the long tail).
   - Strip patterns matching `@[a-z0-9_-]+`, `<firstname> <lastname> from the X team`, `our engineer <name>`, and backticked GitHub handles (`` `[a-z0-9-]+` ``) only when adjacent to "engineer", "wrote", "deployed", "merged", "pushed", "approved".
   - Strip pager IDs and ticket IDs that carry no signal (e.g. SEV-12345, INC-99999) unless they are cross-referenced to a fix commit.
   - Use a small NER model (spaCy en_core_web_sm + custom ruleset) to mask PERSON entities unless they appear in author-byline metadata (which we want to preserve as provenance, separated from training tokens).
2. Heuristic blame-phrase stripping (small allow/deny phrase list).
   - Deny phrases (rewrite or strip the surrounding sentence): "human error", "the engineer who", "[name] forgot to", "[name] mistakenly", "should have caught", "failed to follow".
   - Allow phrases (keep; these are the learning signal): "the change", "the deploy", "the configuration", "the rollback procedure", "the runbook did not cover", "the alarm did not fire because", "blast radius", "fail small".
3. LLM-pass for ambiguous cases. Cheap model (Haiku-class), one-shot prompt:
   "Rewrite the following post-mortem excerpt to remove any individual attribution of fault while preserving every technical detail, timestamp, command, configuration value, and root-cause description. If the excerpt is already blameless, return it unchanged."
   Run only on chunks where the regex/NER pass flagged ≥1 candidate. Budget: ~5-10% of corpus.
4. Per-source policy overrides, encoded in the canon-tag schema:
   - AWS PES, Azure PIRs, Cloudflare, Datadog, AWS Builders' Library, Discord, SRE Book/Workbook → blameless-by-construction; skip step 3 to save spend.
   - GitHub blog, Heroku, Stripe → run full pipeline, modest risk of role-naming.
   - PyPI incident reports, surfingcomplexity, danluu's linked third-party post-mortems → highest risk, run full pipeline + manual spot-check on every chunk that mentions a non-Anthropic-affiliated individual.
5. Skip-section rules. Drop entire sections when:
   - the section header contains "Personnel" / "Roles" / "Who" without "Why".
   - the section is "Acknowledgements" / "Thanks to": useful metadata, but not training prose.
6. Provenance preserved separately: store author + URL + license in front-matter and strip them from the token stream. This is what the canon-tag schema is for.
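The cheap end of the pipeline above (regex masking, deny-phrase detection, and the gate in front of the LLM pass) might be sketched as follows. This is a minimal illustration under several assumptions. The function names, mask tokens, and source keys are hypothetical. The adjacency condition on handles and the fix-commit exception for ticket IDs are not implemented. The NER step is left out. Treating a deny-phrase hit as a flagged candidate for the LLM pass is also an assumption.

```python
import re

# Patterns adapted from the regex pass; the exact regexes are illustrative.
HANDLE_RE = re.compile(r"@[a-z0-9_-]+", re.IGNORECASE)
TICKET_RE = re.compile(r"\b(?:SEV|INC)-\d+\b")

# Deny phrases from the blame-phrase list. The "[name] ..." variants need NER
# and are omitted from this sketch.
DENY_PHRASES = ("human error", "the engineer who", "should have caught", "failed to follow")

# Hypothetical source keys for a few blameless-by-construction sources.
SKIP_LLM_SOURCES = {"aws-pes", "azure-pirs", "cloudflare", "datadog"}


def regex_pass(chunk: str) -> tuple[str, int]:
    """Mask handles and ticket IDs; return the masked text and the hit count."""
    masked, n_handles = HANDLE_RE.subn("[HANDLE]", chunk)
    masked, n_tickets = TICKET_RE.subn("[TICKET]", masked)
    return masked, n_handles + n_tickets


def blame_hits(chunk: str) -> list[str]:
    """Return the deny phrases that occur in the chunk (case-insensitive)."""
    lowered = chunk.lower()
    return [phrase for phrase in DENY_PHRASES if phrase in lowered]


def needs_llm_pass(source: str, n_masked: int, phrases: list[str]) -> bool:
    """Gate for the LLM pass: flagged chunks from non-exempt sources only."""
    if source in SKIP_LLM_SOURCES:
        return False
    return n_masked > 0 or bool(phrases)


chunk = "After SEV-12345 was opened, @jdoe merged the rollback."
masked, n = regex_pass(chunk)
print(masked)  # After [TICKET] was opened, [HANDLE] merged the rollback.
print(needs_llm_pass("github-blog", n, blame_hits(masked)))  # True
```

Keeping the gate as a separate function makes the per-source override a data change (editing `SKIP_LLM_SOURCES`) rather than a code change, which fits the idea of encoding policy in the canon-tag schema.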
§6 Token roll-up
| bucket | est. tokens | notes |
|---|---|---|
| Cloudflare post-mortem tag | 200-400 k | full ingest OK |
| Google SRE Workbook | 300-400 k | CC BY 4.0 |
| Google SRE Book (excerpts only) | 50-100 k after filtering | BY-NC-ND limits |
| Azure PIRs | 150-300 k | proprietary, fair-use |
| AWS Builders' Library | 400-700 k | proprietary, fair-use |
| AWS PES | 30-60 k | proprietary, fair-use |
| GitHub Availability + history | 80-150 k | proprietary, fair-use |
| Datadog Mar-2023 series | 25-40 k | proprietary, fair-use |
| Surfing Complexity | 200-400 k | personal blog, fair-use |
| GCP incidents.json | 100-200 k | structured |
| Heroku + Discord + Stripe + Fastly | 80-130 k combined | proprietary, fair-use |
| Increment Magazine | 300-500 k | proprietary, fair-use, dormant (stable) |
| PyPI blog | 15-25 k | PSF / fair-use |
| GitLab Handbook (incident sections) | 30-80 k | CC BY-SA |
| k8s postmortems + k8s.af + awesome-* | 5-15 k direct text | mostly link-spines |
| danluu essay + linked roundups | 10-30 k direct | + crawl seeds |
Total Tier C raw estimate: ~2.0 – 3.5 M tokens.
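As a sanity check on that total, the per-bucket ranges from the roll-up can be summed directly. The snippet below is a worked calculation only; the dictionary name `ROLLUP_K` is hypothetical and the figures are copied from the roll-up rows, in thousands of tokens.

```python
# (low, high) estimates in thousands of tokens, copied from the roll-up rows.
ROLLUP_K = {
    "Cloudflare post-mortem tag": (200, 400),
    "Google SRE Workbook": (300, 400),
    "Google SRE Book (excerpts only)": (50, 100),
    "Azure PIRs": (150, 300),
    "AWS Builders' Library": (400, 700),
    "AWS PES": (30, 60),
    "GitHub Availability + history": (80, 150),
    "Datadog Mar-2023 series": (25, 40),
    "Surfing Complexity": (200, 400),
    "GCP incidents.json": (100, 200),
    "Heroku + Discord + Stripe + Fastly": (80, 130),
    "Increment Magazine": (300, 500),
    "PyPI blog": (15, 25),
    "GitLab Handbook (incident sections)": (30, 80),
    "k8s postmortems + k8s.af + awesome-*": (5, 15),
    "danluu essay + linked roundups": (10, 30),
}

low_k = sum(low for low, _ in ROLLUP_K.values())
high_k = sum(high for _, high in ROLLUP_K.values())
print(low_k, high_k)  # prints 1975 3530
```

The sums come to roughly 1.98 M and 3.53 M tokens, consistent with the stated ~2.0 to 3.5 M range.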
Against Tier C's likely budget within the ~3 B-tok philosophy mix (Tiers A/B/D dominate by weight), 2.0-3.5 M raw tokens is already an oversupply. Recommendation: ingest the top 8-10 sources fully (in §4 order), then sample-quote the rest. With dedup + PII filter, expect ~1.0-1.5 M usable Tier C tokens.
§7 Action items back into source list
- Fix papers/coding-philosophy-sources.md §3:
  - replace bare "GitHub status post-mortems" with github.blog/tag/github-availability-report/.
  - replace bare "k8s.io postmortem repo" with github.com/kubernetes/community/tree/master/sig-cluster-lifecycle/postmortems and note volume is small.
  - remove "CC-BY, full" claim on danluu (license unverified).
  - add: SRE Workbook (CC BY 4.0) above the SRE Book.
- Add vendor-fair-use and commercial-risk-flagged canon-tags to schema.
- Pre-commit hook: every Tier C chunk must carry pii-pass: regex|llm|none provenance.
- Mark morethanseven.net claim as UNVERIFIED (could not locate).
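The pre-commit item above could be enforced with a small check along these lines. This is a sketch under assumptions, not a specified format: it assumes each chunk is a text file whose front-matter is delimited by `---` lines, that `pii-pass` takes exactly one of the three values, and that the names `pii_pass_value`, `check_chunk`, and `main` are free to choose.

```python
import re
from pathlib import Path
from typing import List, Optional

ALLOWED_PII_PASS = {"regex", "llm", "none"}
PII_PASS_RE = re.compile(r"^pii-pass:\s*(\S+)\s*$", re.MULTILINE)


def pii_pass_value(chunk_text: str) -> Optional[str]:
    """Return the pii-pass value from a '---' delimited front-matter block, if any."""
    if not chunk_text.startswith("---\n"):
        return None
    end = chunk_text.find("\n---", 4)
    if end == -1:
        return None  # front-matter never closes
    match = PII_PASS_RE.search(chunk_text[4:end])
    return match.group(1) if match else None


def check_chunk(chunk_text: str) -> bool:
    """True when the chunk carries one of the allowed pii-pass values."""
    return pii_pass_value(chunk_text) in ALLOWED_PII_PASS


def main(paths: List[str]) -> int:
    """Return a hook exit code: 0 if every file passes, 1 otherwise."""
    bad = [p for p in paths if not check_chunk(Path(p).read_text(encoding="utf-8"))]
    for path in bad:
        print(f"missing or invalid pii-pass: {path}")
    return 1 if bad else 0
```

Wired into a hook, the script would end with something like `sys.exit(main(sys.argv[1:]))`, since hook runners typically pass the staged file paths as arguments.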