This repo is a curated, machine-readable corpus of ML education: 923 Markdown
documents (~11M tokens) — research papers, course lecture transcripts, and
canonical explainer articles — each with structured YAML frontmatter. This file
tells an AI agent how to navigate and retrieve from it. (Read by Cursor, Codex,
Copilot, Gemini CLI, Aider, Zed, etc. Claude Code reads CLAUDE.md, which points
here.)
corpus/
├── papers/ 391 docs — arXiv papers (78 full text + 313 abstract+metadata). Filename = arXiv id (1706.03762.md)
├── youtube/ 474 docs — lecture transcripts, grouped by course/channel. Filename = video id
└── web/ 58 docs — explainer articles, grouped by domain
corpus/INDEX.md — master list of every doc with its title
atlas/ — human/topic navigation layer (Maps of Content per topic)
atlas/TAGS.md — the controlled tag vocabulary (authoritative)
Files are organized by source on disk, but tagged by topic in frontmatter — so retrieve by tag/content, not by folder.
Every doc starts with a YAML block. Fields you'll use for retrieval:
title: "Attention Is All You Need" # human title
source: "arxiv" | "youtube" | "web"
url: "http://arxiv.org/abs/1706.03762v7" # ALWAYS cite this
authors: [...] # papers
published:"2017-06-12" # papers (YYYY-MM-DD)
channel: "Stanford CS25" # lectures
topics: [...] # original free-form tags (papers/web only) — provenance, do not rely on
tags: [topic/transformers-attention, level/advanced, medium/paper, task/language, technique/attention]
aliases: ["Attention Is All You Need", ...]The tags: field is the controlled, queryable layer. Namespaces:
topic/ (17 subjects), level/ (intro·intermediate·advanced·frontier),
medium/ (paper·lecture·article), task/ (vision·language·…), technique/
(cnn·transformer·diffusion·lora-peft·…). See atlas/TAGS.md for the full list.
- Find candidates. Prefer full-text search over the corpus — it is the most
reliable signal (
grep -ril "flash attention" corpus/, or your tool's search). Narrow with tags when useful:grep -rl "topic/generative-models" corpus/. - Read the top 3–5 matches (read the frontmatter + relevant sections, not always the whole file — transcripts can be long).
- Answer grounded in those docs, and CITE each claim with the doc's
frontmatter
url(and title). Never invent references — every fact here is traceable to a real source. - If the user asks to learn a topic, use the relevant
atlas/topics/*.mdMap of Content as a reading path.
- Tags are auto-assigned (keyword rules + a content-reading pass). They're good for narrowing, but when a tag filter returns little, fall back to full-text search — don't conclude the corpus lacks the topic.
- The same lecture can carry multiple
topic/tags; that's intended. - 9 Karpathy videos appear under two folders (same content, same tags).
- This repo contains no original content — it's a reformatting of others' work. Always attribute the original author/source, never this repo.
neural-network-foundations · classical-ml · computer-vision ·
sequence-models-rnn · transformers-attention · language-models ·
efficient-architectures · generative-models · multimodal ·
reinforcement-learning · alignment-rlhf · reasoning-agents ·
efficiency-systems · interpretability · evaluation-trust ·
ml-engineering · ai-industry-news