Sources & Attribution

This corpus is a curation of publicly available educational material. Every piece of content belongs to its original creator. This file credits every source included. Please support these creators directly.

Courses & Lecture Series (YouTube transcripts)

Stanford University

CS229 — Machine Learning (Andrew Ng) — 20 lectures
CS230 — Deep Learning (Andrew Ng) — 9 lectures
CS231n — CNNs for Visual Recognition — 14 lectures
CS224n — NLP with Deep Learning — 46 lectures
CS25 — Transformers United — 39 lectures
CS236 — Deep Generative Models — 15 lectures
CS336 — Language Modeling from Scratch — 15 lectures

MIT

6.S191 — Introduction to Deep Learning — 86 lectures

Independent educators & organizations

Andrej Karpathy — main channel + Neural Networks: Zero to Hero — 25 lectures
3Blue1Brown — Neural Networks series (Grant Sanderson) — 9 videos
fast.ai — Practical Deep Learning for Coders (Jeremy Howard) — 48 lectures
DeepLearning.AI — 49 videos
Yannic Kilcher — paper walkthroughs & ML news — 99 videos

Transcripts are auto-generated captions reproduced for research/educational use. All teaching credit belongs to the instructors and their institutions. Please watch the originals and subscribe.

Web Articles & Blogs

Jay Alammar (jalammar.github.io) — The Illustrated Transformer; The Illustrated GPT-2; The Illustrated BERT/ELMo; The Illustrated Stable Diffusion; Visualizing Seq2seq with Attention
Lilian Weng (lilianweng.github.io) — Attention? Attention!; The Transformer Family v2.0; What are Diffusion Models?; From Autoencoder to Beta-VAE; From GAN to WGAN; Policy Gradient Algorithms; Prompt Engineering; LLM Powered Autonomous Agents
Sebastian Raschka (magazine.sebastianraschka.com) — Understanding Large Language Models; Understanding Reasoning LLMs; Improving LoRA (DoRA) from Scratch
Andrej Karpathy (karpathy.github.io / Medium) — A Recipe for Training Neural Networks; The Unreasonable Effectiveness of RNNs; Yes you should understand backprop
Stanford CS231n notes (cs231n.github.io) — course note pages
Dive into Deep Learning (d2l.ai) — selected chapters
Distill.pub — A Gentle Introduction to Graph Neural Networks; Understanding Convolutions on Graphs; The Building Blocks of Interpretability
Anthropic — Transformer Circuits (transformer-circuits.pub) — A Mathematical Framework for Transformer Circuits; Toy Models of Superposition

Research Papers — originally curated, full text (arXiv — 78)

All papers retain their arXiv ID and URL in the file frontmatter. They are distributed under arXiv's non-exclusive license terms; rights remain with the authors. A later update added 313 recent (2024H2–2026) papers and 20 web articles, listed under Recent additions at the end of this file; each is credited by arXiv ID or URL and authors in its own file frontmatter.

arXiv ID	Title
1207.0580	Improving neural networks by preventing co-adaptation of feature detectors (Dropout)
1301.3781	Efficient Estimation of Word Representations in Vector Space (word2vec)
1310.4546	Distributed Representations of Words and Phrases and their Compositionality
1312.6114	Auto-Encoding Variational Bayes (VAE)
1406.2661	Generative Adversarial Networks
1409.0473	Neural Machine Translation by Jointly Learning to Align and Translate
1409.1556	Very Deep Convolutional Networks (VGG)
1409.3215	Sequence to Sequence Learning with Neural Networks
1409.4842	Going Deeper with Convolutions (GoogLeNet)
1412.6980	Adam: A Method for Stochastic Optimization
1502.03167	Batch Normalization
1506.01497	Faster R-CNN
1512.03385	Deep Residual Learning for Image Recognition (ResNet)
1607.06450	Layer Normalization
1608.06993	Densely Connected Convolutional Networks (DenseNet)
1701.06538	Outrageously Large Neural Networks (Sparsely-Gated MoE)
1706.03762	Attention Is All You Need (Transformer)
1707.06347	Proximal Policy Optimization Algorithms (PPO)
1711.00937	Neural Discrete Representation Learning (VQ-VAE)
1804.02767	YOLOv3: An Incremental Improvement
1810.04805	BERT
1812.04948	A Style-Based Generator Architecture for GANs (StyleGAN)
1907.11692	RoBERTa
1910.10683	Exploring the Limits of Transfer Learning (T5)
1911.02150	Fast Transformer Decoding: One Write-Head is All You Need (MQA)
2001.04451	Reformer: The Efficient Transformer
2001.08361	Scaling Laws for Neural Language Models
2004.04906	Dense Passage Retrieval for Open-Domain QA
2004.05150	Longformer: The Long-Document Transformer
2005.11401	Retrieval-Augmented Generation for Knowledge-Intensive NLP (RAG)
2005.12872	End-to-End Object Detection with Transformers (DETR)
2005.14165	Language Models are Few-Shot Learners (GPT-3)
2006.04768	Linformer: Self-Attention with Linear Complexity
2006.11239	Denoising Diffusion Probabilistic Models (DDPM)
2009.03300	Measuring Massive Multitask Language Understanding (MMLU)
2009.14794	Rethinking Attention with Performers
2010.02502	Denoising Diffusion Implicit Models (DDIM)
2010.11929	An Image is Worth 16x16 Words (ViT)
2101.03961	Switch Transformers
2103.00020	Learning Transferable Visual Models From Natural Language Supervision (CLIP)
2104.09864	RoFormer: Rotary Position Embedding (RoPE)
2106.09685	LoRA: Low-Rank Adaptation of Large Language Models
2107.03374	Evaluating Large Language Models Trained on Code (Codex)
2108.12409	Train Short, Test Long: Attention with Linear Biases (ALiBi)
2111.06377	Masked Autoencoders Are Scalable Vision Learners (MAE)
2112.10752	High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
2201.11903	Chain-of-Thought Prompting
2203.02155	Training language models to follow instructions with human feedback (InstructGPT)
2203.11171	Self-Consistency Improves Chain of Thought Reasoning
2203.15556	Training Compute-Optimal Large Language Models (Chinchilla)
2204.06125	Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL·E 2)
2205.14135	FlashAttention
2206.04615	Beyond the Imitation Game (BIG-bench)
2208.07339	LLM.int8()
2210.03629	ReAct: Synergizing Reasoning and Acting in Language Models
2210.17323	GPTQ
2212.09748	Scalable Diffusion Models with Transformers (DiT)
2302.01318	Accelerating LLM Decoding with Speculative Sampling
2302.04761	Toolformer
2302.13971	LLaMA: Open and Efficient Foundation Language Models
2305.10601	Tree of Thoughts
2305.13245	GQA: Grouped-Query Attention
2305.14314	QLoRA
2305.18290	Direct Preference Optimization (DPO)
2306.00978	AWQ: Activation-aware Weight Quantization
2307.08691	FlashAttention-2
2307.09288	Llama 2
2309.06180	Efficient Memory Management for LLM Serving (PagedAttention / vLLM)
2310.06825	Mistral 7B
2312.00752	Mamba: Linear-Time Sequence Modeling with Selective State Spaces
2401.04088	Mixtral of Experts
2404.02258	Mixture-of-Depths
2405.04434	DeepSeek-V2
2405.21060	Transformers are SSMs (State Space Duality / Mamba-2)
2407.08608	FlashAttention-3
2412.19437	DeepSeek-V3 Technical Report
2501.12948	DeepSeek-R1
2502.11089	Native Sparse Attention

A note to creators

If you are the author of any included material and would like it modified or removed, please open an issue or contact the repository owner (see NOTICE.md). Requests will be honored promptly.

Recent additions (2024H2–2026)

These items were added in a later update to keep the library current: 313 arXiv papers (stored as verbatim abstract plus metadata; read the full paper at the linked source) and 20 web articles. Every item is credited by its source URL and authors in its own file frontmatter; this section lists them for attribution.

arXiv papers (313)

arXiv ID	Title
2408.07666	Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
2409.08239	Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
2409.11402	NVLM: Open Frontier-Class Multimodal LLMs
2409.16040	Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
2410.00037	Moshi: a speech-text foundation model for real-time dialogue
2410.02694	HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly
2410.05779	LightRAG: Simple and Fast Retrieval-Augmented Generation
2410.06293	Accelerated Preference Optimization for Large Language Model Alignment
2410.06885	F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
2410.10393	GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation
2410.10469	Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts
2410.12557	One Step Diffusion via Shortcut Models
2410.14949	On the Convergence and Straightness of Rectified Flow
2410.15595	A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
2410.16714	Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment
2410.20285	SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
2410.21357	Energy-Based Diffusion Language Models for Text Generation
2410.24164	π₀: A Vision-Language-Action Flow Model for General Robot Control
2411.04872	FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
2411.07975	JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
2411.13676	Hymba: A Hybrid-head Architecture for Small Language Models
2411.14347	DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
2411.15242	The Zamba2 Suite: Technical Report
2411.18674	Active Data Curation Effectively Distills Large-Scale Multimodal Models
2412.03555	PaliGemma 2: A Family of Versatile VLMs for Transfer
2412.03603	HunyuanVideo: A Systematic Framework For Large Video Generative Models
2412.04984	Frontier Models are Capable of In-context Scheming
2412.06464	Gated Delta Networks: Improving Mamba2 with Delta Rule
2412.08905	Phi-4 Technical Report
2412.10117	CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
2412.14093	Alignment Faking in Large Language Models
2412.15115	Qwen2.5 Technical Report
2412.16441	Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees
2412.16906	Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
2412.19048	Jasper and Stella: distillation of SOTA embedding models
2501.00656	2 OLMo 2 Furious
2501.00663	Titans: Learning to Memorize at Test Time
2501.03575	Cosmos World Foundation Model Platform for Physical AI
2501.06322	Multi-Agent Collaboration Mechanisms: A Survey of LLMs
2501.07278	Lifelong Learning of Large Language Model based Agents: A Roadmap
2501.08313	MiniMax-01: Scaling Foundation Models with Lightning Attention
2501.09136	Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
2501.11873	Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
2501.12273	Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
2501.15103	Each Rank Could be an Expert: Single-Ranked Mixture of Experts LoRA for Multi-Task Learning
2501.15383	Qwen2.5-1M Technical Report
2501.17116	Optimizing Large Language Model Training Using FP4 Quantization
2501.17315	A sketch of an AI control safety case
2501.17811	Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
2501.18823	Transcoders Beat Sparse Autoencoders for Interpretability
2501.19393	s1: Simple Test-Time Scaling
2502.00883	SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
2502.01113	GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation
2502.01636	Lifelong Knowledge Editing requires Better Regularization
2502.02672	Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes
2502.02737	SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model
2502.05171	Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
2502.05172	Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
2502.05564	TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
2502.06766	Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
2502.07272	GENERator: A Long-Context Generative Genomic Foundation Model
2502.07640	Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
2502.07864	TransMLA: Multi-Head Latent Attention Is All You Need
2502.08606	Distillation Scaling Laws
2502.09638	Jailbreaking to Jailbreak
2502.09992	Large Language Diffusion Models
2502.10248	Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
2502.10297	DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
2502.10436	MERGE³: Efficient Evolutionary Merging on Consumer-grade GPUs
2502.12118	Scaling Test-Time Compute Without Verification or RL is Suboptimal
2502.12853	S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
2502.13178	Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis
2502.13189	MoBA: Mixture of Block Attention for Long-Context LLMs
2502.13595	MMTEB: Massive Multilingual Text Embedding Benchmark
2502.13923	Qwen2.5-VL Technical Report
2502.14420	ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
2502.14837	Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
2502.15304	SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention
2502.15592	Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning
2502.15681	One-step Diffusion Models with f-Divergence Distribution Matching
2502.15828	A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models
2502.16894	Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment (GOAT)
2502.16982	Muon is Scalable for LLM Training
2502.17421	LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
2502.17521	Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
2502.18418	Rank1: Test-Time Compute for Reranking in Information Retrieval
2502.19645	Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
2502.20082	LongRoPE2: Near-Lossless LLM Context Window Scaling
2502.21321	LLM Post-Training: A Deep Dive into Reasoning Large Language Models
2503.00030	RSPO: Regularized Self-Play Alignment of Large Language Models
2503.01743	Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
2503.01840	EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
2503.01854	A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models
2503.03746	Process-based Self-Rewarding Language Models
2503.06639	Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification
2503.08099	Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors
2503.09532	SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
2503.09573	Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
2503.09642	Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
2503.10677	A Survey on Knowledge-Oriented Retrieval-Augmented Generation
2503.11251	Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
2503.12434	A Survey on the Optimization of Large Language Model-based Agents
2503.13436	Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
2503.14456	RWKV-7 "Goose" with Expressive Dynamic State Evolution
2503.14476	DAPO: An Open-Source LLM Reinforcement Learning System at Scale
2503.14734	GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
2503.18102	AgentRxiv: Towards Collaborative Autonomous Research
2503.18893	xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
2503.18970	Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba
2503.19551	Scaling Laws of Synthetic Data for Language Models
2503.19786	Gemma 3 Technical Report
2503.20018	Experience Replay Addresses Loss of Plasticity in Continual Learning
2503.20020	Gemini Robotics: Bringing AI into the Physical World
2503.20215	Qwen2.5-Omni Technical Report
2503.20314	Wan: Open and Advanced Large-Scale Video Generative Models
2503.21322	HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation
2503.21614	A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
2503.23278	Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
2504.00254	ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
2504.00891	GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
2504.04011	Foundation Models for Time Series: A Survey
2504.04423	UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
2504.05118	VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
2504.05352	Achieving Binary Weight and Activation for LLMs using Post-Training Quantization
2504.07164	R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents
2504.08247	Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner
2504.08528	On The Landscape of Spoken Language Models: A Comprehensive Survey
2504.10479	InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
2504.10612	Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
2504.11343	A Minimalist Approach to LLM Reasoning: from Rejection Sampling to REINFORCE
2504.11354	Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
2504.12216	d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
2504.12285	BitNet b1.58 2B4T Technical Report
2504.12637	Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation
2504.13837	Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
2504.15573	Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
2504.16054	π₀.₅: A Vision-Language-Action Model with Open-World Generalization
2504.16084	TTRL: Test-Time Reinforcement Learning
2504.16828	Process Reward Models That Think
2504.18415	BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
2504.20571	Reinforcement Learning for Reasoning in Large Language Models with One Training Example
2504.21233	Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
2504.21318	Phi-4-reasoning Technical Report
2504.21463	RWKV-X: A Linear Complexity Hybrid Language Model
2504.21801	DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition
2505.01420	Evaluating Frontier Models for Stealth and Situational Awareness
2505.02567	Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
2505.02665	A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law
2505.08827	RLSR: Reinforcement Learning from Self Reward
2505.09388	Qwen3 Technical Report
2505.11831	ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
2505.12435	SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment
2505.13447	Mean Flows for One-step Generative Modeling
2505.14357	Vid2World: Crafting Video Diffusion Models to Interactive World Models
2505.14415	Table Foundation Models: on knowledge pre-training for tabular learning
2505.14432	Rank-K: Test-Time Reasoning for Listwise Reranking
2505.14683	Emerging Properties in Unified Multimodal Pretraining
2505.15116	Graph Foundation Models: A Comprehensive Survey
2505.16324	From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation
2505.16831	Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
2505.16933	LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
2505.16944	AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
2505.18774	Disentangling Knowledge Representations for Large Language Model Editing
2505.19115	FP4 All the Way: Fully Quantized Training of LLMs
2505.19770	Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
2505.20003	TabPFN: One Model to Rule Them All?
2505.20171	Long-Context State-Space Video World Models
2505.20347	SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data
2505.21444	Can Large Reasoning Models Self-Train?
2505.21996	VRAG: Learning World Models for Interactive Video Generation
2505.22179	Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
2505.22323	Advancing Expert Specialization for Better MoE
2505.22560	Geometric Hyena Networks for Large-scale Equivariant Learning
2505.22922	Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking
2505.23884	Test-Time Training Done Right
2506.00045	ACE-Step: A Step Towards Music Generation Foundation Model
2506.00054	Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers
2506.00477	Flashbacks to Harmonize Stability and Plasticity in Continual Learning
2506.01963	Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons
2506.02096	SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
2506.03320	The Future of Continual Learning in the Era of Foundation Models: Three Key Directions
2506.03951	Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective
2506.05176	Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
2506.05584	TabFlex: Scaling Tabular Learning to Millions with Linear Attention
2506.09227	SoK: Machine Unlearning for Large Language Models
2506.09985	V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
2506.11687	Differential Privacy in Machine Learning: A Survey from Symbolic AI to LLMs
2506.12286	The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
2506.12928	Scaling Test-time Compute for LLM Agents
2506.14098	Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks
2506.14245	Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
2506.15742	FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
2506.17298	Mercury: Ultra-Fast Language Models Based on Diffusion
2506.17671	TPTT: Transforming Pretrained Transformers into Titans
2506.20743	A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools
2506.21328	Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
2506.23589	Transition Matching: Scalable and Flexible Generative Modeling
2507.02076	Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs
2507.04771	Efficient Unlearning with Privacy Guarantees
2507.06457	A Systematic Analysis of Hybrid Linear Attention
2507.09404	Scaling Laws for Optimal Data Mixtures
2507.10085	Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning
2507.11005	AdaMuon: Adaptive Muon Optimizer
2507.15855	Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
2507.17702	Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
2507.17801	Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
2507.20198	A Survey of Token Compression for Efficient Multimodal Large Language Models
2508.03613	Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
2508.06743	Analysis of Schedule-Free Nonconvex Optimization
2508.06924	AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
2508.10104	DINOv3: Self-supervised learning for vision at unprecedented scale
2508.13730	On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions
2508.18265	InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
2509.00691	CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
2509.01440	Benchmarking Optimizers for Large Language Model Pretraining
2509.02547	The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
2509.03378	Understanding and Improving Shampoo and SOAP via Kullback–Leibler Minimization
2509.04474	Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
2509.06457	Seasonal forecasting using the GenCast probabilistic machine learning model
2509.09679	ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
2509.09734	MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools
2509.12539	LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
2509.12892	Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
2509.16941	SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
2509.21318	SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
2509.23045	Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
2509.23314	Two-Scale Latent Dynamics for Recurrent-Depth Transformers
2509.23661	LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
2509.23678	Towards a Comprehensive Scaling Law of Mixture-of-Experts
2509.23933	Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms
2509.24389	LLaDA-MoE: A Sparse MoE Diffusion Language Model
2509.24510	Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models
2509.24526	CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models
2509.25127	Score Distillation of Flow Matching Models
2509.25373	From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
2510.00742	How Foundational are Foundation Models for Time Series Forecasting?
2510.01631	Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
2510.02259	Transformers Discover Molecular Structure Without Graph Priors
2510.02300	Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
2510.02917	Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders
2510.03313	Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining
2510.03342	Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer
2510.03567	Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
2510.04147	Self Speculative Decoding for Diffusion Large Language Models
2510.05364	The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
2510.05491	NorMuon: Making Muon more efficient and scalable
2510.09586	Vision Language Models: A Survey of 26K Papers
2510.10223	You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
2510.13003	OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning
2510.13169	Universally Invariant Learning in Equivariant GNNs
2510.13721	NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
2510.15821	Chronos-2: From Univariate to Universal Forecasting
2510.17896	Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
2510.18471	CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
2510.21204	Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models
2510.22733	E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
2510.27072	Towards Understanding Self-play for LLM Reasoning
2511.00040	Semi-Supervised Preference Optimization with Limited Feedback
2511.01695	Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
2511.01815	KV Cache Transform Coding for Compact Storage in LLM Inference
2511.03690	The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents
2511.07328	Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
2511.09057	PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
2511.11698	Moirai 2.0: When Less Is More for Time Series Forecasting
2511.11707	FSC-Net: Fast-Slow Consolidation Networks for Continual Learning
2511.12181	MixAR: Mixture Autoregressive Image Generation
2511.12347	VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
2511.15375	Parameter Importance-Driven Continual Learning for Foundation Models
2511.15992	Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis
2511.18397	Natural Emergent Misalignment from Reward Hacking in Production RL
2511.18936	SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
2511.21437	A Systematic Study of In-the-Wild Model Merging for Large Language Models
2511.22009	StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation
2511.22570	DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
2511.22699	Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
2512.04268	The Initialization Determines Whether In-Context Learning Is Gradient Descent
2512.05084	Gradient Descent with Provably Tuned Learning-rate Schedules
2512.05534	A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
2512.05817	Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws
2512.05916	KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
2512.10858	Scaling Behavior of Discrete Diffusion Language Models
2512.15657	SoFlow: Solution Flow Models for One-Step Generative Modeling
2512.18470	SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
2512.20957	One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents
2512.23675	End-to-End Test-Time Training for Long Context
2601.03774	Scalable Machine Learning Force Fields for Macromolecular Systems Through Long-Range Aware Message Passing
2601.04823	DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models
2601.10904	ARC Prize 2025: Technical Report
2601.12560	Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents
2601.22156	Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
2602.01357	Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning
2602.02571	Trajectory Consistency for One-Step Generation on Euler Mean Flows
2602.03442	A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
2602.04768	Billion-Scale Graph Foundation Models
2602.11139	TabICLv2: A better, faster, scalable, and open tabular foundation model
2602.20117	ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models
2603.01639	Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
2603.03597	NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training
2603.05168	Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
2603.09938	Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions
2603.12658	Continual Learning in Large Language Models: Methods, Challenges, and Opportunities
2603.13372	The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning
2603.15569	Mamba-3: Improved Sequence Modeling using State Space Principles
2603.25248	ColBERT-Att: Late-Interaction Meets Attention for Enhanced Retrieval
2604.01411	Test-Time Scaling Makes Overtraining Compute-Optimal
2604.07615	ADAG: Automatically Describing Attribution Graphs
2604.08178	Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
2604.19089	Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
2604.20329	Image Generators are Generalist Vision Learners
2604.24618	Evaluating whether AI models would sabotage AI safety research
2605.06676	LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
2605.22791	Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
2605.25979	LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Web articles (20)

Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad — deepmind.google — https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/
Bring your ideas to life: Veo 2 video generation available for developers — developers.googleblog.com — https://developers.googleblog.com/veo-2-video-generation-now-generally-available/
Chai-1: Decoding the molecular interactions of life — biorxiv.org — https://www.biorxiv.org/content/10.1101/2024.10.10.615955v1
Circuit Tracing: Revealing Computational Graphs in Language Models — transformer-circuits.pub — https://transformer-circuits.pub/2025/attribution-graphs/methods.html
From U-Nets to DiTs: The Architectural Evolution of Text-to-Image Diffusion Models (2021–2025) — iclr-blogposts.github.io — https://iclr-blogposts.github.io/2026/blog/2026/diffusion-architecture-evolution/
Gemini Diffusion — deepmind.google — https://deepmind.google/models/gemini-diffusion/
Genie 2: A Large-Scale Foundation World Model — deepmind.google — https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/
Genie 3: A New Frontier for World Models — deepmind.google — https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/
Genome modeling and design across all domains of life with Evo 2 — biorxiv.org — https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1
Mochi 1: A new SOTA in open text-to-video — genmo.ai — https://www.genmo.ai/blog/mochi-1-a-new-sota-in-open-text-to-video
On the Biology of a Large Language Model — transformer-circuits.pub — https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Reward Hacking in Reinforcement Learning — lilianweng.github.io — https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
Simulating 500 million years of evolution with a language model — biorxiv.org — https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1
SmolLM3: smol, multilingual, long-context reasoner — huggingface.co — https://huggingface.co/blog/smollm3
The State Of LLMs 2025: Progress, Problems, and Predictions — magazine.sebastianraschka.com — https://magazine.sebastianraschka.com/p/state-of-llms-2025
The State of Reinforcement Learning for LLM Reasoning — magazine.sebastianraschka.com — https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training
TimesFM 2.5: Smaller, Longer-Context Foundation Model Leading GIFT-Eval — huggingface.co — https://huggingface.co/google/timesfm-2.5-200m-pytorch
Titans + MIRAS: Helping AI have long-term memory — research.google — https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/
Why We Think — lilianweng.github.io — https://lilianweng.github.io/posts/2025-05-01-thinking/
π₀ and π₀-FAST: Vision-Language-Action Models for General Robot Control — huggingface.co — https://huggingface.co/blog/pi0
