This corpus is a curated collection of publicly available educational material. Every piece of content belongs to its original creator, and this file credits every source included. Please support these creators directly.
- CS229 — Machine Learning (Andrew Ng) — 20 lectures
- CS230 — Deep Learning (Andrew Ng) — 9 lectures
- CS231n — CNNs for Visual Recognition — 14 lectures
- CS224n — NLP with Deep Learning — 46 lectures
- CS25 — Transformers United — 39 lectures
- CS236 — Deep Generative Models — 15 lectures
- CS336 — Language Modeling from Scratch — 15 lectures
- 6.S191 — Introduction to Deep Learning — 86 lectures
- Andrej Karpathy — main channel + Neural Networks: Zero to Hero — 25 lectures
- 3Blue1Brown — Neural Networks series (Grant Sanderson) — 9 videos
- fast.ai — Practical Deep Learning for Coders (Jeremy Howard) — 48 lectures
- DeepLearning.AI — 49 videos
- Yannic Kilcher — paper walkthroughs & ML news — 99 videos
Transcripts are auto-generated captions, reproduced for research and educational use. All teaching credit belongs to the instructors and their institutions. Please watch the originals and subscribe.
- Jay Alammar (jalammar.github.io) — The Illustrated Transformer; The Illustrated GPT-2; The Illustrated BERT/ELMo; The Illustrated Stable Diffusion; Visualizing Seq2seq with Attention
- Lilian Weng (lilianweng.github.io) — Attention? Attention!; The Transformer Family v2.0; What are Diffusion Models?; From Autoencoder to Beta-VAE; From GAN to WGAN; Policy Gradient Algorithms; Prompt Engineering; LLM Powered Autonomous Agents
- Sebastian Raschka (magazine.sebastianraschka.com) — Understanding Large Language Models; Understanding Reasoning LLMs; Improving LoRA (DoRA) from Scratch
- Andrej Karpathy (karpathy.github.io / medium) — A Recipe for Training Neural Networks; The Unreasonable Effectiveness of RNNs; Yes you should understand backprop
- Stanford CS231n notes (cs231n.github.io) — course note pages
- Dive into Deep Learning (d2l.ai) — selected chapters
- Distill.pub — A Gentle Introduction to Graph Neural Networks; Understanding Convolutions on Graphs; The Building Blocks of Interpretability
- Anthropic — Transformer Circuits (transformer-circuits.pub) — A Mathematical Framework for Transformer Circuits; Toy Models of Superposition
All papers retain their arXiv ID and URL in the file frontmatter. They are distributed under arXiv's non-exclusive license terms; rights remain with the authors. A further 313 recent papers (2024 H2 to 2026) and 20 web articles were added later; see Recent additions at the end of this file. Each is credited by arXiv ID or URL and by its authors in its own file frontmatter.
| arXiv ID | Title |
|---|---|
| 1207.0580 | Improving neural networks by preventing co-adaptation of feature detectors (Dropout) |
| 1301.3781 | Efficient Estimation of Word Representations in Vector Space (word2vec) |
| 1310.4546 | Distributed Representations of Words and Phrases and their Compositionality |
| 1312.6114 | Auto-Encoding Variational Bayes (VAE) |
| 1406.2661 | Generative Adversarial Networks |
| 1409.0473 | Neural Machine Translation by Jointly Learning to Align and Translate |
| 1409.1556 | Very Deep Convolutional Networks (VGG) |
| 1409.3215 | Sequence to Sequence Learning with Neural Networks |
| 1409.4842 | Going Deeper with Convolutions (GoogLeNet) |
| 1412.6980 | Adam: A Method for Stochastic Optimization |
| 1502.03167 | Batch Normalization |
| 1506.01497 | Faster R-CNN |
| 1512.03385 | Deep Residual Learning for Image Recognition (ResNet) |
| 1607.06450 | Layer Normalization |
| 1608.06993 | Densely Connected Convolutional Networks (DenseNet) |
| 1701.06538 | Outrageously Large Neural Networks (Sparsely-Gated MoE) |
| 1706.03762 | Attention Is All You Need (Transformer) |
| 1707.06347 | Proximal Policy Optimization Algorithms (PPO) |
| 1711.00937 | Neural Discrete Representation Learning (VQ-VAE) |
| 1804.02767 | YOLOv3: An Incremental Improvement |
| 1810.04805 | BERT |
| 1812.04948 | A Style-Based Generator Architecture for GANs (StyleGAN) |
| 1907.11692 | RoBERTa |
| 1910.10683 | Exploring the Limits of Transfer Learning (T5) |
| 1911.02150 | Fast Transformer Decoding: One Write-Head is All You Need (MQA) |
| 2001.04451 | Reformer: The Efficient Transformer |
| 2001.08361 | Scaling Laws for Neural Language Models |
| 2004.04906 | Dense Passage Retrieval for Open-Domain QA |
| 2004.05150 | Longformer: The Long-Document Transformer |
| 2005.11401 | Retrieval-Augmented Generation for Knowledge-Intensive NLP (RAG) |
| 2005.12872 | End-to-End Object Detection with Transformers (DETR) |
| 2005.14165 | Language Models are Few-Shot Learners (GPT-3) |
| 2006.04768 | Linformer: Self-Attention with Linear Complexity |
| 2006.11239 | Denoising Diffusion Probabilistic Models (DDPM) |
| 2009.03300 | Measuring Massive Multitask Language Understanding (MMLU) |
| 2009.14794 | Rethinking Attention with Performers |
| 2010.02502 | Denoising Diffusion Implicit Models (DDIM) |
| 2010.11929 | An Image is Worth 16x16 Words (ViT) |
| 2101.03961 | Switch Transformers |
| 2103.00020 | Learning Transferable Visual Models From Natural Language Supervision (CLIP) |
| 2104.09864 | RoFormer: Rotary Position Embedding (RoPE) |
| 2106.09685 | LoRA: Low-Rank Adaptation of Large Language Models |
| 2107.03374 | Evaluating Large Language Models Trained on Code (Codex) |
| 2108.12409 | Train Short, Test Long: Attention with Linear Biases (ALiBi) |
| 2111.06377 | Masked Autoencoders Are Scalable Vision Learners (MAE) |
| 2112.10752 | High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion) |
| 2201.11903 | Chain-of-Thought Prompting |
| 2203.02155 | Training language models to follow instructions with human feedback (InstructGPT) |
| 2203.11171 | Self-Consistency Improves Chain of Thought Reasoning |
| 2203.15556 | Training Compute-Optimal Large Language Models (Chinchilla) |
| 2204.06125 | Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL·E 2) |
| 2205.14135 | FlashAttention |
| 2206.04615 | Beyond the Imitation Game (BIG-bench) |
| 2208.07339 | LLM.int8() |
| 2210.03629 | ReAct: Synergizing Reasoning and Acting in Language Models |
| 2210.17323 | GPTQ |
| 2212.09748 | Scalable Diffusion Models with Transformers (DiT) |
| 2302.01318 | Accelerating LLM Decoding with Speculative Sampling |
| 2302.04761 | Toolformer |
| 2302.13971 | LLaMA: Open and Efficient Foundation Language Models |
| 2305.10601 | Tree of Thoughts |
| 2305.13245 | GQA: Grouped-Query Attention |
| 2305.14314 | QLoRA |
| 2305.18290 | Direct Preference Optimization (DPO) |
| 2306.00978 | AWQ: Activation-aware Weight Quantization |
| 2307.08691 | FlashAttention-2 |
| 2307.09288 | Llama 2 |
| 2309.06180 | Efficient Memory Management for LLM Serving (PagedAttention / vLLM) |
| 2310.06825 | Mistral 7B |
| 2312.00752 | Mamba: Linear-Time Sequence Modeling with Selective State Spaces |
| 2401.04088 | Mixtral of Experts |
| 2404.02258 | Mixture-of-Depths |
| 2405.04434 | DeepSeek-V2 |
| 2405.21060 | Transformers are SSMs (State Space Duality / Mamba-2) |
| 2407.08608 | FlashAttention-3 |
| 2412.19437 | DeepSeek-V3 Technical Report |
| 2501.12948 | DeepSeek-R1 |
| 2502.11089 | Native Sparse Attention |
If you are the author of any included material and would like it modified or removed, please open an issue or contact the repository owner. See NOTICE.md. Requests will be honored promptly.
The items below were added in a later update to keep the library current: 313 arXiv papers (each stored as a verbatim abstract plus metadata; read the full paper at the linked source) and 20 web articles. Every item is credited by its source URL and authors in its own file frontmatter; this section lists them for attribution.
| arXiv ID | Title |
|---|---|
| 2408.07666 | Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities |
| 2409.08239 | Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources |
| 2409.11402 | NVLM: Open Frontier-Class Multimodal LLMs |
| 2409.16040 | Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts |
| 2410.00037 | Moshi: a speech-text foundation model for real-time dialogue |
| 2410.02694 | HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly |
| 2410.05779 | LightRAG: Simple and Fast Retrieval-Augmented Generation |
| 2410.06293 | Accelerated Preference Optimization for Large Language Model Alignment |
| 2410.06885 | F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching |
| 2410.10393 | GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation |
| 2410.10469 | Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts |
| 2410.12557 | One Step Diffusion via Shortcut Models |
| 2410.14949 | On the Convergence and Straightness of Rectified Flow |
| 2410.15595 | A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications |
| 2410.16714 | Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment |
| 2410.20285 | SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement |
| 2410.21357 | Energy-Based Diffusion Language Models for Text Generation |
| 2410.24164 | π₀: A Vision-Language-Action Flow Model for General Robot Control |
| 2411.04872 | FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI |
| 2411.07975 | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation |
| 2411.13676 | Hymba: A Hybrid-head Architecture for Small Language Models |
| 2411.14347 | DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding |
| 2411.15242 | The Zamba2 Suite: Technical Report |
| 2411.18674 | Active Data Curation Effectively Distills Large-Scale Multimodal Models |
| 2412.03555 | PaliGemma 2: A Family of Versatile VLMs for Transfer |
| 2412.03603 | HunyuanVideo: A Systematic Framework For Large Video Generative Models |
| 2412.04984 | Frontier Models are Capable of In-context Scheming |
| 2412.06464 | Gated Delta Networks: Improving Mamba2 with Delta Rule |
| 2412.08905 | Phi-4 Technical Report |
| 2412.10117 | CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models |
| 2412.14093 | Alignment Faking in Large Language Models |
| 2412.15115 | Qwen2.5 Technical Report |
| 2412.16441 | Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees |
| 2412.16906 | Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation |
| 2412.19048 | Jasper and Stella: distillation of SOTA embedding models |
| 2501.00656 | 2 OLMo 2 Furious |
| 2501.00663 | Titans: Learning to Memorize at Test Time |
| 2501.03575 | Cosmos World Foundation Model Platform for Physical AI |
| 2501.06322 | Multi-Agent Collaboration Mechanisms: A Survey of LLMs |
| 2501.07278 | Lifelong Learning of Large Language Model based Agents: A Roadmap |
| 2501.08313 | MiniMax-01: Scaling Foundation Models with Lightning Attention |
| 2501.09136 | Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG |
| 2501.11873 | Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models |
| 2501.12273 | Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement |
| 2501.15103 | Each Rank Could be an Expert: Single-Ranked Mixture of Experts LoRA for Multi-Task Learning |
| 2501.15383 | Qwen2.5-1M Technical Report |
| 2501.17116 | Optimizing Large Language Model Training Using FP4 Quantization |
| 2501.17315 | A sketch of an AI control safety case |
| 2501.17811 | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling |
| 2501.18823 | Transcoders Beat Sparse Autoencoders for Interpretability |
| 2501.19393 | s1: Simple Test-Time Scaling |
| 2502.00883 | SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters |
| 2502.01113 | GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation |
| 2502.01636 | Lifelong Knowledge Editing requires Better Regularization |
| 2502.02672 | Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes |
| 2502.02737 | SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model |
| 2502.05171 | Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach |
| 2502.05172 | Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient |
| 2502.05564 | TabICL: A Tabular Foundation Model for In-Context Learning on Large Data |
| 2502.06766 | Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs |
| 2502.07272 | GENERator: A Long-Context Generative Genomic Foundation Model |
| 2502.07640 | Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving |
| 2502.07864 | TransMLA: Multi-Head Latent Attention Is All You Need |
| 2502.08606 | Distillation Scaling Laws |
| 2502.09638 | Jailbreaking to Jailbreak |
| 2502.09992 | Large Language Diffusion Models |
| 2502.10248 | Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model |
| 2502.10297 | DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products |
| 2502.10436 | MERGE³: Efficient Evolutionary Merging on Consumer-grade GPUs |
| 2502.12118 | Scaling Test-Time Compute Without Verification or RL is Suboptimal |
| 2502.12853 | S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning |
| 2502.13178 | Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis |
| 2502.13189 | MoBA: Mixture of Block Attention for Long-Context LLMs |
| 2502.13595 | MMTEB: Massive Multilingual Text Embedding Benchmark |
| 2502.13923 | Qwen2.5-VL Technical Report |
| 2502.14420 | ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model |
| 2502.14837 | Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs |
| 2502.15304 | SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention |
| 2502.15592 | Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning |
| 2502.15681 | One-step Diffusion Models with f-Divergence Distribution Matching |
| 2502.15828 | A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models |
| 2502.16894 | Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment (GOAT) |
| 2502.16982 | Muon is Scalable for LLM Training |
| 2502.17421 | LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification |
| 2502.17521 | Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation |
| 2502.18418 | Rank1: Test-Time Compute for Reranking in Information Retrieval |
| 2502.19645 | Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success |
| 2502.20082 | LongRoPE2: Near-Lossless LLM Context Window Scaling |
| 2502.21321 | LLM Post-Training: A Deep Dive into Reasoning Large Language Models |
| 2503.00030 | RSPO: Regularized Self-Play Alignment of Large Language Models |
| 2503.01743 | Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs |
| 2503.01840 | EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test |
| 2503.01854 | A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models |
| 2503.03746 | Process-based Self-Rewarding Language Models |
| 2503.06639 | Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification |
| 2503.08099 | Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors |
| 2503.09532 | SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability |
| 2503.09573 | Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models |
| 2503.09642 | Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k |
| 2503.10677 | A Survey on Knowledge-Oriented Retrieval-Augmented Generation |
| 2503.11251 | Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model |
| 2503.12434 | A Survey on the Optimization of Large Language Model-based Agents |
| 2503.13436 | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens |
| 2503.14456 | RWKV-7 "Goose" with Expressive Dynamic State Evolution |
| 2503.14476 | DAPO: An Open-Source LLM Reinforcement Learning System at Scale |
| 2503.14734 | GR00T N1: An Open Foundation Model for Generalist Humanoid Robots |
| 2503.18102 | AgentRxiv: Towards Collaborative Autonomous Research |
| 2503.18893 | xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction |
| 2503.18970 | Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba |
| 2503.19551 | Scaling Laws of Synthetic Data for Language Models |
| 2503.19786 | Gemma 3 Technical Report |
| 2503.20018 | Experience Replay Addresses Loss of Plasticity in Continual Learning |
| 2503.20020 | Gemini Robotics: Bringing AI into the Physical World |
| 2503.20215 | Qwen2.5-Omni Technical Report |
| 2503.20314 | Wan: Open and Advanced Large-Scale Video Generative Models |
| 2503.21322 | HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation |
| 2503.21614 | A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond |
| 2503.23278 | Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions |
| 2504.00254 | ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning |
| 2504.00891 | GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning |
| 2504.04011 | Foundation Models for Time Series: A Survey |
| 2504.04423 | UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding |
| 2504.05118 | VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks |
| 2504.05352 | Achieving Binary Weight and Activation for LLMs using Post-Training Quantization |
| 2504.07164 | R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents |
| 2504.08247 | Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner |
| 2504.08528 | On The Landscape of Spoken Language Models: A Comprehensive Survey |
| 2504.10479 | InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models |
| 2504.10612 | Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling |
| 2504.11343 | A Minimalist Approach to LLM Reasoning: from Rejection Sampling to REINFORCE |
| 2504.11354 | Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning |
| 2504.12216 | d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning |
| 2504.12285 | BitNet b1.58 2B4T Technical Report |
| 2504.12637 | Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation |
| 2504.13837 | Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? |
| 2504.15573 | Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction |
| 2504.16054 | π₀.₅: A Vision-Language-Action Model with Open-World Generalization |
| 2504.16084 | TTRL: Test-Time Reinforcement Learning |
| 2504.16828 | Process Reward Models That Think |
| 2504.18415 | BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs |
| 2504.20571 | Reinforcement Learning for Reasoning in Large Language Models with One Training Example |
| 2504.21233 | Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math |
| 2504.21318 | Phi-4-reasoning Technical Report |
| 2504.21463 | RWKV-X: A Linear Complexity Hybrid Language Model |
| 2504.21801 | DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition |
| 2505.01420 | Evaluating Frontier Models for Stealth and Situational Awareness |
| 2505.02567 | Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities |
| 2505.02665 | A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law |
| 2505.08827 | RLSR: Reinforcement Learning from Self Reward |
| 2505.09388 | Qwen3 Technical Report |
| 2505.11831 | ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems |
| 2505.12435 | SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment |
| 2505.13447 | Mean Flows for One-step Generative Modeling |
| 2505.14357 | Vid2World: Crafting Video Diffusion Models to Interactive World Models |
| 2505.14415 | Table Foundation Models: on knowledge pre-training for tabular learning |
| 2505.14432 | Rank-K: Test-Time Reasoning for Listwise Reranking |
| 2505.14683 | Emerging Properties in Unified Multimodal Pretraining |
| 2505.15116 | Graph Foundation Models: A Comprehensive Survey |
| 2505.16324 | From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation |
| 2505.16831 | Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs |
| 2505.16933 | LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning |
| 2505.16944 | AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios |
| 2505.18774 | Disentangling Knowledge Representations for Large Language Model Editing |
| 2505.19115 | FP4 All the Way: Fully Quantized Training of LLMs |
| 2505.19770 | Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO |
| 2505.20003 | TabPFN: One Model to Rule Them All? |
| 2505.20171 | Long-Context State-Space Video World Models |
| 2505.20347 | SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data |
| 2505.21444 | Can Large Reasoning Models Self-Train? |
| 2505.21996 | VRAG: Learning World Models for Interactive Video Generation |
| 2505.22179 | Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design |
| 2505.22323 | Advancing Expert Specialization for Better MoE |
| 2505.22560 | Geometric Hyena Networks for Large-scale Equivariant Learning |
| 2505.22922 | Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking |
| 2505.23884 | Test-Time Training Done Right |
| 2506.00045 | ACE-Step: A Step Towards Music Generation Foundation Model |
| 2506.00054 | Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers |
| 2506.00477 | Flashbacks to Harmonize Stability and Plasticity in Continual Learning |
| 2506.01963 | Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons |
| 2506.02096 | SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis |
| 2506.03320 | The Future of Continual Learning in the Era of Foundation Models: Three Key Directions |
| 2506.03951 | Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective |
| 2506.05176 | Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models |
| 2506.05584 | TabFlex: Scaling Tabular Learning to Millions with Linear Attention |
| 2506.09227 | SoK: Machine Unlearning for Large Language Models |
| 2506.09985 | V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning |
| 2506.11687 | Differential Privacy in Machine Learning: A Survey from Symbolic AI to LLMs |
| 2506.12286 | The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason |
| 2506.12928 | Scaling Test-time Compute for LLM Agents |
| 2506.14098 | Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks |
| 2506.14245 | Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs |
| 2506.15742 | FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space |
| 2506.17298 | Mercury: Ultra-Fast Language Models Based on Diffusion |
| 2506.17671 | TPTT: Transforming Pretrained Transformers into Titans |
| 2506.20743 | A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools |
| 2506.21328 | Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts |
| 2506.23589 | Transition Matching: Scalable and Flexible Generative Modeling |
| 2507.02076 | Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs |
| 2507.04771 | Efficient Unlearning with Privacy Guarantees |
| 2507.06457 | A Systematic Analysis of Hybrid Linear Attention |
| 2507.09404 | Scaling Laws for Optimal Data Mixtures |
| 2507.10085 | Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning |
| 2507.11005 | AdaMuon: Adaptive Muon Optimizer |
| 2507.15855 | Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline |
| 2507.17702 | Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models |
| 2507.17801 | Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling |
| 2507.20198 | A Survey of Token Compression for Efficient Multimodal Large Language Models |
| 2508.03613 | Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction |
| 2508.06743 | Analysis of Schedule-Free Nonconvex Optimization |
| 2508.06924 | AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning |
| 2508.10104 | DINOv3: Self-supervised learning for vision at unprecedented scale |
| 2508.13730 | On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions |
| 2508.18265 | InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency |
| 2509.00691 | CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders |
| 2509.01440 | Benchmarking Optimizers for Large Language Model Pretraining |
| 2509.02547 | The Landscape of Agentic Reinforcement Learning for LLMs: A Survey |
| 2509.03378 | Understanding and Improving Shampoo and SOAP via Kullback–Leibler Minimization |
| 2509.04474 | Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling |
| 2509.06457 | Seasonal forecasting using the GenCast probabilistic machine learning model |
| 2509.09679 | ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms |
| 2509.09734 | MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools |
| 2509.12539 | LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations |
| 2509.12892 | Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings |
| 2509.16941 | SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? |
| 2509.21318 | SD3.5-Flash: Distribution-Guided Distillation of Generative Flows |
| 2509.23045 | Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents |
| 2509.23314 | Two-Scale Latent Dynamics for Recurrent-Depth Transformers |
| 2509.23661 | LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training |
| 2509.23678 | Towards a Comprehensive Scaling Law of Mixture-of-Experts |
| 2509.23933 | Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms |
| 2509.24389 | LLaDA-MoE: A Sparse MoE Diffusion Language Model |
| 2509.24510 | Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models |
| 2509.24526 | CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models |
| 2509.25127 | Score Distillation of Flow Matching Models |
| 2509.25373 | From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models |
| 2510.00742 | How Foundational are Foundation Models for Time Series Forecasting? |
| 2510.01631 | Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls |
| 2510.02259 | Transformers Discover Molecular Structure Without Graph Priors |
| 2510.02300 | Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models |
| 2510.02917 | Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders |
| 2510.03313 | Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining |
| 2510.03342 | Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer |
| 2510.03567 | Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs |
| 2510.04147 | Self Speculative Decoding for Diffusion Large Language Models |
| 2510.05364 | The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures |
| 2510.05491 | NorMuon: Making Muon more efficient and scalable |
| 2510.09586 | Vision Language Models: A Survey of 26K Papers |
| 2510.10223 | You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs |
| 2510.13003 | OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning |
| 2510.13169 | Universally Invariant Learning in Equivariant GNNs |
| 2510.13721 | NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching |
| 2510.15821 | Chronos-2: From Univariate to Universal Forecasting |
| 2510.17896 | Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism |
| 2510.18471 | CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment |
| 2510.21204 | Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models |
| 2510.22733 | E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker |
| 2510.27072 | Towards Understanding Self-play for LLM Reasoning |
| 2511.00040 | Semi-Supervised Preference Optimization with Limited Feedback |
| 2511.01695 | Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding |
| 2511.01815 | KV Cache Transform Coding for Compact Storage in LLM Inference |
| 2511.03690 | The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents |
| 2511.07328 | Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training |
| 2511.09057 | PAN: A World Model for General, Interactable, and Long-Horizon World Simulation |
| 2511.11698 | Moirai 2.0: When Less Is More for Time Series Forecasting |
| 2511.11707 | FSC-Net: Fast-Slow Consolidation Networks for Continual Learning |
| 2511.12181 | MixAR: Mixture Autoregressive Image Generation |
| 2511.12347 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing |
| 2511.15375 | Parameter Importance-Driven Continual Learning for Foundation Models |
| 2511.15992 | Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis |
| 2511.18397 | Natural Emergent Misalignment from Reward Hacking in Production RL |
| 2511.18936 | SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression |
| 2511.21437 | A Systematic Study of In-the-Wild Model Merging for Large Language Models |
| 2511.22009 | StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation |
| 2511.22570 | DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning |
| 2511.22699 | Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer |
| 2512.04268 | The Initialization Determines Whether In-Context Learning Is Gradient Descent |
| 2512.05084 | Gradient Descent with Provably Tuned Learning-rate Schedules |
| 2512.05534 | A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima |
| 2512.05817 | Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws |
| 2512.05916 | KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity |
| 2512.10858 | Scaling Behavior of Discrete Diffusion Language Models |
| 2512.15657 | SoFlow: Solution Flow Models for One-Step Generative Modeling |
| 2512.18470 | SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios |
| 2512.20957 | One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents |
| 2512.23675 | End-to-End Test-Time Training for Long Context |
| 2601.03774 | Scalable Machine Learning Force Fields for Macromolecular Systems Through Long-Range Aware Message Passing |
| 2601.04823 | DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models |
| 2601.10904 | ARC Prize 2025: Technical Report |
| 2601.12560 | Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents |
| 2601.22156 | Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts |
| 2602.01357 | Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning |
| 2602.02571 | Trajectory Consistency for One-Step Generation on Euler Mean Flows |
| 2602.03442 | A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces |
| 2602.04768 | Billion-Scale Graph Foundation Models |
| 2602.11139 | TabICLv2: A better, faster, scalable, and open tabular foundation model |
| 2602.20117 | ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models |
| 2603.01639 | Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning |
| 2603.03597 | NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training |
| 2603.05168 | Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity |
| 2603.09938 | Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions |
| 2603.12658 | Continual Learning in Large Language Models: Methods, Challenges, and Opportunities |
| 2603.13372 | The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning |
| 2603.15569 | Mamba-3: Improved Sequence Modeling using State Space Principles |
| 2603.25248 | ColBERT-Att: Late-Interaction Meets Attention for Enhanced Retrieval |
| 2604.01411 | Test-Time Scaling Makes Overtraining Compute-Optimal |
| 2604.07615 | ADAG: Automatically Describing Attribution Graphs |
| 2604.08178 | Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling |
| 2604.19089 | Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression |
| 2604.20329 | Image Generators are Generalist Vision Learners |
| 2604.24618 | Evaluating whether AI models would sabotage AI safety research |
| 2605.06676 | LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction |
| 2605.22791 | Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention |
| 2605.25979 | LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence |
- Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad — deepmind.google — https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/
- Bring your ideas to life: Veo 2 video generation available for developers — developers.googleblog.com — https://developers.googleblog.com/veo-2-video-generation-now-generally-available/
- Chai-1: Decoding the molecular interactions of life — biorxiv.org — https://www.biorxiv.org/content/10.1101/2024.10.10.615955v1
- Circuit Tracing: Revealing Computational Graphs in Language Models — transformer-circuits.pub — https://transformer-circuits.pub/2025/attribution-graphs/methods.html
- From U-Nets to DiTs: The Architectural Evolution of Text-to-Image Diffusion Models (2021–2025) — iclr-blogposts.github.io — https://iclr-blogposts.github.io/2026/blog/2026/diffusion-architecture-evolution/
- Gemini Diffusion — deepmind.google — https://deepmind.google/models/gemini-diffusion/
- Genie 2: A Large-Scale Foundation World Model — deepmind.google — https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/
- Genie 3: A New Frontier for World Models — deepmind.google — https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/
- Genome modeling and design across all domains of life with Evo 2 — biorxiv.org — https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1
- Mochi 1: A new SOTA in open text-to-video — genmo.ai — https://www.genmo.ai/blog/mochi-1-a-new-sota-in-open-text-to-video
- On the Biology of a Large Language Model — transformer-circuits.pub — https://transformer-circuits.pub/2025/attribution-graphs/biology.html
- Reward Hacking in Reinforcement Learning — lilianweng.github.io — https://lilianweng.github.io/posts/2024-11-28-reward-hacking/
- Simulating 500 million years of evolution with a language model — biorxiv.org — https://www.biorxiv.org/content/10.1101/2024.07.01.600583v1
- SmolLM3: smol, multilingual, long-context reasoner — huggingface.co — https://huggingface.co/blog/smollm3
- The State Of LLMs 2025: Progress, Problems, and Predictions — magazine.sebastianraschka.com — https://magazine.sebastianraschka.com/p/state-of-llms-2025
- The State of Reinforcement Learning for LLM Reasoning — magazine.sebastianraschka.com — https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training
- TimesFM 2.5: Smaller, Longer-Context Foundation Model Leading GIFT-Eval — huggingface.co — https://huggingface.co/google/timesfm-2.5-200m-pytorch
- Titans + MIRAS: Helping AI have long-term memory — research.google — https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/
- Why We Think — lilianweng.github.io — https://lilianweng.github.io/posts/2025-05-01-thinking/
- π₀ and π₀-FAST: Vision-Language-Action Models for General Robot Control — huggingface.co — https://huggingface.co/blog/pi0