Sources & Attribution

This corpus is a curation of publicly available educational material. Every piece of content belongs to its original creator. This file credits every source included. Please support these creators directly.


Courses & Lecture Series (YouTube transcripts)

Stanford University

  • CS229 — Machine Learning (Andrew Ng) — 20 lectures
  • CS230 — Deep Learning (Andrew Ng) — 9 lectures
  • CS231n — CNNs for Visual Recognition — 14 lectures
  • CS224n — NLP with Deep Learning — 46 lectures
  • CS25 — Transformers United — 39 lectures
  • CS236 — Deep Generative Models — 15 lectures
  • CS336 — Language Modeling from Scratch — 15 lectures

MIT

  • 6.S191 — Introduction to Deep Learning — 86 lectures

Independent educators & organizations

  • Andrej Karpathy — main channel + Neural Networks: Zero to Hero — 25 lectures
  • 3Blue1Brown — Neural Networks series (Grant Sanderson) — 9 videos
  • fast.ai — Practical Deep Learning for Coders (Jeremy Howard) — 48 lectures
  • DeepLearning.AI — 49 videos
  • Yannic Kilcher — paper walkthroughs & ML news — 99 videos

Transcripts are auto-generated captions reproduced for research and educational use. All teaching credit belongs to the instructors and their institutions. Please watch the originals and subscribe.


Web Articles & Blogs

  • Jay Alammar (jalammar.github.io) — The Illustrated Transformer; The Illustrated GPT-2; The Illustrated BERT/ELMo; The Illustrated Stable Diffusion; Visualizing Seq2seq with Attention
  • Lilian Weng (lilianweng.github.io) — Attention? Attention!; The Transformer Family v2.0; What are Diffusion Models?; From Autoencoder to Beta-VAE; From GAN to WGAN; Policy Gradient Algorithms; Prompt Engineering; LLM Powered Autonomous Agents
  • Sebastian Raschka (magazine.sebastianraschka.com) — Understanding Large Language Models; Understanding Reasoning LLMs; Improving LoRA (DoRA) from Scratch
  • Andrej Karpathy (karpathy.github.io / Medium) — A Recipe for Training Neural Networks; The Unreasonable Effectiveness of RNNs; Yes you should understand backprop
  • Stanford CS231n notes (cs231n.github.io) — course note pages
  • Dive into Deep Learning (d2l.ai) — selected chapters
  • Distill.pub — A Gentle Introduction to Graph Neural Networks; Understanding Convolutions on Graphs; The Building Blocks of Interpretability
  • Anthropic — Transformer Circuits (transformer-circuits.pub) — A Mathematical Framework for Transformer Circuits; Toy Models of Superposition

Research Papers — originally curated, full text (arXiv — 78)

All papers retain their arXiv ID and URL in the file frontmatter. Distributed under arXiv's non-exclusive license terms; rights remain with the authors. 313 further recent (2024H2–2026) papers and 20 web articles were later added — see Recent additions at the end of this file; each is credited by arXiv ID / URL and authors in its own file frontmatter.

arXiv ID Title
1207.0580 Improving neural networks by preventing co-adaptation of feature detectors (Dropout)
1301.3781 Efficient Estimation of Word Representations in Vector Space (word2vec)
1310.4546 Distributed Representations of Words and Phrases and their Compositionality
1312.6114 Auto-Encoding Variational Bayes (VAE)
1406.2661 Generative Adversarial Networks
1409.0473 Neural Machine Translation by Jointly Learning to Align and Translate
1409.1556 Very Deep Convolutional Networks (VGG)
1409.3215 Sequence to Sequence Learning with Neural Networks
1409.4842 Going Deeper with Convolutions (GoogLeNet)
1412.6980 Adam: A Method for Stochastic Optimization
1502.03167 Batch Normalization
1506.01497 Faster R-CNN
1512.03385 Deep Residual Learning for Image Recognition (ResNet)
1607.06450 Layer Normalization
1608.06993 Densely Connected Convolutional Networks (DenseNet)
1701.06538 Outrageously Large Neural Networks (Sparsely-Gated MoE)
1706.03762 Attention Is All You Need (Transformer)
1707.06347 Proximal Policy Optimization Algorithms (PPO)
1711.00937 Neural Discrete Representation Learning (VQ-VAE)
1804.02767 YOLOv3: An Incremental Improvement
1810.04805 BERT
1812.04948 A Style-Based Generator Architecture for GANs (StyleGAN)
1907.11692 RoBERTa
1910.10683 Exploring the Limits of Transfer Learning (T5)
1911.02150 Fast Transformer Decoding: One Write-Head is All You Need (MQA)
2001.04451 Reformer: The Efficient Transformer
2001.08361 Scaling Laws for Neural Language Models
2004.04906 Dense Passage Retrieval for Open-Domain QA
2004.05150 Longformer: The Long-Document Transformer
2005.11401 Retrieval-Augmented Generation for Knowledge-Intensive NLP (RAG)
2005.12872 End-to-End Object Detection with Transformers (DETR)
2005.14165 Language Models are Few-Shot Learners (GPT-3)
2006.04768 Linformer: Self-Attention with Linear Complexity
2006.11239 Denoising Diffusion Probabilistic Models (DDPM)
2009.03300 Measuring Massive Multitask Language Understanding (MMLU)
2009.14794 Rethinking Attention with Performers
2010.02502 Denoising Diffusion Implicit Models (DDIM)
2010.11929 An Image is Worth 16x16 Words (ViT)
2101.03961 Switch Transformers
2103.00020 Learning Transferable Visual Models From Natural Language Supervision (CLIP)
2104.09864 RoFormer: Rotary Position Embedding (RoPE)
2106.09685 LoRA: Low-Rank Adaptation of Large Language Models
2107.03374 Evaluating Large Language Models Trained on Code (Codex)
2108.12409 Train Short, Test Long: Attention with Linear Biases (ALiBi)
2111.06377 Masked Autoencoders Are Scalable Vision Learners (MAE)
2112.10752 High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
2201.11903 Chain-of-Thought Prompting
2203.02155 Training language models to follow instructions with human feedback (InstructGPT)
2203.11171 Self-Consistency Improves Chain of Thought Reasoning
2203.15556 Training Compute-Optimal Large Language Models (Chinchilla)
2204.06125 Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL·E 2)
2205.14135 FlashAttention
2206.04615 Beyond the Imitation Game (BIG-bench)
2208.07339 LLM.int8()
2210.03629 ReAct: Synergizing Reasoning and Acting in Language Models
2210.17323 GPTQ
2212.09748 Scalable Diffusion Models with Transformers (DiT)
2302.01318 Accelerating LLM Decoding with Speculative Sampling
2302.04761 Toolformer
2302.13971 LLaMA: Open and Efficient Foundation Language Models
2305.10601 Tree of Thoughts
2305.13245 GQA: Grouped-Query Attention
2305.14314 QLoRA
2305.18290 Direct Preference Optimization (DPO)
2306.00978 AWQ: Activation-aware Weight Quantization
2307.08691 FlashAttention-2
2307.09288 Llama 2
2309.06180 Efficient Memory Management for LLM Serving (PagedAttention / vLLM)
2310.06825 Mistral 7B
2312.00752 Mamba: Linear-Time Sequence Modeling with Selective State Spaces
2401.04088 Mixtral of Experts
2404.02258 Mixture-of-Depths
2405.04434 DeepSeek-V2
2405.21060 Transformers are SSMs (State Space Duality / Mamba-2)
2407.08608 FlashAttention-3
2412.19437 DeepSeek-V3 Technical Report
2501.12948 DeepSeek-R1
2502.11089 Native Sparse Attention

A note to creators

If you are the author of any included material and would like it modified or removed, please open an issue or contact the repository owner. See NOTICE.md. Requests will be honored promptly.


Recent additions (2024H2–2026)

These items were added in a later update to keep the library current: 313 arXiv papers (each stored as its verbatim abstract plus metadata; read the full paper at the linked source) and 20 web articles. Every item is credited by its source URL and authors in its own file frontmatter; this section lists them for attribution.

arXiv papers (313)

arXiv ID Title
2408.07666 Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
2409.08239 Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
2409.11402 NVLM: Open Frontier-Class Multimodal LLMs
2409.16040 Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
2410.00037 Moshi: a speech-text foundation model for real-time dialogue
2410.02694 HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly
2410.05779 LightRAG: Simple and Fast Retrieval-Augmented Generation
2410.06293 Accelerated Preference Optimization for Large Language Model Alignment
2410.06885 F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
2410.10393 GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation
2410.10469 Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts
2410.12557 One Step Diffusion via Shortcut Models
2410.14949 On the Convergence and Straightness of Rectified Flow
2410.15595 A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
2410.16714 Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment
2410.20285 SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
2410.21357 Energy-Based Diffusion Language Models for Text Generation
2410.24164 π₀: A Vision-Language-Action Flow Model for General Robot Control
2411.04872 FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
2411.07975 JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
2411.13676 Hymba: A Hybrid-head Architecture for Small Language Models
2411.14347 DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
2411.15242 The Zamba2 Suite: Technical Report
2411.18674 Active Data Curation Effectively Distills Large-Scale Multimodal Models
2412.03555 PaliGemma 2: A Family of Versatile VLMs for Transfer
2412.03603 HunyuanVideo: A Systematic Framework For Large Video Generative Models
2412.04984 Frontier Models are Capable of In-context Scheming
2412.06464 Gated Delta Networks: Improving Mamba2 with Delta Rule
2412.08905 Phi-4 Technical Report
2412.10117 CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
2412.14093 Alignment Faking in Large Language Models
2412.15115 Qwen2.5 Technical Report
2412.16441 Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees
2412.16906 Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
2412.19048 Jasper and Stella: distillation of SOTA embedding models
2501.00656 2 OLMo 2 Furious
2501.00663 Titans: Learning to Memorize at Test Time
2501.03575 Cosmos World Foundation Model Platform for Physical AI
2501.06322 Multi-Agent Collaboration Mechanisms: A Survey of LLMs
2501.07278 Lifelong Learning of Large Language Model based Agents: A Roadmap
2501.08313 MiniMax-01: Scaling Foundation Models with Lightning Attention
2501.09136 Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
2501.11873 Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
2501.12273 Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
2501.15103 Each Rank Could be an Expert: Single-Ranked Mixture of Experts LoRA for Multi-Task Learning
2501.15383 Qwen2.5-1M Technical Report
2501.17116 Optimizing Large Language Model Training Using FP4 Quantization
2501.17315 A sketch of an AI control safety case
2501.17811 Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
2501.18823 Transcoders Beat Sparse Autoencoders for Interpretability
2501.19393 s1: Simple Test-Time Scaling
2502.00883 SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
2502.01113 GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation
2502.01636 Lifelong Knowledge Editing requires Better Regularization
2502.02672 Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes
2502.02737 SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model
2502.05171 Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
2502.05172 Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient
2502.05564 TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
2502.06766 Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
2502.07272 GENERator: A Long-Context Generative Genomic Foundation Model
2502.07640 Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
2502.07864 TransMLA: Multi-Head Latent Attention Is All You Need
2502.08606 Distillation Scaling Laws
2502.09638 Jailbreaking to Jailbreak
2502.09992 Large Language Diffusion Models
2502.10248 Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
2502.10297 DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
2502.10436 MERGE³: Efficient Evolutionary Merging on Consumer-grade GPUs
2502.12118 Scaling Test-Time Compute Without Verification or RL is Suboptimal
2502.12853 S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
2502.13178 Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis
2502.13189 MoBA: Mixture of Block Attention for Long-Context LLMs
2502.13595 MMTEB: Massive Multilingual Text Embedding Benchmark
2502.13923 Qwen2.5-VL Technical Report
2502.14420 ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model
2502.14837 Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
2502.15304 SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention
2502.15592 Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning
2502.15681 One-step Diffusion Models with f-Divergence Distribution Matching
2502.15828 A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models
2502.16894 Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment (GOAT)
2502.16982 Muon is Scalable for LLM Training
2502.17421 LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
2502.17521 Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
2502.18418 Rank1: Test-Time Compute for Reranking in Information Retrieval
2502.19645 Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
2502.20082 LongRoPE2: Near-Lossless LLM Context Window Scaling
2502.21321 LLM Post-Training: A Deep Dive into Reasoning Large Language Models
2503.00030 RSPO: Regularized Self-Play Alignment of Large Language Models
2503.01743 Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
2503.01840 EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
2503.01854 A Comprehensive Survey of Machine Unlearning Techniques for Large Language Models
2503.03746 Process-based Self-Rewarding Language Models
2503.06639 Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification
2503.08099 Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors
2503.09532 SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
2503.09573 Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
2503.09642 Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k
2503.10677 A Survey on Knowledge-Oriented Retrieval-Augmented Generation
2503.11251 Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
2503.12434 A Survey on the Optimization of Large Language Model-based Agents
2503.13436 Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
2503.14456 RWKV-7 "Goose" with Expressive Dynamic State Evolution
2503.14476 DAPO: An Open-Source LLM Reinforcement Learning System at Scale
2503.14734 GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
2503.18102 AgentRxiv: Towards Collaborative Autonomous Research
2503.18893 xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction
2503.18970 Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba
2503.19551 Scaling Laws of Synthetic Data for Language Models
2503.19786 Gemma 3 Technical Report
2503.20018 Experience Replay Addresses Loss of Plasticity in Continual Learning
2503.20020 Gemini Robotics: Bringing AI into the Physical World
2503.20215 Qwen2.5-Omni Technical Report
2503.20314 Wan: Open and Advanced Large-Scale Video Generative Models
2503.21322 HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation
2503.21614 A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
2503.23278 Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
2504.00254 ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
2504.00891 GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
2504.04011 Foundation Models for Time Series: A Survey
2504.04423 UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
2504.05118 VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
2504.05352 Achieving Binary Weight and Activation for LLMs using Post-Training Quantization
2504.07164 R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents
2504.08247 Millions of States: Designing a Scalable MoE Architecture with RWKV-7 Meta-learner
2504.08528 On The Landscape of Spoken Language Models: A Comprehensive Survey
2504.10479 InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
2504.10612 Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
2504.11343 A Minimalist Approach to LLM Reasoning: from Rejection Sampling to REINFORCE
2504.11354 Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning
2504.12216 d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
2504.12285 BitNet b1.58 2B4T Technical Report
2504.12637 Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation
2504.13837 Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
2504.15573 Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
2504.16054 π₀.₅: A Vision-Language-Action Model with Open-World Generalization
2504.16084 TTRL: Test-Time Reinforcement Learning
2504.16828 Process Reward Models That Think
2504.18415 BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
2504.20571 Reinforcement Learning for Reasoning in Large Language Models with One Training Example
2504.21233 Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math
2504.21318 Phi-4-reasoning Technical Report
2504.21463 RWKV-X: A Linear Complexity Hybrid Language Model
2504.21801 DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition
2505.01420 Evaluating Frontier Models for Stealth and Situational Awareness
2505.02567 Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
2505.02665 A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law
2505.08827 RLSR: Reinforcement Learning from Self Reward
2505.09388 Qwen3 Technical Report
2505.11831 ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
2505.12435 SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment
2505.13447 Mean Flows for One-step Generative Modeling
2505.14357 Vid2World: Crafting Video Diffusion Models to Interactive World Models
2505.14415 Table Foundation Models: on knowledge pre-training for tabular learning
2505.14432 Rank-K: Test-Time Reasoning for Listwise Reranking
2505.14683 Emerging Properties in Unified Multimodal Pretraining
2505.15116 Graph Foundation Models: A Comprehensive Survey
2505.16324 From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation
2505.16831 Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
2505.16933 LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
2505.16944 AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
2505.18774 Disentangling Knowledge Representations for Large Language Model Editing
2505.19115 FP4 All the Way: Fully Quantized Training of LLMs
2505.19770 Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
2505.20003 TabPFN: One Model to Rule Them All?
2505.20171 Long-Context State-Space Video World Models
2505.20347 SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data
2505.21444 Can Large Reasoning Models Self-Train?
2505.21996 VRAG: Learning World Models for Interactive Video Generation
2505.22179 Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
2505.22323 Advancing Expert Specialization for Better MoE
2505.22560 Geometric Hyena Networks for Large-scale Equivariant Learning
2505.22922 Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking
2505.23884 Test-Time Training Done Right
2506.00045 ACE-Step: A Step Towards Music Generation Foundation Model
2506.00054 Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers
2506.00477 Flashbacks to Harmonize Stability and Plasticity in Continual Learning
2506.01963 Breaking Quadratic Barriers: A Non-Attention LLM for Ultra-Long Context Horizons
2506.02096 SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
2506.03320 The Future of Continual Learning in the Era of Foundation Models: Three Key Directions
2506.03951 Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective
2506.05176 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
2506.05584 TabFlex: Scaling Tabular Learning to Millions with Linear Attention
2506.09227 SoK: Machine Unlearning for Large Language Models
2506.09985 V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
2506.11687 Differential Privacy in Machine Learning: A Survey from Symbolic AI to LLMs
2506.12286 The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
2506.12928 Scaling Test-time Compute for LLM Agents
2506.14098 Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks
2506.14245 Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
2506.15742 FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space
2506.17298 Mercury: Ultra-Fast Language Models Based on Diffusion
2506.17671 TPTT: Transforming Pretrained Transformers into Titans
2506.20743 A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools
2506.21328 Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
2506.23589 Transition Matching: Scalable and Flexible Generative Modeling
2507.02076 Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs
2507.04771 Efficient Unlearning with Privacy Guarantees
2507.06457 A Systematic Analysis of Hybrid Linear Attention
2507.09404 Scaling Laws for Optimal Data Mixtures
2507.10085 Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning
2507.11005 AdaMuon: Adaptive Muon Optimizer
2507.15855 Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
2507.17702 Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
2507.17801 Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling
2507.20198 A Survey of Token Compression for Efficient Multimodal Large Language Models
2508.03613 Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
2508.06743 Analysis of Schedule-Free Nonconvex Optimization
2508.06924 AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning
2508.10104 DINOv3: Self-supervised learning for vision at unprecedented scale
2508.13730 On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions
2508.18265 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
2509.00691 CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
2509.01440 Benchmarking Optimizers for Large Language Model Pretraining
2509.02547 The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
2509.03378 Understanding and Improving Shampoo and SOAP via Kullback–Leibler Minimization
2509.04474 Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
2509.06457 Seasonal forecasting using the GenCast probabilistic machine learning model
2509.09679 ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
2509.09734 MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools
2509.12539 LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
2509.12892 Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
2509.16941 SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
2509.21318 SD3.5-Flash: Distribution-Guided Distillation of Generative Flows
2509.23045 Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
2509.23314 Two-Scale Latent Dynamics for Recurrent-Depth Transformers
2509.23661 LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
2509.23678 Towards a Comprehensive Scaling Law of Mixture-of-Experts
2509.23933 Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms
2509.24389 LLaDA-MoE: A Sparse MoE Diffusion Language Model
2509.24510 Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models
2509.24526 CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models
2509.25127 Score Distillation of Flow Matching Models
2509.25373 From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
2510.00742 How Foundational are Foundation Models for Time Series Forecasting?
2510.01631 Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
2510.02259 Transformers Discover Molecular Structure Without Graph Priors
2510.02300 Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
2510.02917 Mechanistic Interpretability of Code Correctness in LLMs via Sparse Autoencoders
2510.03313 Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining
2510.03342 Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer
2510.03567 Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
2510.04147 Self Speculative Decoding for Diffusion Large Language Models
2510.05364 The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
2510.05491 NorMuon: Making Muon more efficient and scalable
2510.09586 Vision Language Models: A Survey of 26K Papers
2510.10223 You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs
2510.13003 OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning
2510.13169 Universally Invariant Learning in Equivariant GNNs
2510.13721 NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
2510.15821 Chronos-2: From Univariate to Universal Forecasting
2510.17896 Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
2510.18471 CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
2510.21204 Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models
2510.22733 E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
2510.27072 Towards Understanding Self-play for LLM Reasoning
2511.00040 Semi-Supervised Preference Optimization with Limited Feedback
2511.01695 Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
2511.01815 KV Cache Transform Coding for Compact Storage in LLM Inference
2511.03690 The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents
2511.07328 Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
2511.09057 PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
2511.11698 Moirai 2.0: When Less Is More for Time Series Forecasting
2511.11707 FSC-Net: Fast-Slow Consolidation Networks for Continual Learning
2511.12181 MixAR: Mixture Autoregressive Image Generation
2511.12347 VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
2511.15375 Parameter Importance-Driven Continual Learning for Foundation Models
2511.15992 Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis
2511.18397 Natural Emergent Misalignment from Reward Hacking in Production RL
2511.18936 SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression
2511.21437 A Systematic Study of In-the-Wild Model Merging for Large Language Models
2511.22009 StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation
2511.22570 DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
2511.22699 Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
2512.04268 The Initialization Determines Whether In-Context Learning Is Gradient Descent
2512.05084 Gradient Descent with Provably Tuned Learning-rate Schedules
2512.05534 A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
2512.05817 Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws
2512.05916 KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
2512.10858 Scaling Behavior of Discrete Diffusion Language Models
2512.15657 SoFlow: Solution Flow Models for One-Step Generative Modeling
2512.18470 SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
2512.20957 One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents
2512.23675 End-to-End Test-Time Training for Long Context
2601.03774 Scalable Machine Learning Force Fields for Macromolecular Systems Through Long-Range Aware Message Passing
2601.04823 DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models
2601.10904 ARC Prize 2025: Technical Report
2601.12560 Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents
2601.22156 Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
2602.01357 Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning
2602.02571 Trajectory Consistency for One-Step Generation on Euler Mean Flows
2602.03442 A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
2602.04768 Billion-Scale Graph Foundation Models
2602.11139 TabICLv2: A better, faster, scalable, and open tabular foundation model
2602.20117 ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models
2603.01639 Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning
2603.03597 NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training
2603.05168 Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
2603.09938 Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions
2603.12658 Continual Learning in Large Language Models: Methods, Challenges, and Opportunities
2603.13372 The ARC of Progress towards AGI: A Living Survey of Abstraction and Reasoning
2603.15569 Mamba-3: Improved Sequence Modeling using State Space Principles
2603.25248 ColBERT-Att: Late-Interaction Meets Attention for Enhanced Retrieval
2604.01411 Test-Time Scaling Makes Overtraining Compute-Optimal
2604.07615 ADAG: Automatically Describing Attribution Graphs
2604.08178 Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
2604.19089 Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
2604.20329 Image Generators are Generalist Vision Learners
2604.24618 Evaluating whether AI models would sabotage AI safety research
2605.06676 LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
2605.22791 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
2605.25979 LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Web articles (20)