Skip to content

Add tutorial: Qwen3.6-27B with MTP (Multi-Token Prediction) on Jetson Thor / AGX Orin for local agentic coding #397

@tokk-nv

Description

@tokk-nv

Community signal

Multiple high-engagement posts this week show massive interest in Qwen3.6-27B with MTP (Multi-Token Prediction) for local agentic coding:

The narrative is clear: Qwen3.6-27B + MTP + a 48GB budget = first viable local replacement for Claude Code / Codex at 262k context.

Why this matters for Jetson

This is a perfect fit for Jetson Thor (128GB) and AGX Orin 64GB — the memory and bandwidth make 27B dense at Q4–Q8 with speculative/MTP decoding a headline use case. Jetson AI Lab already has a Qwen3.6 27B model card but does not cover the MTP draft-model flow, which is what's unlocking 2.5x throughput and making agentic coding usable locally.

Suggested tutorial scope

  • Build llama.cpp with the MTP PR (#22673) on JetPack 7.x for Thor and JetPack 6.x for AGX Orin
  • Run Qwen3.6-27B Q4_K_XL / Q5_K_XL / Q8 with MTP draft head; measure tok/s on Thor, AGX Orin 64GB, and Orin NX 16GB (where feasible)
  • Compare against non-MTP baseline and against Gemma 4 31B MTP for agentic coding
  • Wire up drop-in OpenAI / Anthropic API endpoints so users can plug it into OpenCode, Aider, Continue.dev
  • Report 262k-context memory footprint, prefill latency, and slot-reuse tricks (see --slots trick from the Ralph-loop post)
  • Include quality-vs-quant comparison aligned with the community's BF16/Q8/Q6/Q4/IQ4/IQ3 matrix

Filed by JetsonPulse

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions