RVV integration with saturn and xnnpack #42
Merged
42 commits:
- 3ca0652 add skeleton for rvv/saturn stuff (charleshong3)
- a9f158f added saturn backend with working spike and firesim, also added a qs8… (keaganchern)
- 7405f69 implement saturn rvv agent (charleshong3)
- 9320655 change rvv agent to saturn agent (charleshong3)
- 2122675 Merge branch 'rvv-agent' into rvv (charleshong3)
- a831d9b Merge branch 'rvv' of github.com:ucb-bar/autocomp into rvv (charleshong3)
- f88380a update search.py to integrate saturn backend (charleshong3)
- 94c4b43 delete rvv agent (charleshong3)
- bf0434a added Zephyr setup for saturn setup (keaganchern)
- 4cea5c8 updated firesim setup (keaganchern)
- 914d429 update saturn eval docs (charleshong3)
- fbd5e5e update paths (charleshong3)
- 6456577 update saturn backend (charleshong3)
- a51b990 added double curly for f string complile error (keaganchern)
- 0d1943d added latency to result for compatibility (keaganchern)
- 43968a2 updated noinline for concat firesim (keaganchern)
- 2c7ab59 redid parsing and also xnnpack can compile in zephyr now (keaganchern)
- a5c07fb updated saturn_setup.md
- 6b857ec updated saturn_setup.md (huijae-an)
- c4c67ca another update to saturn_setup.md (huijae-an)
- db467ca merge main into rvv (charleshong3)
- 62b44b1 added f32 test (keaganchern)
- 3290e12 Merge branch 'rvv-3' into rvv (keaganchern)
- 13960ca fixes (charleshong3)
- d9ae003 fix cmakelists (charleshong3)
- 2c75caa fixes (charleshong3)
- 2085fe7 added better para (keaganchern)
- 6552cb3 build and spike together
- 61e5f2e fix saturn_isa_generator to not give llm unnecessary work (charleshong3)
- 83ad13a redefined functions from prob (keaganchern)
- 6f02811 split common and xnnpack to 2 specific backends
- 0c26210 merge main into rvv
- 5bd9eb0 merging main into rvv
- f16d434 updating saturn and xnnpack, ready for upstream
- ae0771a merging main into rvv before upstream
- 8f9f0d1 readme with saturn
- dda7a1b editing file names, docs, and translation menu
- b8ea854 Update README with new links for hardware targets (charleshong3)
- 0ced933 updating pathing issues
- 0fb2ef1 Merge branch 'rvv' of github.com:ucb-bar/autocomp into rvv
- bfeee9b Revise Saturn target names and update changelog (charleshong3)
- 75d1ad3 Consolidate Saturn (RVV) and XNNPACK entries in README (charleshong3)
```
@@ -177,4 +177,4 @@ output/
wandb/
config/
tmp*/
temp*/
```
**autocomp/agent_builder/.built/saturn-rvv/agent_config.yaml** (12 additions, 0 deletions)
```yaml
agent_name: saturn-rvv
version: '1.0'
built_at: '2026-03-31T20:55:15+00:00'
build:
  main_model: gemini-3.1-pro-preview
  light_model: gemini-3.1-flash-lite-preview
agent_scope: Optimizing RVV (RISC-V Vector) intrinsics kernel code on Saturn (all
  code written in RVV intrinsics and not in RVV asm). The agent rewrites
  single-kernel source code for better performance.
sources:
- type: directory
  path: /scratch/kchern2/autocomp-demo/saturn-docs
description: Auto-generated agent config for saturn-rvv
```
**Hardware Architecture Summary: Saturn RISC-V Vector (RVV) Microarchitecture**

**Overview and Programming Model**
Saturn is a parameterized, short-vector microarchitecture implementing the RISC-V Vector (RVV) extension, designed primarily for domain-specialized, DSP, and embedded cores. It employs a Decoupled Access-Execute (DAE) design where the Vector Load-Store Unit (VLSU) and Vector Datapath (VU) operate independently. Saturn executes vector instructions strictly post-commit relative to the scalar core, meaning vector instructions are non-speculative. It relies on efficient dynamic scheduling of short-chime vector instructions and limited out-of-order execution between independent sequencing paths, rather than costly register renaming or deep out-of-order scalar integration. The programming model uses standard RVV intrinsics, heavily leveraging dynamic vector length (`vl`), vector type (`vtype`), and register grouping (`LMUL`).

**Memory Hierarchy**
* **Standard Memory Interface:** The VLSU typically bypasses the scalar L1 cache to access a coherent backing memory or a high-bandwidth, software-managed Tightly-Coupled Memory (TCM). It processes unit-stride accesses at full memory bandwidth.
* **Scatter-Gather TCM (SGTCM):** For high-throughput indexed (scatter/gather) accesses, Saturn can integrate a specialized, non-cacheable, deeply-banked SGTCM with parallel byte-wide ports. Without SGTCM, standard strided and indexed memory operations are bottlenecked to generating only one element address per cycle.
* **Segmented Accesses:** Saturn features dedicated double-buffered segment buffers (LSB/SSB) that efficiently handle segmented loads/stores (e.g., `vlseg`, `vsseg`). These instructions perform on-the-fly array-of-structs to struct-of-arrays repacking and are highly optimized, generally saturating memory bandwidth.
* **Memory Disambiguation:** Hardware performs precise early-stage scalar-vector and vector-vector memory disambiguation. However, vector memory instructions cannot begin execution if there are pending older scalar stores in the scalar store buffer.

**Compute Units**
* **Datapath Width (DLEN):** The fundamental unit of compute and register access is the "element group," which is `DLEN` bits wide. The hardware processes 1 element group per cycle, regardless of element width (ELEN).
* **Chime Length:** The base occupancy of a vector instruction is `VLEN/DLEN` cycles. Using register grouping (`LMUL`) extends this to `LMUL * (VLEN/DLEN)` cycles.
* **Sequencers:** The backend is divided into independent, single-issue, in-order sequencers: Load (VLS), Store (VSS), Execute/Arithmetic (VXS), and Special (VPS - for index generation, slides, gathers, compress, reductions).
* **Functional Units:** FUs are pipelined but lack direct FU-to-FU bypasses. Typical latencies: Integer ALU/Shift/Bitmanip (1-2 stages), Integer Multiply (3 stages), FMA (4 stages). Divide and square root use iterative, non-pipelined units.
* **Issue Topologies:** Depending on the specific Saturn configuration, integer and floating-point operations may share a single sequencer (Unified), use separate sequencers fed by a shared queue (Shared), or use fully independent sequencers and queues (Split/Multi-ALU).

**Key Constraints and Code Optimization Guidelines**
* **Maximize LMUL:** Because Saturn is a short-vector machine, low `LMUL` (e.g., 1) results in very short chimes (e.g., 2-4 cycles), which can expose pipeline latencies (e.g., a 4-stage FMA will stall dependent instructions if the chime is only 2 cycles). Always use the largest `LMUL` possible that avoids vector register spilling to increase chime length, hide pipeline latencies, and reduce scalar instruction fetch pressure.
* **Leverage Chaining via Instruction Interleaving:** Saturn supports vector chaining at the `DLEN` (element-group) granularity through the vector register file. Because sequencers are in-order, chaining only occurs between instructions occupying *different* sequencers (e.g., a load chaining into an arithmetic operation). Interleave independent memory and arithmetic intrinsics to maximize concurrent sequencer utilization.
* **Avoid Vector-to-Scalar Writes in Inner Loops:** Because vector instructions execute post-commit, any vector instruction that writes to a scalar register (e.g., `vfmv.f.s`, or vector reductions yielding a scalar) will cause a Read-After-Write (RAW) hazard that severely stalls the scalar pipeline. Keep reductions and scalar extractions outside of performance-critical inner loops.
* **Minimize `vsetvl` Bubbles:** Depending on the host scalar core (e.g., Rocket), changing `vtype` or `vl` can introduce pipeline bubbles. Group operations of the same element width and LMUL together to minimize the frequency of `vsetvl` transitions.
* **Prefer Segmented Loads over Manual Repacking:** Use RVV segmented load/store intrinsics for interleaved data (like complex numbers or RGB pixels) rather than loading raw vectors and manually permuting them, as Saturn's segment buffers handle this at near full memory bandwidth.
* **Avoid Standard Strided/Indexed Accesses:** Unless the target system explicitly features an SGTCM, avoid `vlsseg` (strided) and `vluxei` (indexed) intrinsics, as they execute at a slow rate of 1 element per cycle.
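The LMUL and interleaving guidelines above can be sketched as a minimal kernel shape. This is a hypothetical example, not taken from the Saturn docs: the function name `scale2_rvv` and the choice of two independent streams are illustrative; the intrinsics are the standard `__riscv_` v1.0 API.

```c
#include <stddef.h>
#include <riscv_vector.h>

// Hypothetical sketch: scale two independent arrays in one loop.
// - e32m8 gives the largest LMUL, so each instruction occupies a long
//   chime (LMUL * VLEN/DLEN cycles) that hides the 4-stage FMA/FMUL latency.
// - Both loads are issued before the arithmetic, so the in-order load (VLS)
//   and execute (VXS) sequencers run concurrently and chain through the
//   vector register file at element-group granularity.
// - vtype/vl never change inside the loop, so there is one vsetvl per
//   iteration and no extra vsetvl bubbles.
void scale2_rvv(size_t n, float a, float *x, float *y) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m8(n);
    vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl); // load stream 1
    vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl); // load stream 2 issues
                                                    // while vx is computed on
    vx = __riscv_vfmul_vf_f32m8(vx, a, vl);
    vy = __riscv_vfmul_vf_f32m8(vy, a, vl);
    __riscv_vse32_v_f32m8(x, vx, vl);
    __riscv_vse32_v_f32m8(y, vy, vl);
  }
}
```

Note that the sketch also respects the vector-to-scalar guideline: nothing in the loop body moves a vector result into a scalar register.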
**autocomp/agent_builder/.built/saturn-rvv/code_examples.md** (239 additions, 0 deletions)
## rvv-intrinsic-examples.md

SUMMARY: This document provides a collection of C code examples demonstrating the use of RISC-V Vector (RVV) intrinsics to implement common computational kernels, including memory operations, arithmetic, matrix multiplication, string manipulation, and conditional control flow.

```c
#include <riscv_vector.h>

void *memcpy_rvv(void *restrict destination, const void *restrict source,
                 size_t n) {
  unsigned char *dst = destination;
  const unsigned char *src = source;
  // copy data in chunks of up to vl bytes per iteration
  for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
    vl = __riscv_vsetvl_e8m8(n);
    // Load src[0..vl)
    vuint8m8_t vec_src = __riscv_vle8_v_u8m8(src, vl);
    // Store dst[0..vl)
    __riscv_vse8_v_u8m8(dst, vec_src, vl);
  }
  return destination;
}
```

```c
void saxpy_rvv(size_t n, const float a, const float *x, float *y) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m8(n);
    // Load x[i..i+vl)
    vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
    // Load y[i..i+vl)
    vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
    // Computes vy[0..vl) + a*vx[0..vl)
    // and stores it in y[i..i+vl)
    __riscv_vse32_v_f32m8(y, __riscv_vfmacc_vf_f32m8(vy, a, vx, vl), vl);
  }
}
```

```c
void matmul_rvv(double *a, double *b, double *c, int n, int m, int p) {
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j) {
      double *ptr_a = &a[i * p];
      double *ptr_b = &b[j];
      int k = p;
      // Set accumulator to zero.
      vfloat64m1_t vec_s = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
      vfloat64m1_t vec_zero = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
      for (size_t vl; k > 0; k -= vl, ptr_a += vl, ptr_b += vl * m) {
        vl = __riscv_vsetvl_e64m1(k);

        // Load row a[i][k..k+vl)
        vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(ptr_a, vl);
        // Load column b[k..k+vl)[j]
        vfloat64m1_t vec_b =
            __riscv_vlse64_v_f64m1(ptr_b, sizeof(double) * m, vl);

        // Accumulate dot product of row and column. If vl < vlmax we need to
        // preserve the existing values of vec_s, hence the tu policy.
        vec_s = __riscv_vfmacc_vv_f64m1_tu(vec_s, vec_a, vec_b, vl);
      }

      // Final accumulation.
      vfloat64m1_t vec_sum =
          __riscv_vfredusum_vs_f64m1_f64m1(vec_s, vec_zero, vlmax);
      double sum = __riscv_vfmv_f_s_f64m1_f64(vec_sum);
      c[i * m + j] = sum;
    }
}
```

```c
char *strcpy_rvv(char *destination, const char *source) {
  unsigned char *dst = (unsigned char *)destination;
  const unsigned char *src = (const unsigned char *)source;
  size_t vlmax = __riscv_vsetvlmax_e8m8();
  long first_set_bit = -1;

  // This loop stops when among the loaded bytes we find the null byte
  // of the string, i.e., when first_set_bit >= 0
  for (size_t vl; first_set_bit < 0; src += vl, dst += vl) {
    // Load up to vlmax elements if possible (fault-only-first sets vl).
    vuint8m8_t vec_src = __riscv_vle8ff_v_u8m8(src, &vl, vlmax);

    // Mask that states where null bytes are in the loaded bytes.
    vbool1_t string_terminate = __riscv_vmseq_vx_u8m8_b1(vec_src, 0, vl);

    // If the null byte is not in the loaded bytes the resulting mask will
    // be all ones; otherwise only the elements up to and including the
    // first null byte will be enabled.
    vbool1_t mask = __riscv_vmsif_m_b1(string_terminate, vl);

    // Store the enabled elements as determined by the mask above.
    __riscv_vse8_v_u8m8_m(mask, dst, vec_src, vl);

    // Determine if we found the null byte in the loaded bytes.
    first_set_bit = __riscv_vfirst_m_b1(string_terminate, vl);
  }
  return destination;
}
```

```c
void branch_rvv(double *a, double *b, double *c, int n, double constant) {
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  vfloat64m1_t vec_constant = __riscv_vfmv_v_f_f64m1(constant, vlmax);
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl, c += vl) {
    vl = __riscv_vsetvl_e64m1(n);

    // Load a[i..i+vl)
    vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(a, vl);
    // Load b[i..i+vl)
    vfloat64m1_t vec_b = __riscv_vle64_v_f64m1(b, vl);

    // Compute a mask whose enabled elements will correspond to the
    // elements of b that are not zero.
    vbool64_t mask = __riscv_vmfne_vf_f64m1_b64(vec_b, 0.0, vl);

    // Use mask undisturbed policy to compute the division for the
    // elements enabled in the mask, otherwise set them to the given
    // constant above (maskedoff).
    vfloat64m1_t vec_c = __riscv_vfdiv_vv_f64m1_mu(
        mask, /*maskedoff*/ vec_constant, vec_a, vec_b, vl);

    // Store into c[i..i+vl)
    __riscv_vse64_v_f64m1(c, vec_c, vl);
  }
}
```

```c
void reduce_rvv(double *a, double *b, double *result_sum, int *result_count,
                int n) {
  int count = 0;
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  vfloat64m1_t vec_zero = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
  vfloat64m1_t vec_s = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl) {
    vl = __riscv_vsetvl_e64m1(n);

    // Load a[i..i+vl)
    vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(a, vl);
    // Load b[i..i+vl)
    vfloat64m1_t vec_b = __riscv_vle64_v_f64m1(b, vl);

    // Compute a mask whose enabled elements will correspond to the
    // elements of a that are not 42.
    vbool64_t mask = __riscv_vmfne_vf_f64m1_b64(vec_a, 42.0, vl);

    // vec_s[e] <- vec_s[e] + vec_a[e] * vec_b[e], if mask[e] is enabled
    vec_s = __riscv_vfmacc_vv_f64m1_tumu(mask, vec_s, vec_a, vec_b, vl);

    // Adds to count the number of elements in mask that are enabled.
    count += __riscv_vcpop_m_b64(mask, vl);
  }

  vfloat64m1_t vec_sum;
  // Final accumulation.
  vec_sum = __riscv_vfredusum_vs_f64m1_f64m1(vec_s, vec_zero, vlmax);
  double sum = __riscv_vfmv_f_s_f64m1_f64(vec_sum);

  // Return values.
  *result_sum = sum;
  *result_count = count;
}
```

## rvvop.pdf:page_2

SUMMARY: This document provides optimization guidelines for RISC-V Vector (RVV) intrinsics, focusing on LMUL selection, instruction variant preferences, and efficient memory access patterns for various data structures.

```c
// Adding 1.0 to each element of an array of 32-bit floats
// (Note: Example assumes standard RVV intrinsic naming conventions)
vfloat32m1_t vec = vle32_v_f32m1(ptr, vl);
vec = vfadd_vf_f32m1(vec, 1.0f, vl);
```

```c
// Broadcast 3 across all elements of the register group starting at v8
vint32m1_t v8 = vmv_v_x_i32m1(3, vl);
```

```c
// Splat alternating values of 0xaaaaaaaa and 0xbbbbbbbb into v2 using a masked merge
vint32m1_t v2 = vmv_v_x_i32m1(0xaaaaaaaa, vl);
vbool32_t mask = vmsne_vx_i32m1_b32(vindex, 0, vl); // Assuming vindex defines the pattern
v2 = vmerge_vxm_i32m1(mask, v2, 0xbbbbbbbb, vl);    // integer merge (vfmerge is float-only)
```

```c
// Set the first element of a vector register to 2 and the remaining elements to 0
vint32m1_t v = vmv_v_x_i32m1(0, vl); // splat 0 (the compiler selects vmv.v.i)
v = vmv_s_x_i32m1(v, 2, vl);
```

```c
// Copying an array of bytes whose size is a multiple of the maximum vector
// length, using full-length unit-stride loads/stores
// a0: destination, a1: source, a2: number of bytes
for (; a2 > 0; a2 -= vl) {
  vl = vsetvlmax_e8m8();
  vint8m8_t data = vle8_v_i8m8(a1, vl); // unit-stride load at full bandwidth
  vse8_v_i8m8(a0, data, vl);            // (a stride-1 vlse8/vsse8 pair would
                                        // crawl at one element per cycle)
  a1 += vl;
  a0 += vl;
}
```

## rvvop.pdf:page_3

SUMMARY: This document demonstrates how to use RISC-V Vector (RVV) unit-stride segment load instructions to unpack interleaved RGB data into separate color channels for grayscale conversion. It highlights the performance benefits of using vector-vector (.vv) instructions over scalar-vector variants to minimize register transfer overhead.

```c
#include <riscv_vector.h>

void rgb_to_grayscale(const uint8_t *src, uint8_t *dst, size_t n) {
  for (size_t vl; n > 0; n -= vl) {
    vl = __riscv_vsetvl_e8m1(n);

    // Load interleaved RGB data into three separate vector registers
    vuint8m1x3_t rgb = __riscv_vlseg3e8_v_u8m1x3(src, vl);
    vuint8m1_t r = __riscv_vget_v_u8m1x3_u8m1(rgb, 0);
    vuint8m1_t g = __riscv_vget_v_u8m1x3_u8m1(rgb, 1);
    vuint8m1_t b = __riscv_vget_v_u8m1x3_u8m1(rgb, 2);

    // Compute grayscale: (R + G + B) / 3 (simplified example)
    vuint8m1_t sum = __riscv_vadd_vv_u8m1(r, g, vl);
    sum = __riscv_vadd_vv_u8m1(sum, b, vl);
    vuint8m1_t gray = __riscv_vdivu_vx_u8m1(sum, 3, vl);

    // Store the result using a unit-stride store
    __riscv_vse8_v_u8m1(dst, gray, vl);

    src += vl * 3;
    dst += vl;
  }
}
```
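The packing direction works the same way with a unit-stride segment store. This is a sketch, not part of the PR: `planes_to_rgb` is a hypothetical name, and it assumes the v1.0 intrinsics `__riscv_vcreate_v_u8m1x3` and `__riscv_vsseg3e8_v_u8m1x3` are available in the toolchain.

```c
#include <stdint.h>
#include <riscv_vector.h>

// Sketch: interleave three separate channel planes back into RGB triples
// with a unit-stride segment store. Per the Saturn notes, segment stores go
// through the dedicated store segment buffer and run near full memory
// bandwidth, unlike strided stores.
void planes_to_rgb(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                   uint8_t *dst, size_t n) {
  for (size_t vl; n > 0; n -= vl, r += vl, g += vl, b += vl, dst += vl * 3) {
    vl = __riscv_vsetvl_e8m1(n);
    vuint8m1_t vr = __riscv_vle8_v_u8m1(r, vl);
    vuint8m1_t vg = __riscv_vle8_v_u8m1(g, vl);
    vuint8m1_t vb = __riscv_vle8_v_u8m1(b, vl);
    // Pack the three registers into a tuple and store them interleaved.
    vuint8m1x3_t rgb = __riscv_vcreate_v_u8m1x3(vr, vg, vb);
    __riscv_vsseg3e8_v_u8m1x3(dst, rgb, vl);
  }
}
```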