Merged
42 commits
3ca0652
add skeleton for rvv/saturn stuff
charleshong3 Jan 24, 2026
a9f158f
added saturn backend with working spike and firesim, also added a qs8…
keaganchern Jan 31, 2026
7405f69
implement saturn rvv agent
charleshong3 Jan 31, 2026
9320655
change rvv agent to saturn agent
charleshong3 Jan 31, 2026
2122675
Merge branch 'rvv-agent' into rvv
charleshong3 Jan 31, 2026
a831d9b
Merge branch 'rvv' of github.com:ucb-bar/autocomp into rvv
charleshong3 Jan 31, 2026
f88380a
update search.py to integrate saturn backend
charleshong3 Jan 31, 2026
94c4b43
delete rvv agent
charleshong3 Jan 31, 2026
bf0434a
added Zephyr setup for saturn setup
keaganchern Jan 31, 2026
4cea5c8
updated firesim setup
keaganchern Jan 31, 2026
914d429
update saturn eval docs
charleshong3 Feb 1, 2026
fbd5e5e
update paths
charleshong3 Feb 1, 2026
6456577
update saturn backend
charleshong3 Feb 1, 2026
a51b990
added double curly for f string complile error
keaganchern Feb 5, 2026
0d1943d
added latency to result for compatibility
keaganchern Feb 5, 2026
43968a2
updated noinline for concat firesim
keaganchern Feb 10, 2026
2c7ab59
redid parsing and also xnnpack can compile in zephyr now
keaganchern Feb 11, 2026
a5c07fb
updated saturn_setup.md
Feb 12, 2026
6b857ec
updated saturn_setup.md
huijae-an Feb 12, 2026
c4c67ca
another update to saturn_setup.md
huijae-an Feb 12, 2026
db467ca
merge main into rvv
charleshong3 Mar 2, 2026
62b44b1
added f32 test
keaganchern Mar 2, 2026
3290e12
Merge branch 'rvv-3' into rvv
keaganchern Mar 2, 2026
13960ca
fixes
charleshong3 Mar 2, 2026
d9ae003
fix cmakelists
charleshong3 Mar 2, 2026
2c75caa
fixes
charleshong3 Mar 2, 2026
2085fe7
added better para
keaganchern Mar 2, 2026
6552cb3
build and spike together
Mar 2, 2026
61e5f2e
fix saturn_isa_generator to not give llm unnecessary work
charleshong3 Mar 4, 2026
83ad13a
redefined functions from prob
keaganchern Mar 4, 2026
6f02811
split common and xnnpack to 2 specific backends
Mar 20, 2026
0c26210
merge main into rvv
Mar 20, 2026
5bd9eb0
merging main into rvv
Mar 31, 2026
f16d434
updating saturn and xnnpack, ready for upstream
Apr 9, 2026
ae0771a
merging main into rvv before upstream
Apr 9, 2026
8f9f0d1
readme with saturn
Apr 9, 2026
dda7a1b
editing file names, docs, and translation menu
Apr 9, 2026
b8ea854
Update README with new links for hardware targets
charleshong3 Apr 10, 2026
0ced933
updating pathing issues
Apr 10, 2026
0fb2ef1
Merge branch 'rvv' of github.com:ucb-bar/autocomp into rvv
Apr 10, 2026
bfeee9b
Revise Saturn target names and update changelog
charleshong3 Apr 10, 2026
75d1ad3
Consolidate Saturn (RVV) and XNNPACK entries in README
charleshong3 Apr 10, 2026
2 changes: 1 addition & 1 deletion .gitignore
@@ -177,4 +177,4 @@ output/
wandb/
config/
tmp*/
temp*/
temp*/
12 changes: 10 additions & 2 deletions README.md
@@ -16,7 +16,7 @@

**Autocomp** is an extensible, portable framework for LLM-driven kernel optimization across tensor accelerators. Point it at a kernel, pick your hardware target, and Autocomp speeds it up, automatically.

It already delivers strong results across **[AWS Trainium](https://aws.amazon.com/ai/machine-learning/trainium/)**, **[Google TPU](https://cloud.google.com/tpu)**, **NVIDIA GPUs**, **[Gemmini](https://github.com/ucb-bar/gemmini)**, and the **RISC-V Vector Extension**. Need a new target? The **[Agent Builder](autocomp/agent_builder/README.md)** can spin up a hardware-specific optimization agent from your docs in minutes.
It already delivers strong results across **[AWS Trainium](https://aws.amazon.com/ai/machine-learning/trainium/)**, **[Google TPU](https://cloud.google.com/tpu)**, **[NVIDIA GPUs](https://charleshong3.github.io/blog/autocomp_update.html)**, **[Gemmini](https://github.com/ucb-bar/gemmini)**, and **[RISC-V Vector Processors](https://saturn-vectors.org/)**. Need a new target? The **[Agent Builder](autocomp/agent_builder/README.md)** can spin up a hardware-specific optimization agent from your docs in minutes.

<p align="center">
<a href="https://arxiv.org/abs/2505.18574"><b>📚 Read the paper</b></a>&nbsp;&nbsp;·&nbsp;&nbsp;<b>✏️ Authors:</b> <a href="https://charleshong3.github.io/">Charles Hong</a>, <a href="https://x.com/sahilb17">Sahil Bhatia</a>, <a href="https://people.eecs.berkeley.edu/~akcheung/">Alvin Cheung</a>, <a href="https://people.eecs.berkeley.edu/~ysshao/">Yakun Sophia Shao</a> (UC Berkeley)
@@ -65,6 +65,8 @@ Each hardware target requires two things: an **optimization agent** that knows h
| Google TPU | `built:tpu-v6e` (TPU v6e) | `tpu` ([tpu_setup.md](autocomp/backend/tpu/tpu_setup.md)) |
| Gemmini | `gemmini` | `gemmini` ([gemmini_setup.md](autocomp/backend/gemmini/gemmini_setup.md)) |
| NVIDIA GPU | `cuda` | `kernelbench` ([kb_setup.md](autocomp/backend/kernelbench/kb_setup.md))<br>`gpumode` ([gpumode_setup.md](autocomp/backend/gpumode/gpumode_setup.md)) |
| Saturn (RVV) | `built:saturn-rvv` | `saturn` ([saturn_setup.md](autocomp/backend/saturn/saturn_setup.md)) |
| Saturn + XNNPACK (RVV) | `built:saturn-rvv` | `xnnpack` ([xnnpack_setup.md](autocomp/backend/xnnpack/xnnpack_setup.md)) |

Partially supported hardware targets:
- RISC-V Vector (RVV) on Canaan Kendryte K230. See `k230` branch for code. As the implementation is very hacky, we do not currently recommend using this hardware target.
@@ -193,11 +195,13 @@ The most important parameters are:
- `TpuHardwareConfig("v6e-1")`
- `GemminiHardwareConfig(pe_dim=16, spad_size_kb=256, acc_size_kb=64)`
- `CudaHardwareConfig("NVIDIA L40S", "2.5.0", "12.4")`
- `SaturnHardwareConfig(vlen=512, dlen=256)`

**Evaluation Backend**
- `backend_name`: The evaluation backend to use. Currently supported values are `trn`, `tpu`, `gemmini`, `kernelbench`, and `gpumode`.
- `backend_name`: The evaluation backend to use. Currently supported values are `trn`, `tpu`, `gemmini`, `kernelbench`, `gpumode`, `saturn`, and `xnnpack`.
- `simulator`: The evaluation method to use, if the backend supports multiple; for backends with only one evaluation method, put `None`.
- For Gemmini, `spike` (only optimizes instruction counts, not cycle counts) or `firesim`
- For Saturn/XNNPACK, `spike` or `firesim`
- For CUDA/GPU MODE, `gpumode-local` or `gpumode-cli`

**Benchmark**
@@ -207,6 +211,8 @@ The most important parameters are:
- For Gemmini, `gemm`, `conv`, or `admm-multifunction`.
- For CUDA/KernelBench, `kb-level1`, `kb-level2`, `kb-level3`, or `kb-level4`.
- For CUDA/GPU MODE, `gpumode`.
- For Saturn, `rvv-f32` or `rvv-qs8`.
- For XNNPACK, `xnnpack-f32`.
- `prob_id`: The problem ID to use.
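
Put together, a Saturn run configuration can be sketched as follows. This is a hypothetical illustration: `backend_name`, `simulator`, `prob_id`, and `SaturnHardwareConfig(vlen=512, dlen=256)` come from the lists above, but the `suite` key name and the overall dict shape are assumptions, and the real `search.py` interface may differ.

```python
# Hypothetical run configuration; the exact search.py parameter names may differ.
saturn_run_config = {
    "hardware": "SaturnHardwareConfig(vlen=512, dlen=256)",  # hardware config from above
    "backend_name": "saturn",  # or "xnnpack" for the XNNPACK backend
    "simulator": "spike",      # Saturn/XNNPACK support "spike" or "firesim"
    "suite": "rvv-f32",        # "rvv-f32" or "rvv-qs8" for Saturn ("suite" key name assumed)
    "prob_id": 0,              # placeholder problem ID
}

# DLEN must divide VLEN; here the base chime length VLEN/DLEN is 2 cycles.
vlen, dlen = 512, 256
assert vlen % dlen == 0
```

Swapping `backend_name` to `xnnpack` with the `xnnpack-f32` suite targets the XNNPACK backend using the same `built:saturn-rvv` agent.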

**Optimization Agent**
@@ -294,6 +300,8 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for more details on how to add tests and

## 📝 Changelog

**(4/9/2026)** Added [Saturn RVV](https://saturn-vectors.org/) as a new hardware target.

**(4/6/2026)** Renamed `tests/` to `harnesses/` and solution entry point from `test()` to `solution()` for clarity. Improved agent builder logging.

**(4/3/2026)** Added run metrics (runtime and tokens) and updated Trace Visualizer to be self-contained.
12 changes: 12 additions & 0 deletions autocomp/agent_builder/.built/saturn-rvv/agent_config.yaml
@@ -0,0 +1,12 @@
agent_name: saturn-rvv
version: '1.0'
built_at: '2026-03-31T20:55:15+00:00'
build:
  main_model: gemini-3.1-pro-preview
  light_model: gemini-3.1-flash-lite-preview
agent_scope: Optimizing RVV (RISC-V Vector) intrinsics kernel code on Saturn (all
  code written in RVV intrinsics, not RVV asm). The agent rewrites single-kernel
  source code for better performance.
sources:
- type: directory
  path: /scratch/kchern2/autocomp-demo/saturn-docs
description: Auto-generated agent config for saturn-rvv
25 changes: 25 additions & 0 deletions autocomp/agent_builder/.built/saturn-rvv/architecture.md
@@ -0,0 +1,25 @@
**Hardware Architecture Summary: Saturn RISC-V Vector (RVV) Microarchitecture**

**Overview and Programming Model**
Saturn is a parameterized, short-vector microarchitecture implementing the RISC-V Vector (RVV) extension, designed primarily for domain-specialized, DSP, and embedded cores. It employs a Decoupled Access-Execute (DAE) design where the Vector Load-Store Unit (VLSU) and Vector Datapath (VU) operate independently. Saturn executes vector instructions strictly post-commit relative to the scalar core, meaning vector instructions are non-speculative. It relies on efficient dynamic scheduling of short-chime vector instructions and limited out-of-order execution between independent sequencing paths, rather than costly register renaming or deep out-of-order scalar integration. The programming model uses standard RVV intrinsics, heavily leveraging dynamic vector length (`vl`), vector type (`vtype`), and register grouping (`LMUL`).

**Memory Hierarchy**
* **Standard Memory Interface:** The VLSU typically bypasses the scalar L1 cache to access a coherent backing memory or a high-bandwidth, software-managed Tightly-Coupled Memory (TCM). It processes unit-stride accesses at full memory bandwidth.
* **Scatter-Gather TCM (SGTCM):** For high-throughput indexed (scatter/gather) accesses, Saturn can integrate a specialized, non-cacheable, deeply-banked SGTCM with parallel byte-wide ports. Without SGTCM, standard strided and indexed memory operations are bottlenecked to generating only one element address per cycle.
* **Segmented Accesses:** Saturn features dedicated double-buffered segment buffers (LSB/SSB) that efficiently handle segmented loads/stores (e.g., `vlseg`, `vsseg`). These instructions perform on-the-fly array-of-structs to struct-of-arrays repacking and are highly optimized, generally saturating memory bandwidth.
* **Memory Disambiguation:** Hardware performs precise early-stage scalar-vector and vector-vector memory disambiguation. However, vector memory instructions cannot begin execution if there are pending older scalar stores in the scalar store buffer.

**Compute Units**
* **Datapath Width (DLEN):** The fundamental unit of compute and register access is the "element group," which is `DLEN` bits wide. The hardware processes 1 element group per cycle, regardless of element width (ELEN).
* **Chime Length:** The base occupancy of a vector instruction is `VLEN/DLEN` cycles. Using register grouping (`LMUL`) extends this to `LMUL * (VLEN/DLEN)` cycles.
* **Sequencers:** The backend is divided into independent, single-issue, in-order sequencers: Load (VLS), Store (VSS), Execute/Arithmetic (VXS), and Special (VPS - for index generation, slides, gathers, compress, reductions).
* **Functional Units:** FUs are pipelined but lack direct FU-to-FU bypasses. Typical latencies: Integer ALU/Shift/Bitmanip (1-2 stages), Integer Multiply (3 stages), FMA (4 stages). Divide and square root use iterative, non-pipelined units.
* **Issue Topologies:** Depending on the specific Saturn configuration, integer and floating-point operations may share a single sequencer (Unified), use separate sequencers fed by a shared queue (Shared), or use fully independent sequencers and queues (Split/Multi-ALU).

**Key Constraints and Code Optimization Guidelines**
* **Maximize LMUL:** Because Saturn is a short-vector machine, low `LMUL` (e.g., 1) results in very short chimes (e.g., 2-4 cycles), which can expose pipeline latencies (e.g., a 4-stage FMA will stall dependent instructions if the chime is only 2 cycles). Always use the largest `LMUL` possible that avoids vector register spilling to increase chime length, hide pipeline latencies, and reduce scalar instruction fetch pressure.
* **Leverage Chaining via Instruction Interleaving:** Saturn supports vector chaining at the `DLEN` (element-group) granularity through the vector register file. Because sequencers are in-order, chaining only occurs between instructions occupying *different* sequencers (e.g., a load chaining into an arithmetic operation). Interleave independent memory and arithmetic intrinsics to maximize concurrent sequencer utilization.
* **Avoid Vector-to-Scalar Writes in Inner Loops:** Because vector instructions execute post-commit, any vector instruction that writes to a scalar register (e.g., `vfmv.f.s`, or vector reductions yielding a scalar) will cause a Read-After-Write (RAW) hazard that severely stalls the scalar pipeline. Keep reductions and scalar extractions outside of performance-critical inner loops.
* **Minimize `vsetvl` Bubbles:** Depending on the host scalar core (e.g., Rocket), changing `vtype` or `vl` can introduce pipeline bubbles. Group operations of the same element width and LMUL together to minimize the frequency of `vsetvl` transitions.
* **Prefer Segmented Loads over Manual Repacking:** Use RVV segmented load/store intrinsics for interleaved data (like complex numbers or RGB pixels) rather than loading raw vectors and manually permuting them, as Saturn's segment buffers handle this at near full memory bandwidth.
* **Avoid Standard Strided/Indexed Accesses:** Unless the target system explicitly features an SGTCM, avoid `vlsseg` (strided) and `vluxei` (indexed) intrinsics, as they execute at a slow rate of 1 element per cycle.
239 changes: 239 additions & 0 deletions autocomp/agent_builder/.built/saturn-rvv/code_examples.md
@@ -0,0 +1,239 @@
## rvv-intrinsic-examples.md

SUMMARY: This document provides a collection of C code examples demonstrating the use of RISC-V Vector (RVV) intrinsics to implement common computational kernels, including memory operations, arithmetic, matrix multiplication, string manipulation, and conditional control flow.

```c
#include <riscv_vector.h>

void *memcpy_rvv(void *restrict destination, const void *restrict source,
                 size_t n) {
  unsigned char *dst = destination;
  const unsigned char *src = source;
  // Copy the data in vector-length-sized chunks of bytes
  for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
    vl = __riscv_vsetvl_e8m8(n);
    // Load src[0..vl)
    vuint8m8_t vec_src = __riscv_vle8_v_u8m8(src, vl);
    // Store dst[0..vl)
    __riscv_vse8_v_u8m8(dst, vec_src, vl);
  }
  return destination;
}
```

```c
void saxpy_rvv(size_t n, const float a, const float *x, float *y) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m8(n);
    // Load x[i..i+vl)
    vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
    // Load y[i..i+vl)
    vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
    // Compute vy[0..vl) + a*vx[0..vl)
    // and store it in y[i..i+vl)
    __riscv_vse32_v_f32m8(y, __riscv_vfmacc_vf_f32m8(vy, a, vx, vl), vl);
  }
}
```

```c
void matmul_rvv(double *a, double *b, double *c, int n, int m, int p) {
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j) {
      double *ptr_a = &a[i * p];
      double *ptr_b = &b[j];
      int k = p;
      // Set accumulator to zero.
      vfloat64m1_t vec_s = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
      vfloat64m1_t vec_zero = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
      for (size_t vl; k > 0; k -= vl, ptr_a += vl, ptr_b += vl * m) {
        vl = __riscv_vsetvl_e64m1(k);

        // Load row a[i][k..k+vl)
        vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(ptr_a, vl);
        // Load column b[k..k+vl)[j]
        vfloat64m1_t vec_b =
            __riscv_vlse64_v_f64m1(ptr_b, sizeof(double) * m, vl);

        // Accumulate dot product of row and column. If vl < vlmax we need to
        // preserve the existing values of vec_s, hence the tu policy.
        vec_s = __riscv_vfmacc_vv_f64m1_tu(vec_s, vec_a, vec_b, vl);
      }

      // Final accumulation.
      vfloat64m1_t vec_sum =
          __riscv_vfredusum_vs_f64m1_f64m1(vec_s, vec_zero, vlmax);
      double sum = __riscv_vfmv_f_s_f64m1_f64(vec_sum);
      c[i * m + j] = sum;
    }
}
```

```c
char *strcpy_rvv(char *destination, const char *source) {
  unsigned char *dst = (unsigned char *)destination;
  const unsigned char *src = (const unsigned char *)source;
  size_t vlmax = __riscv_vsetvlmax_e8m8();
  long first_set_bit = -1;

  // This loop stops when among the loaded bytes we find the null byte
  // of the string, i.e., when first_set_bit >= 0
  for (size_t vl; first_set_bit < 0; src += vl, dst += vl) {
    // Load up to vlmax elements if possible.
    vuint8m8_t vec_src = __riscv_vle8ff_v_u8m8(src, &vl, vlmax);

    // Mask that states where null bytes are in the loaded bytes.
    vbool1_t string_terminate = __riscv_vmseq_vx_u8m8_b1(vec_src, 0, vl);

    // If the null byte is not among the loaded bytes the resulting mask
    // will be all ones; otherwise only the elements up to and including
    // the first null byte will be enabled.
    vbool1_t mask = __riscv_vmsif_m_b1(string_terminate, vl);

    // Store the enabled elements as determined by the mask above.
    __riscv_vse8_v_u8m8_m(mask, dst, vec_src, vl);

    // Determine if we found the null byte in the loaded bytes.
    first_set_bit = __riscv_vfirst_m_b1(string_terminate, vl);
  }
  return destination;
}
```

```c
void branch_rvv(double *a, double *b, double *c, int n, double constant) {
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  vfloat64m1_t vec_constant = __riscv_vfmv_v_f_f64m1(constant, vlmax);
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl, c += vl) {
    vl = __riscv_vsetvl_e64m1(n);

    // Load a[i..i+vl)
    vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(a, vl);
    // Load b[i..i+vl)
    vfloat64m1_t vec_b = __riscv_vle64_v_f64m1(b, vl);

    // Compute a mask whose enabled elements will correspond to the
    // elements of b that are not zero.
    vbool64_t mask = __riscv_vmfne_vf_f64m1_b64(vec_b, 0.0, vl);

    // Use the mask-undisturbed policy to compute the division for the
    // elements enabled in the mask, otherwise set them to the given
    // constant above (maskedoff).
    vfloat64m1_t vec_c = __riscv_vfdiv_vv_f64m1_mu(
        mask, /*maskedoff*/ vec_constant, vec_a, vec_b, vl);

    // Store into c[i..i+vl)
    __riscv_vse64_v_f64m1(c, vec_c, vl);
  }
}
```

```c
void reduce_rvv(double *a, double *b, double *result_sum, int *result_count,
                int n) {
  int count = 0;
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  vfloat64m1_t vec_zero = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
  vfloat64m1_t vec_s = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl) {
    vl = __riscv_vsetvl_e64m1(n);

    // Load a[i..i+vl)
    vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(a, vl);
    // Load b[i..i+vl)
    vfloat64m1_t vec_b = __riscv_vle64_v_f64m1(b, vl);

    // Compute a mask whose enabled elements will correspond to the
    // elements of a that are not 42.
    vbool64_t mask = __riscv_vmfne_vf_f64m1_b64(vec_a, 42.0, vl);

    // vec_s[e] ← vec_s[e] + vec_a[e] * vec_b[e], if mask[e] is enabled
    vec_s = __riscv_vfmacc_vv_f64m1_tumu(mask, vec_s, vec_a, vec_b, vl);

    // Adds to count the number of elements in mask that are enabled.
    count += __riscv_vcpop_m_b64(mask, vl);
  }

  // Final accumulation.
  vfloat64m1_t vec_sum =
      __riscv_vfredusum_vs_f64m1_f64m1(vec_s, vec_zero, vlmax);
  double sum = __riscv_vfmv_f_s_f64m1_f64(vec_sum);

  // Return values.
  *result_sum = sum;
  *result_count = count;
}
```

## rvvop.pdf:page_2

SUMMARY: This document provides optimization guidelines for RISC-V Vector (RVV) intrinsics, focusing on LMUL selection, instruction variant preferences, and efficient memory access patterns for various data structures.

```c
// Adding 1.0 to each element of an array of 32-bit floats
// (Note: these examples use the legacy, unprefixed RVV intrinsic names;
// current toolchains spell them with the __riscv_ prefix used above)
vfloat32m1_t vec = vle32_v_f32m1(ptr, vl);
vec = vfadd_vf_f32m1(vec, 1.0f, vl);
```

```c
// Broadcast 3 across all elements of the register group starting at v8
vint32m1_t v8 = vmv_v_x_i32m1(3, vl);
```

```c
// Splat alternating values of 0xaaaaaaaa and 0xbbbbbbbb into v2 using a masked merge
vint32m1_t v2 = vmv_v_x_i32m1(0xaaaaaaaa, vl);
vbool32_t mask = vmsne_vx_i32m1_b32(vindex, 0, vl); // Assuming vindex defines the pattern
v2 = vmerge_vxm_i32m1(v2, 0xbbbbbbbb, mask, vl);    // integer merge (vfmerge is the float variant)
```

```c
// Set the first element of a vector register to 2 and the remaining elements to 0
vint32m1_t v = vmv_v_x_i32m1(0, vl); // splat 0 (there is no vmv_v_i intrinsic)
v = vmv_s_x_i32m1(v, 2, vl);         // then write scalar 2 into element 0
```

```c
// Copying an array of bytes whose size is a multiple of the maximum vector
// length, using unit-stride loads/stores (stride-1 strided accesses would
// run at only one element per cycle)
// a0: destination, a1: source, a2: number of bytes
for (size_t vl; a2 > 0; a2 -= vl) {
  vl = vsetvlmax_e8m8();
  vint8m8_t data = vle8_v_i8m8(a1, vl);
  vse8_v_i8m8(a0, data, vl);
  a1 += vl;
  a0 += vl;
}
```

## rvvop.pdf:page_3

SUMMARY: This document demonstrates how to use RISC-V Vector (RVV) unit-stride segment load instructions to unpack interleaved RGB data into separate color channels for grayscale conversion. It highlights the performance benefits of using vector-vector (.vv) instructions over scalar-vector variants to minimize register transfer overhead.

```c
#include <riscv_vector.h>

void rgb_to_grayscale(const uint8_t *src, uint8_t *dst, size_t n) {
  for (size_t vl; n > 0; n -= vl) {
    vl = __riscv_vsetvl_e8m1(n);

    // Load interleaved RGB data into three separate vector registers
    vuint8m1x3_t rgb = __riscv_vlseg3e8_v_u8m1x3(src, vl);
    vuint8m1_t r = __riscv_vget_v_u8m1x3_u8m1(rgb, 0);
    vuint8m1_t g = __riscv_vget_v_u8m1x3_u8m1(rgb, 1);
    vuint8m1_t b = __riscv_vget_v_u8m1x3_u8m1(rgb, 2);

    // Compute grayscale as (R + G + B) / 3, widening to 16 bits so the
    // intermediate sum cannot overflow 8 bits
    vuint16m2_t sum = __riscv_vwaddu_vv_u16m2(r, g, vl);
    sum = __riscv_vwaddu_wv_u16m2(sum, b, vl);
    vuint8m1_t gray =
        __riscv_vncvt_x_x_w_u8m1(__riscv_vdivu_vx_u16m2(sum, 3, vl), vl);

    // Store the result using a unit-stride store
    __riscv_vse8_v_u8m1(dst, gray, vl);

    src += vl * 3;
    dst += vl;
  }
}
```