RVV integration with saturn and xnnpack #42
Merged
42 commits:
- 3ca0652 add skeleton for rvv/saturn stuff (charleshong3)
- a9f158f added saturn backend with working spike and firesim, also added a qs8… (keaganchern)
- 7405f69 implement saturn rvv agent (charleshong3)
- 9320655 change rvv agent to saturn agent (charleshong3)
- 2122675 Merge branch 'rvv-agent' into rvv (charleshong3)
- a831d9b Merge branch 'rvv' of github.com:ucb-bar/autocomp into rvv (charleshong3)
- f88380a update search.py to integrate saturn backend (charleshong3)
- 94c4b43 delete rvv agent (charleshong3)
- bf0434a added Zephyr setup for saturn setup (keaganchern)
- 4cea5c8 updated firesim setup (keaganchern)
- 914d429 update saturn eval docs (charleshong3)
- fbd5e5e update paths (charleshong3)
- 6456577 update saturn backend (charleshong3)
- a51b990 added double curly for f string complile error (keaganchern)
- 0d1943d added latency to result for compatibility (keaganchern)
- 43968a2 updated noinline for concat firesim (keaganchern)
- 2c7ab59 redid parsing and also xnnpack can compile in zephyr now (keaganchern)
- a5c07fb updated saturn_setup.md
- 6b857ec updated saturn_setup.md (huijae-an)
- c4c67ca another update to saturn_setup.md (huijae-an)
- db467ca merge main into rvv (charleshong3)
- 62b44b1 added f32 test (keaganchern)
- 3290e12 Merge branch 'rvv-3' into rvv (keaganchern)
- 13960ca fixes (charleshong3)
- d9ae003 fix cmakelists (charleshong3)
- 2c75caa fixes (charleshong3)
- 2085fe7 added better para (keaganchern)
- 6552cb3 build and spike together
- 61e5f2e fix saturn_isa_generator to not give llm unnecessary work (charleshong3)
- 83ad13a redefined functions from prob (keaganchern)
- 6f02811 split common and xnnpack to 2 specific backends
- 0c26210 merge main into rvv
- 5bd9eb0 merging main into rvv
- f16d434 updating saturn and xnnpack, ready for upstream
- ae0771a merging main into rvv before upstream
- 8f9f0d1 readme with saturn
- dda7a1b editing file names, docs, and translation menu
- b8ea854 Update README with new links for hardware targets (charleshong3)
- 0ced933 updating pathing issues
- 0fb2ef1 Merge branch 'rvv' of github.com:ucb-bar/autocomp into rvv
- bfeee9b Revise Saturn target names and update changelog (charleshong3)
- 75d1ad3 Consolidate Saturn (RVV) and XNNPACK entries in README (charleshong3)
```
@@ -177,4 +177,4 @@ output/
wandb/
config/
tmp*/
temp*/
```
**autocomp/agent_builder/.built/saturn-rvv/agent_config.yaml** (12 additions, 0 deletions)
```yaml
agent_name: saturn-rvv
version: '1.0'
built_at: '2026-03-31T20:55:15+00:00'
build:
  main_model: gemini-3.1-pro-preview
  light_model: gemini-3.1-flash-lite-preview
agent_scope: Optimizing RVV (RISC-V Vector) intrinsics kernel code on Saturn (all
  code written in RVV intrinsics and not in RVV asm). The agent rewrites
  single-kernel source code for better performance.
sources:
- type: directory
  path: /scratch/kchern2/autocomp-demo/saturn-docs
description: Auto-generated agent config for saturn-rvv
```
**Hardware Architecture Summary: Saturn RISC-V Vector (RVV) Microarchitecture**

**Overview and Programming Model**
Saturn is a parameterized, short-vector microarchitecture implementing the RISC-V Vector (RVV) extension, designed primarily for domain-specialized, DSP, and embedded cores. It employs a Decoupled Access-Execute (DAE) design where the Vector Load-Store Unit (VLSU) and Vector Datapath (VU) operate independently. Saturn executes vector instructions strictly post-commit relative to the scalar core, meaning vector instructions are non-speculative. It relies on efficient dynamic scheduling of short-chime vector instructions and limited out-of-order execution between independent sequencing paths, rather than costly register renaming or deep out-of-order scalar integration. The programming model uses standard RVV intrinsics, heavily leveraging dynamic vector length (`vl`), vector type (`vtype`), and register grouping (`LMUL`).

**Memory Hierarchy**
* **Standard Memory Interface:** The VLSU typically bypasses the scalar L1 cache to access a coherent backing memory or a high-bandwidth, software-managed Tightly-Coupled Memory (TCM). It processes unit-stride accesses at full memory bandwidth.
* **Scatter-Gather TCM (SGTCM):** For high-throughput indexed (scatter/gather) accesses, Saturn can integrate a specialized, non-cacheable, deeply-banked SGTCM with parallel byte-wide ports. Without SGTCM, standard strided and indexed memory operations are bottlenecked to generating only one element address per cycle.
* **Segmented Accesses:** Saturn features dedicated double-buffered segment buffers (LSB/SSB) that efficiently handle segmented loads/stores (e.g., `vlseg`, `vsseg`). These instructions perform on-the-fly array-of-structs to struct-of-arrays repacking and are highly optimized, generally saturating memory bandwidth.
* **Memory Disambiguation:** Hardware performs precise early-stage scalar-vector and vector-vector memory disambiguation. However, vector memory instructions cannot begin execution if there are pending older scalar stores in the scalar store buffer.

**Compute Units**
* **Datapath Width (DLEN):** The fundamental unit of compute and register access is the "element group," which is `DLEN` bits wide. The hardware processes 1 element group per cycle, regardless of element width (ELEN).
* **Chime Length:** The base occupancy of a vector instruction is `VLEN/DLEN` cycles. Using register grouping (`LMUL`) extends this to `LMUL * (VLEN/DLEN)` cycles.
* **Sequencers:** The backend is divided into independent, single-issue, in-order sequencers: Load (VLS), Store (VSS), Execute/Arithmetic (VXS), and Special (VPS - for index generation, slides, gathers, compress, reductions).
* **Functional Units:** FUs are pipelined but lack direct FU-to-FU bypasses. Typical latencies: Integer ALU/Shift/Bitmanip (1-2 stages), Integer Multiply (3 stages), FMA (4 stages). Divide and square root use iterative, non-pipelined units.
* **Issue Topologies:** Depending on the specific Saturn configuration, integer and floating-point operations may share a single sequencer (Unified), use separate sequencers fed by a shared queue (Shared), or use fully independent sequencers and queues (Split/Multi-ALU).

**Key Constraints and Code Optimization Guidelines**
* **Maximize LMUL:** Because Saturn is a short-vector machine, low `LMUL` (e.g., 1) results in very short chimes (e.g., 2-4 cycles), which can expose pipeline latencies (e.g., a 4-stage FMA will stall dependent instructions if the chime is only 2 cycles). Always use the largest `LMUL` possible that avoids vector register spilling to increase chime length, hide pipeline latencies, and reduce scalar instruction fetch pressure.
* **Leverage Chaining via Instruction Interleaving:** Saturn supports vector chaining at the `DLEN` (element-group) granularity through the vector register file. Because sequencers are in-order, chaining only occurs between instructions occupying *different* sequencers (e.g., a load chaining into an arithmetic operation). Interleave independent memory and arithmetic intrinsics to maximize concurrent sequencer utilization.
* **Avoid Vector-to-Scalar Writes in Inner Loops:** Because vector instructions execute post-commit, any vector instruction that writes to a scalar register (e.g., `vfmv.f.s`, or vector reductions yielding a scalar) will cause a Read-After-Write (RAW) hazard that severely stalls the scalar pipeline. Keep reductions and scalar extractions outside of performance-critical inner loops.
* **Minimize `vsetvl` Bubbles:** Depending on the host scalar core (e.g., Rocket), changing `vtype` or `vl` can introduce pipeline bubbles. Group operations of the same element width and LMUL together to minimize the frequency of `vsetvl` transitions.
* **Prefer Segmented Loads over Manual Repacking:** Use RVV segmented load/store intrinsics for interleaved data (like complex numbers or RGB pixels) rather than loading raw vectors and manually permuting them, as Saturn's segment buffers handle this at near full memory bandwidth.
* **Avoid Standard Strided/Indexed Accesses:** Unless the target system explicitly features an SGTCM, avoid `vlsseg` (strided) and `vluxei` (indexed) intrinsics, as they execute at a slow rate of 1 element per cycle.
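The LMUL and interleaving guidelines above can be sketched as a minimal kernel shape. This is a hypothetical example, not taken from the Saturn docs: the function name `scale2_rvv` and the choice of two independent streams are illustrative; the intrinsics are the standard `__riscv_` v1.0 API.

```c
#include <stddef.h>
#include <riscv_vector.h>

// Hypothetical sketch: scale two independent arrays in one loop.
// - e32m8 gives the largest LMUL, so each instruction occupies a long
//   chime (LMUL * VLEN/DLEN cycles) that hides the 4-stage FMA/FMUL latency.
// - Both loads are issued before the arithmetic, so the in-order load (VLS)
//   and execute (VXS) sequencers run concurrently and chain through the
//   vector register file at element-group granularity.
// - vtype/vl never change inside the loop, so there is one vsetvl per
//   iteration and no extra vsetvl bubbles.
void scale2_rvv(size_t n, float a, float *x, float *y) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m8(n);
    vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl); // load stream 1
    vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl); // load stream 2 issues
                                                    // while vx is computed on
    vx = __riscv_vfmul_vf_f32m8(vx, a, vl);
    vy = __riscv_vfmul_vf_f32m8(vy, a, vl);
    __riscv_vse32_v_f32m8(x, vx, vl);
    __riscv_vse32_v_f32m8(y, vy, vl);
  }
}
```

Note that the sketch also respects the vector-to-scalar guideline: nothing in the loop body moves a vector result into a scalar register.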
**autocomp/agent_builder/.built/saturn-rvv/code_examples.md** (239 additions, 0 deletions)
## rvv-intrinsic-examples.md

SUMMARY: This document provides a collection of C code examples demonstrating the use of RISC-V Vector (RVV) intrinsics to implement common computational kernels, including memory operations, arithmetic, matrix multiplication, string manipulation, and conditional control flow.

```c
#include <riscv_vector.h>

void *memcpy_rvv(void *restrict destination, const void *restrict source,
                 size_t n) {
  unsigned char *dst = destination;
  const unsigned char *src = source;
  // copy data in chunks of up to vl bytes per iteration
  for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
    vl = __riscv_vsetvl_e8m8(n);
    // Load src[0..vl)
    vuint8m8_t vec_src = __riscv_vle8_v_u8m8(src, vl);
    // Store dst[0..vl)
    __riscv_vse8_v_u8m8(dst, vec_src, vl);
  }
  return destination;
}
```

```c
void saxpy_rvv(size_t n, const float a, const float *x, float *y) {
  for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
    vl = __riscv_vsetvl_e32m8(n);
    // Load x[i..i+vl)
    vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
    // Load y[i..i+vl)
    vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
    // Computes vy[0..vl) + a*vx[0..vl)
    // and stores it in y[i..i+vl)
    __riscv_vse32_v_f32m8(y, __riscv_vfmacc_vf_f32m8(vy, a, vx, vl), vl);
  }
}
```

```c
void matmul_rvv(double *a, double *b, double *c, int n, int m, int p) {
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j) {
      double *ptr_a = &a[i * p];
      double *ptr_b = &b[j];
      int k = p;
      // Set accumulator to zero.
      vfloat64m1_t vec_s = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
      vfloat64m1_t vec_zero = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
      for (size_t vl; k > 0; k -= vl, ptr_a += vl, ptr_b += vl * m) {
        vl = __riscv_vsetvl_e64m1(k);

        // Load row a[i][k..k+vl)
        vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(ptr_a, vl);
        // Load column b[k..k+vl)[j]
        vfloat64m1_t vec_b =
            __riscv_vlse64_v_f64m1(ptr_b, sizeof(double) * m, vl);

        // Accumulate dot product of row and column. If vl < vlmax we need to
        // preserve the existing values of vec_s, hence the tu policy.
        vec_s = __riscv_vfmacc_vv_f64m1_tu(vec_s, vec_a, vec_b, vl);
      }

      // Final accumulation.
      vfloat64m1_t vec_sum =
          __riscv_vfredusum_vs_f64m1_f64m1(vec_s, vec_zero, vlmax);
      double sum = __riscv_vfmv_f_s_f64m1_f64(vec_sum);
      c[i * m + j] = sum;
    }
}
```

```c
char *strcpy_rvv(char *destination, const char *source) {
  unsigned char *dst = (unsigned char *)destination;
  const unsigned char *src = (const unsigned char *)source;
  size_t vlmax = __riscv_vsetvlmax_e8m8();
  long first_set_bit = -1;

  // This loop stops when among the loaded bytes we find the null byte
  // of the string, i.e., when first_set_bit >= 0
  for (size_t vl; first_set_bit < 0; src += vl, dst += vl) {
    // Load up to vlmax elements if possible (fault-only-first sets vl).
    vuint8m8_t vec_src = __riscv_vle8ff_v_u8m8(src, &vl, vlmax);

    // Mask that states where null bytes are in the loaded bytes.
    vbool1_t string_terminate = __riscv_vmseq_vx_u8m8_b1(vec_src, 0, vl);

    // If the null byte is not in the loaded bytes the resulting mask will
    // be all ones; otherwise only the elements up to and including the
    // first null byte will be enabled.
    vbool1_t mask = __riscv_vmsif_m_b1(string_terminate, vl);

    // Store the enabled elements as determined by the mask above.
    __riscv_vse8_v_u8m8_m(mask, dst, vec_src, vl);

    // Determine if we found the null byte in the loaded bytes.
    first_set_bit = __riscv_vfirst_m_b1(string_terminate, vl);
  }
  return destination;
}
```

```c
void branch_rvv(double *a, double *b, double *c, int n, double constant) {
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  vfloat64m1_t vec_constant = __riscv_vfmv_v_f_f64m1(constant, vlmax);
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl, c += vl) {
    vl = __riscv_vsetvl_e64m1(n);

    // Load a[i..i+vl)
    vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(a, vl);
    // Load b[i..i+vl)
    vfloat64m1_t vec_b = __riscv_vle64_v_f64m1(b, vl);

    // Compute a mask whose enabled elements will correspond to the
    // elements of b that are not zero.
    vbool64_t mask = __riscv_vmfne_vf_f64m1_b64(vec_b, 0.0, vl);

    // Use mask undisturbed policy to compute the division for the
    // elements enabled in the mask, otherwise set them to the given
    // constant above (maskedoff).
    vfloat64m1_t vec_c = __riscv_vfdiv_vv_f64m1_mu(
        mask, /*maskedoff*/ vec_constant, vec_a, vec_b, vl);

    // Store into c[i..i+vl)
    __riscv_vse64_v_f64m1(c, vec_c, vl);
  }
}
```

```c
void reduce_rvv(double *a, double *b, double *result_sum, int *result_count,
                int n) {
  int count = 0;
  size_t vlmax = __riscv_vsetvlmax_e64m1();
  vfloat64m1_t vec_zero = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
  vfloat64m1_t vec_s = __riscv_vfmv_v_f_f64m1(0.0, vlmax);
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl) {
    vl = __riscv_vsetvl_e64m1(n);

    // Load a[i..i+vl)
    vfloat64m1_t vec_a = __riscv_vle64_v_f64m1(a, vl);
    // Load b[i..i+vl)
    vfloat64m1_t vec_b = __riscv_vle64_v_f64m1(b, vl);

    // Compute a mask whose enabled elements will correspond to the
    // elements of a that are not 42.
    vbool64_t mask = __riscv_vmfne_vf_f64m1_b64(vec_a, 42.0, vl);

    // vec_s[e] <- vec_s[e] + vec_a[e] * vec_b[e], if mask[e] is enabled
    vec_s = __riscv_vfmacc_vv_f64m1_tumu(mask, vec_s, vec_a, vec_b, vl);

    // Adds to count the number of elements in mask that are enabled.
    count += __riscv_vcpop_m_b64(mask, vl);
  }

  vfloat64m1_t vec_sum;
  // Final accumulation.
  vec_sum = __riscv_vfredusum_vs_f64m1_f64m1(vec_s, vec_zero, vlmax);
  double sum = __riscv_vfmv_f_s_f64m1_f64(vec_sum);

  // Return values.
  *result_sum = sum;
  *result_count = count;
}
```

## rvvop.pdf:page_2

SUMMARY: This document provides optimization guidelines for RISC-V Vector (RVV) intrinsics, focusing on LMUL selection, instruction variant preferences, and efficient memory access patterns for various data structures.

```c
// Adding 1.0 to each element of an array of 32-bit floats
// (Note: Example assumes standard RVV intrinsic naming conventions)
vfloat32m1_t vec = vle32_v_f32m1(ptr, vl);
vec = vfadd_vf_f32m1(vec, 1.0f, vl);
```

```c
// Broadcast 3 across all elements of the register group starting at v8
vint32m1_t v8 = vmv_v_x_i32m1(3, vl);
```

```c
// Splat alternating values of 0xaaaaaaaa and 0xbbbbbbbb into v2 using a masked merge
vint32m1_t v2 = vmv_v_x_i32m1(0xaaaaaaaa, vl);
vbool32_t mask = vmsne_vx_i32m1_b32(vindex, 0, vl); // Assuming vindex defines the pattern
v2 = vmerge_vxm_i32m1(mask, v2, 0xbbbbbbbb, vl);    // integer merge (vfmerge is float-only)
```

```c
// Set the first element of a vector register to 2 and the remaining elements to 0
vint32m1_t v = vmv_v_x_i32m1(0, vl); // splat 0 (the compiler selects vmv.v.i)
v = vmv_s_x_i32m1(v, 2, vl);
```

```c
// Copying an array of bytes whose size is a multiple of the maximum vector
// length, using full-length unit-stride loads/stores
// a0: destination, a1: source, a2: number of bytes
for (; a2 > 0; a2 -= vl) {
  vl = vsetvlmax_e8m8();
  vint8m8_t data = vle8_v_i8m8(a1, vl); // unit-stride load at full bandwidth
  vse8_v_i8m8(a0, data, vl);            // (a stride-1 vlse8/vsse8 pair would
                                        // crawl at one element per cycle)
  a1 += vl;
  a0 += vl;
}
```

## rvvop.pdf:page_3

SUMMARY: This document demonstrates how to use RISC-V Vector (RVV) unit-stride segment load instructions to unpack interleaved RGB data into separate color channels for grayscale conversion. It highlights the performance benefits of using vector-vector (.vv) instructions over scalar-vector variants to minimize register transfer overhead.

```c
#include <riscv_vector.h>

void rgb_to_grayscale(const uint8_t *src, uint8_t *dst, size_t n) {
  for (size_t vl; n > 0; n -= vl) {
    vl = __riscv_vsetvl_e8m1(n);

    // Load interleaved RGB data into three separate vector registers
    vuint8m1x3_t rgb = __riscv_vlseg3e8_v_u8m1x3(src, vl);
    vuint8m1_t r = __riscv_vget_v_u8m1x3_u8m1(rgb, 0);
    vuint8m1_t g = __riscv_vget_v_u8m1x3_u8m1(rgb, 1);
    vuint8m1_t b = __riscv_vget_v_u8m1x3_u8m1(rgb, 2);

    // Compute grayscale: (R + G + B) / 3 (simplified example)
    vuint8m1_t sum = __riscv_vadd_vv_u8m1(r, g, vl);
    sum = __riscv_vadd_vv_u8m1(sum, b, vl);
    vuint8m1_t gray = __riscv_vdivu_vx_u8m1(sum, 3, vl);

    // Store the result using a unit-stride store
    __riscv_vse8_v_u8m1(dst, gray, vl);

    src += vl * 3;
    dst += vl;
  }
}
```
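The packing direction works the same way with a unit-stride segment store. This is a sketch, not part of the PR: `planes_to_rgb` is a hypothetical name, and it assumes the v1.0 intrinsics `__riscv_vcreate_v_u8m1x3` and `__riscv_vsseg3e8_v_u8m1x3` are available in the toolchain.

```c
#include <stdint.h>
#include <riscv_vector.h>

// Sketch: interleave three separate channel planes back into RGB triples
// with a unit-stride segment store. Per the Saturn notes, segment stores go
// through the dedicated store segment buffer and run near full memory
// bandwidth, unlike strided stores.
void planes_to_rgb(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                   uint8_t *dst, size_t n) {
  for (size_t vl; n > 0; n -= vl, r += vl, g += vl, b += vl, dst += vl * 3) {
    vl = __riscv_vsetvl_e8m1(n);
    vuint8m1_t vr = __riscv_vle8_v_u8m1(r, vl);
    vuint8m1_t vg = __riscv_vle8_v_u8m1(g, vl);
    vuint8m1_t vb = __riscv_vle8_v_u8m1(b, vl);
    // Pack the three registers into a tuple and store them interleaved.
    vuint8m1x3_t rgb = __riscv_vcreate_v_u8m1x3(vr, vg, vb);
    __riscv_vsseg3e8_v_u8m1x3(dst, rgb, vl);
  }
}
```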