
[WebAssembly] Align bulk-memory thresholds #134816


Open
sparker-arm wants to merge 1 commit into main
Conversation

sparker-arm
Contributor

Use the same thresholds for memcpy/memmove/memset whether or not we are optimizing for size. The high-level nature of the bulk-memory instructions can allow runtimes to optimize them more readily.

@llvmbot
Member

llvmbot commented Apr 8, 2025

@llvm/pr-subscribers-backend-webassembly

Author: Sam Parker (sparker-arm)

Changes

Use the same thresholds for memcpy/memmove/memset whether or not we are optimizing for size. The high-level nature of the bulk-memory instructions can allow runtimes to optimize them more readily.
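
As a rough model of the decision this changes (a standalone sketch, not the actual SelectionDAG code; the limit of 4 and the helper names are inferred from the 32- vs 36-byte cases in the tests below, not from LLVM's generic defaults):

#include <cstdio>

// Per-target expansion limits, mirroring TargetLowering's MaxStoresPerMemcpy
// and MaxStoresPerMemcpyOptSize members.
struct Limits {
  unsigned maxStores;        // limit when optimizing for speed
  unsigned maxStoresOptSize; // limit under -Os/-Oz
};

// Stores of the widest legal type (i64, 8 bytes on wasm) needed for `bytes`.
static unsigned storesNeeded(unsigned bytes) { return (bytes + 7) / 8; }

static const char *lowerMemcpy(unsigned bytes, const Limits &l, bool optSize) {
  unsigned limit = optSize ? l.maxStoresOptSize : l.maxStores;
  return storesNeeded(bytes) > limit ? "memory.copy"
                                     : "inline i64 load/store pairs";
}

int main() {
  Limits aligned{4, 4}; // after this change both limits match
  std::printf("32 bytes -> %s\n", lowerMemcpy(32, aligned, /*optSize=*/false));
  std::printf("36 bytes -> %s\n", lowerMemcpy(36, aligned, /*optSize=*/false));
  return 0;
}

With the limits aligned, a 32-byte fixed-size memcpy still expands to four i64 load/store pairs while a 36-byte one becomes a single memory.copy, in both speed and size builds.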


Full diff: https://github.com/llvm/llvm-project/pull/134816.diff

3 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp (+11)
  • (modified) llvm/test/CodeGen/WebAssembly/bulk-memory.ll (+118)
  • (modified) llvm/test/CodeGen/WebAssembly/bulk-memory64.ll (+118)
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
index 794db887bd073..b733312d855bf 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
@@ -394,6 +394,17 @@ WebAssemblyTargetLowering::WebAssemblyTargetLowering(
   // is equivalent to a simple branch. This reduces code size for wasm, and we
   // defer possible jump table optimizations to the VM.
   setMinimumJumpTableEntries(2);
+
+  // Align bulk memory usage when optimizing for size or otherwise. As well as
+  // reducing code size, preferring high-level primitives can make it easier
+  // for runtimes to make optimizations, especially when explicit bounds
+  // checking is employed.
+  if (Subtarget->hasBulkMemory()) {
+    MaxStoresPerMemset = MaxStoresPerMemsetOptSize;
+    MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize;
+    MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize;
+    MaxLoadsPerMemcmp = MaxLoadsPerMemcmpOptSize;
+  }
 }
 
 MVT WebAssemblyTargetLowering::getPointerTy(const DataLayout &DL,
diff --git a/llvm/test/CodeGen/WebAssembly/bulk-memory.ll b/llvm/test/CodeGen/WebAssembly/bulk-memory.ll
index ae170d757a305..d154c44856f8b 100644
--- a/llvm/test/CodeGen/WebAssembly/bulk-memory.ll
+++ b/llvm/test/CodeGen/WebAssembly/bulk-memory.ll
@@ -29,6 +29,27 @@ define void @memcpy_i8(ptr %dest, ptr %src, i8 zeroext %len) {
   ret void
 }
 
+; CHECK-LABEL: memcpy_i8_fixed_32
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+define void @memcpy_i8_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i8(ptr %dest, ptr %src, i8 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memcpy_i8_fixed_36
+; BULK-MEM: memory.copy
+define void @memcpy_i8_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i8(ptr %dest, ptr %src, i8 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memmove_i8:
 ; NO-BULK-MEM-NOT: memory.copy
 ; BULK-MEM-NEXT: .functype memmove_i8 (i32, i32, i32) -> ()
@@ -44,6 +65,27 @@ define void @memmove_i8(ptr %dest, ptr %src, i8 zeroext %len) {
   ret void
 }
 
+; CHECK-LABEL: memmove_i8_fixed_32
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memmove_i8_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i8(ptr %dest, ptr %src, i8 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memmove_i8_fixed_36
+; BULK-MEM: memory.copy
+define void @memmove_i8_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i8(ptr %dest, ptr %src, i8 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memset_i8:
 ; NO-BULK-MEM-NOT: memory.fill
 ; BULK-MEM-NEXT: .functype memset_i8 (i32, i32, i32) -> ()
@@ -59,6 +101,23 @@ define void @memset_i8(ptr %dest, i8 %val, i8 zeroext %len) {
   ret void
 }
 
+; CHECK-LABEL: memset_i8_fixed_32
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memset_i8_fixed_32(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i8(ptr %dest, i8 %val, i8 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memset_i8_fixed_36
+; BULK-MEM: memory.fill
+define void @memset_i8_fixed_36(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i8(ptr %dest, i8 %val, i8 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memcpy_i32:
 ; NO-BULK-MEM-NOT: memory.copy
 ; BULK-MEM-NEXT: .functype memcpy_i32 (i32, i32, i32) -> ()
@@ -74,6 +133,27 @@ define void @memcpy_i32(ptr %dest, ptr %src, i32 %len) {
   ret void
 }
 
+; CHECK-LABEL: memcpy_i32_fixed_32
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+define void @memcpy_i32_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i32(ptr %dest, ptr %src, i32 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memcpy_i32_fixed_36
+; BULK-MEM: memory.copy
+define void @memcpy_i32_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i32(ptr %dest, ptr %src, i32 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memmove_i32:
 ; NO-BULK-MEM-NOT: memory.copy
 ; BULK-MEM-NEXT: .functype memmove_i32 (i32, i32, i32) -> ()
@@ -89,6 +169,27 @@ define void @memmove_i32(ptr %dest, ptr %src, i32 %len) {
   ret void
 }
 
+; CHECK-LABEL: memmove_i32_fixed_32
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memmove_i32_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i32(ptr %dest, ptr %src, i32 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memmove_i32_fixed_36
+; BULK-MEM: memory.copy
+define void @memmove_i32_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i32(ptr %dest, ptr %src, i32 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memset_i32:
 ; NO-BULK-MEM-NOT: memory.fill
 ; BULK-MEM-NEXT: .functype memset_i32 (i32, i32, i32) -> ()
@@ -104,6 +205,23 @@ define void @memset_i32(ptr %dest, i8 %val, i32 %len) {
   ret void
 }
 
+; CHECK-LABEL: memset_i32_fixed_32
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memset_i32_fixed_32(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i32(ptr %dest, i8 %val, i32 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memset_i32_fixed_36
+; BULK-MEM: memory.fill
+define void @memset_i32_fixed_36(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i32(ptr %dest, i8 %val, i32 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memcpy_1:
 ; CHECK-NEXT: .functype memcpy_1 (i32, i32) -> ()
 ; CHECK-NEXT: i32.load8_u $push[[L0:[0-9]+]]=, 0($1)
diff --git a/llvm/test/CodeGen/WebAssembly/bulk-memory64.ll b/llvm/test/CodeGen/WebAssembly/bulk-memory64.ll
index 0cf8493a995f9..910e7ac5c96c4 100644
--- a/llvm/test/CodeGen/WebAssembly/bulk-memory64.ll
+++ b/llvm/test/CodeGen/WebAssembly/bulk-memory64.ll
@@ -31,6 +31,27 @@ define void @memcpy_i8(ptr %dest, ptr %src, i8 zeroext %len) {
   ret void
 }
 
+; CHECK-LABEL: memcpy_i8_fixed_32
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+define void @memcpy_i8_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i8(ptr %dest, ptr %src, i8 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memcpy_i8_fixed_36
+; BULK-MEM: memory.copy
+define void @memcpy_i8_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i8(ptr %dest, ptr %src, i8 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memmove_i8:
 ; NO-BULK-MEM-NOT: memory.copy
 ; BULK-MEM-NEXT: .functype memmove_i8 (i64, i64, i32) -> ()
@@ -48,6 +69,27 @@ define void @memmove_i8(ptr %dest, ptr %src, i8 zeroext %len) {
   ret void
 }
 
+; CHECK-LABEL: memmove_i8_fixed_32
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memmove_i8_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i8(ptr %dest, ptr %src, i8 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memmove_i8_fixed_36
+; BULK-MEM: memory.copy
+define void @memmove_i8_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i8(ptr %dest, ptr %src, i8 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memset_i8:
 ; NO-BULK-MEM-NOT: memory.fill
 ; BULK-MEM-NEXT: .functype memset_i8 (i64, i32, i32) -> ()
@@ -65,6 +107,23 @@ define void @memset_i8(ptr %dest, i8 %val, i8 zeroext %len) {
   ret void
 }
 
+; CHECK-LABEL: memset_i8_fixed_32
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memset_i8_fixed_32(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i8(ptr %dest, i8 %val, i8 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memset_i8_fixed_36
+; BULK-MEM: memory.fill
+define void @memset_i8_fixed_36(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i8(ptr %dest, i8 %val, i8 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memcpy_i32:
 ; NO-BULK-MEM-NOT: memory.copy
 ; BULK-MEM-NEXT: .functype memcpy_i32 (i64, i64, i64) -> ()
@@ -80,6 +139,27 @@ define void @memcpy_i32(ptr %dest, ptr %src, i64 %len) {
   ret void
 }
 
+; CHECK-LABEL: memcpy_i32_fixed_32
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.load
+; CHECK: i64.store
+define void @memcpy_i32_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i32(ptr %dest, ptr %src, i32 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memcpy_i32_fixed_36
+; BULK-MEM: memory.copy
+define void @memcpy_i32_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memcpy.p0.p0.i32(ptr %dest, ptr %src, i32 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memmove_i32:
 ; NO-BULK-MEM-NOT: memory.copy
 ; BULK-MEM-NEXT: .functype memmove_i32 (i64, i64, i64) -> ()
@@ -95,6 +175,27 @@ define void @memmove_i32(ptr %dest, ptr %src, i64 %len) {
   ret void
 }
 
+; CHECK-LABEL: memmove_i32_fixed_32
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.load
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memmove_i32_fixed_32(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i32(ptr %dest, ptr %src, i32 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memmove_i32_fixed_36
+; BULK-MEM: memory.copy
+define void @memmove_i32_fixed_36(ptr %dest, ptr %src) {
+  call void @llvm.memmove.p0.p0.i32(ptr %dest, ptr %src, i32 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memset_i32:
 ; NO-BULK-MEM-NOT: memory.fill
 ; BULK-MEM-NEXT: .functype memset_i32 (i64, i32, i64) -> ()
@@ -110,6 +211,23 @@ define void @memset_i32(ptr %dest, i8 %val, i64 %len) {
   ret void
 }
 
+; CHECK-LABEL: memset_i32_fixed_32
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+; CHECK: i64.store
+define void @memset_i32_fixed_32(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i32(ptr %dest, i8 %val, i32 32, i1 0)
+  ret void
+}
+
+; CHECK-LABEL: memset_i32_fixed_36
+; BULK-MEM: memory.fill
+define void @memset_i32_fixed_36(ptr %dest, i8 %val) {
+  call void @llvm.memset.p0.i32(ptr %dest, i8 %val, i32 36, i1 0)
+  ret void
+}
+
 ; CHECK-LABEL: memcpy_1:
 ; CHECK-NEXT: .functype memcpy_1 (i64, i64) -> ()
 ; CHECK-NEXT: i32.load8_u $push[[L0:[0-9]+]]=, 0($1)

@dschuff
Member

dschuff commented Apr 8, 2025

Have you actually run any tests on any of the engines? By default, Emscripten only uses bulk memory for larger memcpys because there's a bit of overhead compared to just using loads and stores for small copies. It was my intention to do some more comprehensive testing (especially with SIMD available), but I hadn't gotten around to it.

@sparker-arm
Contributor Author

Yes, I ran my benchmark suite for V8 on x86_64 and AArch64 Linux. This is actually slightly beneficial for AArch64 (~1%) and a slight regression for x64 (~1%). The difference is because explicit bounds checks are still used for most stores on AArch64 Linux.

The only inherent overhead of these operations is the bounds checks before the memory accesses, and given that the high-level nature of the instructions should allow the use of wide vector extensions, such as AVX, I expect that small overhead can be overcome. I'm currently looking at ways of optimising this in V8.

My setup is all WASI-based, so I can't get numbers for memory64, but I have a strong suspicion that this change would help there too: explicit bounds checks are likely required more often, so the overhead of bulk memory ends up being far smaller than that of a sequence of loads and stores.
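
To illustrate the bounds-check point above with a standalone sketch (illustrative counts only, not engine code): with explicit bounds checking, each load and store produced by inline expansion is checked individually, while a single memory.copy only needs to validate its source and destination ranges once.

#include <cstdio>

// i64 (8-byte) load/store pairs needed to copy `bytes` inline.
static unsigned copyPairs(unsigned bytes) { return (bytes + 7) / 8; }

int main() {
  for (unsigned bytes : {32u, 64u, 128u}) {
    unsigned pairs = copyPairs(bytes);
    unsigned inlineChecks = 2 * pairs; // one check per load and per store
    unsigned bulkChecks = 2;           // one range check per memcpy operand
    std::printf("%3u bytes: ~%2u inline checks vs ~%u for memory.copy\n",
                bytes, inlineChecks, bulkChecks);
  }
  return 0;
}

The gap widens with size, which is consistent with the observation that bulk memory helps most where explicit bounds checks are required.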
