Description
In the [Performance Tips for Frontend Authors](https://llvm.org/docs/Frontend/PerformanceTips.html#avoid-creating-values-of-aggregate-type) section of the LLVM documentation, the following suggestion is presented:

> Avoid creating values of aggregate types (i.e. structs and arrays). In particular, avoid loading and storing them, or manipulating them with insertvalue and extractvalue instructions. Instead, only load and store individual fields of the aggregate.
>
> There are some exceptions to this rule:
>
> - It is fine to use values of aggregate type in global variable initializers.
> - It is fine to return structs, if this is done to represent the return of multiple values in registers.
> - It is fine to work with structs returned by LLVM intrinsics, such as the with.overflow family of intrinsics.
> - It is fine to use aggregate types without creating values. For example, they are commonly used in getelementptr instructions or attributes like sret.
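As a small illustration of the guidance (a hypothetical LLVM IR fragment, not taken from the Catalyst codebase), the discouraged and preferred styles look like:

```llvm
; Discouraged: build an aggregate value with insertvalue, then store it in one shot.
%agg0 = insertvalue { ptr, i64 } undef, ptr %p, 0
%agg1 = insertvalue { ptr, i64 } %agg0, i64 %n, 1
store { ptr, i64 } %agg1, ptr %slot

; Preferred: address each field with getelementptr and store scalars directly.
%f0 = getelementptr inbounds { ptr, i64 }, ptr %slot, i32 0, i32 0
store ptr %p, ptr %f0
%f1 = getelementptr inbounds { ptr, i64 }, ptr %slot, i32 0, i32 1
store i64 %n, ptr %f1
```

Both fragments leave the same bytes at `%slot`; only the second avoids materializing an aggregate SSA value.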
The goal of this issue is to evaluate the performance impact of this suggestion. Specifically, when lowering Hamiltonians, we first create an aggregate value and then store that aggregate value to an `alloca` pointer. We could instead avoid creating the aggregate value and store the individual fields through the `alloca` pointer. In other words,
```mlir
func.func @hamiltonian(%obs : !quantum.obs, %p1 : memref<1xf64>, %p2 : memref<3xf64>) {
  quantum.hamiltonian(%p1 : memref<1xf64>) %obs : !quantum.obs
  return
}
```
currently produces:
```mlir
module {
  llvm.func @__catalyst__qis__HamiltonianObs(!llvm.ptr, i64, ...) -> i64
  llvm.func @hamiltonian(%arg0: i64, %arg1: !llvm.ptr, %arg2: !llvm.ptr, %arg3: i64, %arg4: i64, %arg5: i64, %arg6: !llvm.ptr, %arg7: !llvm.ptr, %arg8: i64, %arg9: i64, %arg10: i64) {
    %0 = llvm.mlir.constant(1 : i64) : i64
    %1 = llvm.alloca %0 x !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> : (i64) -> !llvm.ptr
    %2 = llvm.mlir.undef : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
    %3 = llvm.insertvalue %arg1, %2[0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
    %4 = llvm.insertvalue %arg2, %3[1] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
    %5 = llvm.insertvalue %arg3, %4[2] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
    %6 = llvm.insertvalue %arg4, %5[3, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
    %7 = llvm.insertvalue %arg5, %6[4, 0] : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
    %8 = llvm.mlir.constant(1 : i64) : i64
    llvm.store %7, %1 : !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>, !llvm.ptr
    %9 = llvm.call @__catalyst__qis__HamiltonianObs(%1, %8, %arg0) vararg(!llvm.func<i64 (ptr, i64, ...)>) : (!llvm.ptr, i64, i64) -> i64
    llvm.return
  }
}
```
when we could produce:
```mlir
module {
  llvm.func @__catalyst__qis__HamiltonianObs(!llvm.ptr, i64, ...) -> i64
  llvm.func @hamiltonian(%arg0: i64, %arg1: !llvm.ptr, %arg2: !llvm.ptr, %arg3: i64, %arg4: i64, %arg5: i64, %arg6: !llvm.ptr, %arg7: !llvm.ptr, %arg8: i64, %arg9: i64, %arg10: i64) {
    %0 = llvm.mlir.constant(1 : i64) : i64
    %1 = llvm.alloca %0 x !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)> : (i64) -> !llvm.ptr
    // compute field offsets with llvm.getelementptr
    // store the individual fields from the arguments through those pointers
    %2 = llvm.call @__catalyst__qis__HamiltonianObs(%1, %0, %arg0) vararg(!llvm.func<i64 (ptr, i64, ...)>) : (!llvm.ptr, i64, i64) -> i64
    llvm.return
  }
}
```
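The elided `gep`/`store` sequence could look roughly like the following sketch (field indices assume the memref descriptor layout shown above; exact syntax may vary with the MLIR version):

```mlir
%2 = llvm.getelementptr %1[0, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
llvm.store %arg1, %2 : !llvm.ptr, !llvm.ptr
%3 = llvm.getelementptr %1[0, 1] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
llvm.store %arg2, %3 : !llvm.ptr, !llvm.ptr
%4 = llvm.getelementptr %1[0, 2] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
llvm.store %arg3, %4 : i64, !llvm.ptr
%5 = llvm.getelementptr %1[0, 3, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
llvm.store %arg4, %5 : i64, !llvm.ptr
%6 = llvm.getelementptr %1[0, 4, 0] : (!llvm.ptr) -> !llvm.ptr, !llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>
llvm.store %arg5, %6 : i64, !llvm.ptr
```

Each `getelementptr` addresses one field of the descriptor struct in the `alloca`, so no aggregate SSA value is ever materialized.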
Completion of this issue should examine the following:

- Implement the suggested improvements to the `hamiltonian` lowering.
- Conduct a performance evaluation of the two versions of the `hamiltonian` function.
- Are the improvements suggested in [Performance Tips for Frontend Authors](https://llvm.org/docs/Frontend/PerformanceTips.html#avoid-creating-values-of-aggregate-type) worthwhile in this situation? Explain why or why not.
- On initial inspection it is not obvious that replacing the `llvm.insertvalue` instructions with `llvm.store` instructions would result in a positive improvement. If we find that it does in fact yield a speedup, why does it? For example, are there downstream optimizations that LLVM applies?