Summary
The new direct venom codegen (codegen_venom/) in PR #4811 produces ~2500 bytes more bytecode than the legacy pipeline on large contracts (e.g. meta_implementation_v_700). The dominant cause is excess mcopy instructions — 216 on the branch vs 95 on master for that contract.
Root cause
codegen_venom uses a simpler, more mechanical lowering than the legacy codegen. For internal function calls, it always:
- Allocates a staging buffer for each memory-passed argument, copies the argument into it, then passes the staging buffer pointer to the callee
- Allocates a fresh return buffer, passes it to the callee, then copies the result from the return buffer to the final destination(s)
This is correct by construction — the staging buffers prevent the callee frame from overwriting argument data, since ConcretizeMemLocPass can pack callee allocas into the same memory as caller allocas when liveness allows.
The legacy codegen avoids the extra copies by writing arguments directly to the callee frame ($calloca/$palloca mapping) and having callers use the return buffer in-place. This bakes optimization into the lowering, which is harder to verify.
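To make the gap concrete, here is an IR-style sketch (hypothetical names and shapes, following the same notation as the prototype's docstring) of what the two lowerings emit for an internal call taking and returning 64 bytes:

```
; codegen_venom: staging buffer for the argument, fresh return buffer
%arg_buf = alloca 64, ...
mcopy %arg_buf, %arg_src, 64   ; extra copy: argument into staging buffer
%ret_buf = alloca 64, ...
invoke @fn, %ret_buf, %arg_buf
mcopy %dst, %ret_buf, 64       ; extra copy: result to final destination

; legacy: argument written directly into the callee frame, return
; buffer read in place -- neither mcopy is emitted
```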
Proposed fix: escape analysis pass
Rather than complicating the codegen, add a venom pass that eliminates the redundant copies. The pass should run before ConcretizeMemLocPass (while allocas are still abstract) but after the main optimization pipeline.
Core idea
For each alloca used as an invoke return buffer:
- Check that the alloca is used by exactly one `invoke` (as a memory write target) and one or more `mcopy` instructions (as a source)
- Check that the alloca does not escape: it is not passed to any other instruction that could observe it, stored to memory, etc.
- If the alloca has a single `mcopy` consumer: rewrite the invoke to use the `mcopy` destination directly, eliminate the `mcopy`, and let DCE remove the now-unused alloca
- If the alloca has multiple `mcopy` consumers: rewrite the invoke to use the first `mcopy` destination, and rewrite the remaining `mcopy` sources to read from that destination instead
The same analysis applies to argument staging buffers: if a staging alloca is only written by one mcopy/calldatacopy/codecopy and only read by one invoke, the intermediate buffer can be eliminated by passing the source directly.
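The rewrite described above can be sketched as a toy model. This is a hypothetical dict-based IR, not the real venom data structures, and it assumes the safety checks (single invoke writer, non-escaping alloca, matching sizes, same basic block) have already passed:

```python
# Toy model of return-buffer forwarding (hypothetical dict-based IR,
# not the real venom API).

def forward_ret_buf(insts, ret_buf):
    mcopies = [i for i in insts if i["op"] == "mcopy" and i["src"] == ret_buf]
    invoke = next(i for i in insts if i["op"] == "invoke" and ret_buf in i["args"])
    first_dst = mcopies[0]["dst"]
    # Rewrite the invoke to write directly into the first mcopy destination
    invoke["args"] = [first_dst if a == ret_buf else a for a in invoke["args"]]
    out = []
    for i in insts:
        if i is mcopies[0]:
            continue  # the first mcopy is now redundant
        if i["op"] == "alloca" and i["var"] == ret_buf:
            continue  # stands in for DCE removing the dead alloca
        if i["op"] == "mcopy" and i["src"] == ret_buf:
            i = dict(i, src=first_dst)  # remaining copies read from first_dst
        out.append(i)
    return out

insts = [
    {"op": "alloca", "var": "%ret_buf", "size": 64},
    {"op": "invoke", "args": ["@fn", "%ret_buf"]},
    {"op": "mcopy", "dst": "%dst1", "src": "%ret_buf", "size": 64},
    {"op": "mcopy", "dst": "%dst2", "src": "%ret_buf", "size": 64},
]
result = forward_ret_buf(insts, "%ret_buf")
# result: invoke writes to %dst1; the second mcopy now reads from %dst1
```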
Why this is safe
The key insight is that this operates on abstract allocas, not concrete memory offsets. The ConcretizeMemLocPass memory allocator uses liveness analysis to pack allocas — when the pass forwards a destination alloca into an invoke operand, the allocator sees that alloca as live at the invoke point and will not overlap it with the callee's frame allocas. The cross-function liveness tracking in MemLiveness._handle_liveat already handles this correctly:
```python
if inst.opcode == "invoke":
    fn = self.function.ctx.get_function(label)
    live.addmany(self.mem_allocator.mems_used[fn])  # callee frame
    for op in inst.operands:
        live.add(ptr.base_alloca)  # invoke operands
```

Validated data
On meta_implementation_v_700 at O3:
- 31 invoke→mcopy patterns identified as candidates for return buffer forwarding (runtime function only)
- The `mcopy` source alloca in each case has exactly one `invoke` writer and one or more `mcopy` readers
- `MemoryCopyElisionPass` cannot help here (validated experimentally: even with copy tracking never cleared, bytecode size is identical) because the copy source is written by the callee, not by a tracked copy instruction in the caller
What this won't fix
- `mcopy` instructions inside internal functions (callee-side copies to the return buffer). These are a smaller contributor and would require a separate optimization.
- Argument staging copies where the source alloca escapes (e.g., is read by other instructions between the copy and the invoke).
Prototype implementation
"""
Invoke return buffer forwarding pass.
Eliminates redundant mcopy instructions after invoke by forwarding
the mcopy destination directly into the invoke's return buffer operand.
The codegen allocates a temporary alloca for each invoke return buffer,
then copies the result to the final destination(s) with mcopy. This pass
detects when the temporary alloca is only used as a return buffer for one
invoke and as a source for mcopy instructions, and eliminates the
intermediate buffer.
Pattern:
%ret_buf = alloca 64, N
invoke @fn, %ret_buf, ...
mcopy %dst, %ret_buf, 64 # only consumer
Transformed:
invoke @fn, %dst, ... # write directly to dst
(alloca and mcopy removed by DCE)
When there are multiple mcopy consumers:
invoke @fn, %ret_buf, ...
mcopy %dst1, %ret_buf, 64
mcopy %dst2, %ret_buf, 64
Transformed:
invoke @fn, %dst1, ... # write directly to first dst
mcopy %dst2, %dst1, 64 # remaining copies read from dst1
Safety: operates on abstract allocas before ConcretizeMemLocPass.
The memory allocator's liveness analysis already tracks callee frame
allocations at invoke points, so forwarding a destination alloca into
an invoke operand correctly prevents the allocator from overlapping
it with the callee's frame.
"""
from vyper.venom.analysis import BasePtrAnalysis, DFGAnalysis
from vyper.venom.basicblock import IRInstruction, IRLiteral, IRVariable
from vyper.venom.passes.base_pass import IRPass
from vyper.venom.passes.machinery.inst_updater import InstUpdater
class InvokeForwardPass(IRPass):
"""
Forward invoke return buffers to their mcopy destinations.
"""
# Must run before alloca concretization
required_successors = ("ConcretizeMemLocPass",)
def run_pass(self):
self.dfg = self.analyses_cache.request_analysis(DFGAnalysis)
self.base_ptrs = self.analyses_cache.request_analysis(BasePtrAnalysis)
self.updater = InstUpdater(self.dfg)
changed = False
for bb in self.function.get_basic_blocks():
for inst in bb.instructions.copy():
if inst.opcode == "alloca":
changed |= self._try_forward_alloca(inst)
if changed:
self.analyses_cache.invalidate_analysis(BasePtrAnalysis)
self.analyses_cache.invalidate_analysis(DFGAnalysis)
def _try_forward_alloca(self, alloca_inst: IRInstruction) -> bool:
alloca_var = alloca_inst.output
uses = self.dfg.get_uses(alloca_var)
# Classify all uses of this alloca
invoke_use = None
mcopy_uses = []
for use in uses:
if use.opcode == "invoke":
if invoke_use is not None:
return False # multiple invoke uses — not a simple return buffer
invoke_use = use
elif use.opcode == "mcopy":
# Check this alloca is the SOURCE of the mcopy (not the destination)
# mcopy operands: [size, src, dst]
_size, src, _dst = use.operands
if not isinstance(src, IRVariable) or src.name != alloca_var.name:
return False # alloca used as mcopy destination, not a return buffer
mcopy_uses.append(use)
else:
return False # alloca escapes to something other than invoke/mcopy
if invoke_use is None or len(mcopy_uses) == 0:
return False
# Verify all mcopy sizes match the alloca size
alloca_size = alloca_inst.operands[0]
assert isinstance(alloca_size, IRLiteral)
for mc in mcopy_uses:
mc_size = mc.operands[0]
if not isinstance(mc_size, IRLiteral):
return False
if mc_size.value != alloca_size.value:
return False
# Verify all mcopy destinations have a single known base pointer
# at offset 0 (i.e., they are plain alloca pointers, not GEP'd)
for mc in mcopy_uses:
_size, _src, dst = mc.operands
if not isinstance(dst, IRVariable):
return False
ptr = self.base_ptrs.ptr_from_op(dst)
if ptr is None or ptr.offset != 0:
return False
# Verify all mcopy's are in the same basic block as the invoke,
# and come after it.
invoke_bb = invoke_use.parent
try:
invoke_idx = invoke_bb.instructions.index(invoke_use)
except ValueError:
return False # pragma: nocover
for mc in mcopy_uses:
if mc.parent is not invoke_bb:
return False
try:
mc_idx = invoke_bb.instructions.index(mc)
except ValueError:
return False # pragma: nocover
if mc_idx <= invoke_idx:
return False
# Pick the first mcopy (by position) as the forwarding target
mcopy_uses.sort(key=lambda mc: invoke_bb.instructions.index(mc))
first_mc = mcopy_uses[0]
_size, _src, first_dst = first_mc.operands
# Rewrite the invoke: replace the alloca operand with first_dst
new_invoke_operands = []
for op in invoke_use.operands:
if isinstance(op, IRVariable) and op.name == alloca_var.name:
new_invoke_operands.append(first_dst)
else:
new_invoke_operands.append(op)
self.updater.update(invoke_use, "invoke", new_invoke_operands)
# Remove the first mcopy (invoke now writes directly to first_dst)
self.updater.nop(first_mc)
# Rewrite remaining mcopy's: change source from alloca to first_dst
for mc in mcopy_uses[1:]:
new_operands = [mc.operands[0], first_dst, mc.operands[2]]
self.updater.update(mc, "mcopy", new_operands)
return TrueReferences
- PR #4811 (feat[venom]: add direct venom pipeline)
- `vyper/venom/passes/memory_copy_elision.py`: existing copy elision pass (cannot handle this pattern)
- `vyper/venom/passes/concretize_mem_loc.py`: alloca concretization with liveness-based packing
- `vyper/codegen_venom/expr.py:_lower_internal_call`: staging buffer allocation