venom: invoke return buffer forwarding via escape analysis #4847

@vyperteam-bot

Description

Summary

The new direct venom codegen (codegen_venom/) in PR #4811 produces ~2500 bytes more bytecode than the legacy pipeline on large contracts (e.g. meta_implementation_v_700). The dominant cause is excess mcopy instructions — 216 on the branch vs 95 on master for that contract.

Root cause

codegen_venom uses a simpler, more mechanical lowering than the legacy codegen. For internal function calls, it always:

  1. Allocates a staging buffer for each memory-passed argument, copies the argument into it, then passes the staging buffer pointer to the callee
  2. Allocates a fresh return buffer, passes it to the callee, then copies the result from the return buffer to the final destination(s)

This is correct by construction — the staging buffers prevent the callee frame from overwriting argument data, since ConcretizeMemLocPass can pack callee allocas into the same memory as caller allocas when liveness allows.

The legacy codegen avoids the extra copies by writing arguments directly to the callee frame ($calloca/$palloca mapping) and having callers use the return buffer in-place. This bakes optimization into the lowering, which is harder to verify.

Proposed fix: escape analysis pass

Rather than complicating the codegen, add a venom pass that eliminates the redundant copies. The pass should run before ConcretizeMemLocPass (while allocas are still abstract) but after the main optimization pipeline.

Core idea

For each alloca used as an invoke return buffer:

  1. Check that the alloca is only used by exactly one invoke (as a memory write target) and one or more mcopy instructions (as a source)
  2. Check that the alloca does not escape — it is not passed to any other instruction that could observe it, stored to memory, etc.
  3. If the alloca has a single mcopy consumer: rewrite the invoke to use the mcopy destination directly, eliminate the mcopy, and let DCE remove the now-unused alloca
  4. If the alloca has multiple mcopy consumers: rewrite the invoke to use the first mcopy destination, rewrite remaining mcopy sources to read from the first destination instead
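As a standalone illustration of steps 3 and 4, here is a sketch over a toy tuple-based IR (hypothetical representation and helper name; the real pass operates on venom IRInstruction objects):

```python
# Toy IR: instructions are tuples. invoke: ("invoke", label, *ptrs);
# mcopy: ("mcopy", dst, src, size). Escape checks are assumed done.
def forward_ret_buf(insts, ret_buf):
    """Forward `ret_buf` out of its invoke into the first mcopy destination."""
    invoke_idx = next(
        i for i, ins in enumerate(insts)
        if ins[0] == "invoke" and ret_buf in ins
    )
    copies = [
        i for i, ins in enumerate(insts)
        if ins[0] == "mcopy" and ins[2] == ret_buf
    ]
    assert copies and all(i > invoke_idx for i in copies)

    first_dst = insts[copies[0]][1]
    out = []
    for i, ins in enumerate(insts):
        if i == invoke_idx:
            # invoke now writes directly into the first destination
            out.append(tuple(first_dst if op == ret_buf else op for op in ins))
        elif i == copies[0]:
            continue  # the first mcopy is now redundant
        elif i in copies:
            # remaining copies read from the first destination instead
            out.append((ins[0], ins[1], first_dst, ins[3]))
        else:
            out.append(ins)
    return out
```

With two consumers, `[invoke @fn %ret_buf; mcopy %dst1 %ret_buf; mcopy %dst2 %ret_buf]` becomes `[invoke @fn %dst1; mcopy %dst2 %dst1]`, matching step 4.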

The same analysis applies to argument staging buffers: if a staging alloca is only written by one mcopy/calldatacopy/codecopy and only read by one invoke, the intermediate buffer can be eliminated by passing the source directly.
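The symmetric argument-staging case can be sketched the same way (toy tuple IR with hypothetical operand layout; copy instructions here are `(opcode, dst, src, size)`):

```python
# Pattern:   mcopy %staging, %src, N ; invoke @fn, ..., %staging
# Rewrite:   invoke @fn, ..., %src   (staging alloca then removed by DCE)
def forward_arg_staging(insts, staging):
    writes = [i for i, ins in enumerate(insts)
              if ins[0] in ("mcopy", "calldatacopy", "codecopy")
              and ins[1] == staging]
    reads = [i for i, ins in enumerate(insts)
             if ins[0] == "invoke" and staging in ins[1:]]
    # bail unless exactly one copy writer and one invoke reader
    if len(writes) != 1 or len(reads) != 1:
        return insts
    src = insts[writes[0]][2]  # source of the single staging copy
    out = []
    for i, ins in enumerate(insts):
        if i == writes[0]:
            continue  # drop the staging copy
        elif i == reads[0]:
            # pass the original source pointer to the callee directly
            out.append(tuple(src if op == staging else op for op in ins))
        else:
            out.append(ins)
    return out
```

A real implementation would additionally need the escape check from the previous paragraph: no other instruction may read or write the source between the copy and the invoke.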

Why this is safe

The key insight is that this operates on abstract allocas, not concrete memory offsets. The ConcretizeMemLocPass memory allocator uses liveness analysis to pack allocas: when the pass forwards a destination alloca into an invoke operand, the allocator sees that alloca as live at the invoke point and will not overlap it with the callee's frame allocas. The cross-function liveness tracking in MemLiveness._handle_liveat already handles this correctly (abridged excerpt; `label` and `ptr` are resolved in the surrounding code):

if inst.opcode == "invoke":
    fn = self.function.ctx.get_function(label)
    live.addmany(self.mem_allocator.mems_used[fn])  # callee frame
    for op in inst.operands:
        live.add(ptr.base_alloca)  # invoke operands stay live
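A toy model of the allocator's decision makes the safety argument concrete (simplified Python, not the real MemLiveness/ConcretizeMemLocPass logic): an alloca may share memory with the callee frame only when it is not live at the invoke, and forwarding makes the destination an invoke operand, hence live.

```python
# Simplified model: an alloca is live at an invoke if it is one of the
# invoke's pointer operands or is read later in the function.
def live_at_invoke(invoke_operands, later_reads):
    return set(invoke_operands) | set(later_reads)

def may_pack_with_callee_frame(alloca, invoke_operands, later_reads):
    # the allocator may overlap `alloca` with callee-frame memory only
    # if `alloca` is dead at the invoke point
    return alloca not in live_at_invoke(invoke_operands, later_reads)
```

Before forwarding, `%dst` is only written after the invoke, so a naive allocator could overlap it with the callee frame; after forwarding, `%dst` appears as an invoke operand and stays disjoint. This is exactly why the rewrite must happen while allocas are still abstract.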

Validated data

On meta_implementation_v_700 at O3:

  • 31 invoke→mcopy patterns identified as candidates for return buffer forwarding (runtime function only)
  • The mcopy source alloca in each case has exactly 1 invoke writer and 1+ mcopy readers
  • MemoryCopyElisionPass cannot help here (validated experimentally — even with copy tracking never cleared, bytecode size is identical) because the copy source is written by the callee, not by a tracked copy instruction in the caller

What this won't fix

  • mcopy instructions inside internal functions (callee-side copies to the return buffer). These are a smaller contributor and would require a separate optimization.
  • Argument staging copies where the source alloca escapes (e.g., is read by other instructions between the copy and the invoke).

Prototype implementation

"""
Invoke return buffer forwarding pass.

Eliminates redundant mcopy instructions after invoke by forwarding
the mcopy destination directly into the invoke's return buffer operand.

The codegen allocates a temporary alloca for each invoke return buffer,
then copies the result to the final destination(s) with mcopy. This pass
detects when the temporary alloca is only used as a return buffer for one
invoke and as a source for mcopy instructions, and eliminates the
intermediate buffer.

Pattern:
    %ret_buf = alloca 64, N
    invoke @fn, %ret_buf, ...
    mcopy %dst, %ret_buf, 64           # only consumer

Transformed:
    invoke @fn, %dst, ...              # write directly to dst
    (alloca and mcopy removed by DCE)

When there are multiple mcopy consumers:
    invoke @fn, %ret_buf, ...
    mcopy %dst1, %ret_buf, 64
    mcopy %dst2, %ret_buf, 64

Transformed:
    invoke @fn, %dst1, ...             # write directly to first dst
    mcopy %dst2, %dst1, 64             # remaining copies read from dst1

Safety: operates on abstract allocas before ConcretizeMemLocPass.
The memory allocator's liveness analysis already tracks callee frame
allocations at invoke points, so forwarding a destination alloca into
an invoke operand correctly prevents the allocator from overlapping
it with the callee's frame.
"""

from vyper.venom.analysis import BasePtrAnalysis, DFGAnalysis
from vyper.venom.basicblock import IRInstruction, IRLiteral, IRVariable
from vyper.venom.passes.base_pass import IRPass
from vyper.venom.passes.machinery.inst_updater import InstUpdater


class InvokeForwardPass(IRPass):
    """
    Forward invoke return buffers to their mcopy destinations.
    """

    # Must run before alloca concretization
    required_successors = ("ConcretizeMemLocPass",)

    def run_pass(self):
        self.dfg = self.analyses_cache.request_analysis(DFGAnalysis)
        self.base_ptrs = self.analyses_cache.request_analysis(BasePtrAnalysis)
        self.updater = InstUpdater(self.dfg)

        changed = False
        for bb in self.function.get_basic_blocks():
            for inst in bb.instructions.copy():
                if inst.opcode == "alloca":
                    changed |= self._try_forward_alloca(inst)

        if changed:
            self.analyses_cache.invalidate_analysis(BasePtrAnalysis)
            self.analyses_cache.invalidate_analysis(DFGAnalysis)

    def _try_forward_alloca(self, alloca_inst: IRInstruction) -> bool:
        alloca_var = alloca_inst.output
        uses = self.dfg.get_uses(alloca_var)

        # Classify all uses of this alloca
        invoke_use = None
        mcopy_uses = []

        for use in uses:
            if use.opcode == "invoke":
                if invoke_use is not None:
                    return False  # multiple invoke uses — not a simple return buffer
                invoke_use = use
            elif use.opcode == "mcopy":
                # Check this alloca is the SOURCE of the mcopy (not the destination)
                # mcopy operands: [size, src, dst]
                _size, src, _dst = use.operands
                if not isinstance(src, IRVariable) or src.name != alloca_var.name:
                    return False  # alloca used as mcopy destination, not a return buffer
                mcopy_uses.append(use)
            else:
                return False  # alloca escapes to something other than invoke/mcopy

        if invoke_use is None or len(mcopy_uses) == 0:
            return False

        # Verify all mcopy sizes match the alloca size
        alloca_size = alloca_inst.operands[0]
        assert isinstance(alloca_size, IRLiteral)
        for mc in mcopy_uses:
            mc_size = mc.operands[0]
            if not isinstance(mc_size, IRLiteral):
                return False
            if mc_size.value != alloca_size.value:
                return False

        # Verify all mcopy destinations have a single known base pointer
        # at offset 0 (i.e., they are plain alloca pointers, not GEP'd)
        for mc in mcopy_uses:
            _size, _src, dst = mc.operands
            if not isinstance(dst, IRVariable):
                return False
            ptr = self.base_ptrs.ptr_from_op(dst)
            if ptr is None or ptr.offset != 0:
                return False

        # Verify all mcopys are in the same basic block as the invoke
        # and come after it.
        invoke_bb = invoke_use.parent
        try:
            invoke_idx = invoke_bb.instructions.index(invoke_use)
        except ValueError:
            return False  # pragma: nocover

        for mc in mcopy_uses:
            if mc.parent is not invoke_bb:
                return False
            try:
                mc_idx = invoke_bb.instructions.index(mc)
            except ValueError:
                return False  # pragma: nocover
            if mc_idx <= invoke_idx:
                return False

        # Pick the first mcopy (by position) as the forwarding target.
        # TODO: also verify that no instruction between the invoke and each
        # mcopy reads or writes the destination buffers; forwarding moves
        # the write to first_dst earlier in the block.
        mcopy_uses.sort(key=lambda mc: invoke_bb.instructions.index(mc))
        first_mc = mcopy_uses[0]
        _size, _src, first_dst = first_mc.operands

        # Rewrite the invoke: replace the alloca operand with first_dst
        new_invoke_operands = []
        for op in invoke_use.operands:
            if isinstance(op, IRVariable) and op.name == alloca_var.name:
                new_invoke_operands.append(first_dst)
            else:
                new_invoke_operands.append(op)
        self.updater.update(invoke_use, "invoke", new_invoke_operands)

        # Remove the first mcopy (invoke now writes directly to first_dst)
        self.updater.nop(first_mc)

        # Rewrite the remaining mcopys: change source from alloca to first_dst
        for mc in mcopy_uses[1:]:
            new_operands = [mc.operands[0], first_dst, mc.operands[2]]
            self.updater.update(mc, "mcopy", new_operands)

        return True
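Assuming a list-based pass schedule (hypothetical structure; the actual pipeline wiring may differ), the ordering constraint from the proposal would look like:

```python
# hypothetical pass ordering: InvokeForwardPass must see abstract allocas,
# so it runs after the main optimization passes but before concretization
PASSES = [
    # ... main optimization pipeline ...
    MemoryCopyElisionPass,
    InvokeForwardPass,     # forward return buffers (this issue)
    ConcretizeMemLocPass,  # packs allocas using liveness
    # ... stack scheduling / emission ...
]
```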

References

  • PR #4811 — feat[venom]: add direct venom pipeline
  • vyper/venom/passes/memory_copy_elision.py — existing copy elision pass (cannot handle this pattern)
  • vyper/venom/passes/concretize_mem_loc.py — alloca concretization with liveness-based packing
  • vyper/codegen_venom/expr.py:_lower_internal_call — staging buffer allocation
