venom: invoke return buffer forwarding via escape analysis #4847

@vyperteam-bot

Description

Summary

The new direct venom codegen (codegen_venom/) in PR #4811 produces ~2500 bytes more bytecode than the legacy pipeline on large contracts (e.g. meta_implementation_v_700). The dominant cause is excess mcopy instructions — 216 on the branch vs 95 on master for that contract.

Root cause

codegen_venom uses a simpler, more mechanical lowering than the legacy codegen. For internal function calls, it always:

  1. Allocates a staging buffer for each memory-passed argument, copies the argument into it, then passes the staging buffer pointer to the callee
  2. Allocates a fresh return buffer, passes it to the callee, then copies the result from the return buffer to the final destination(s)

This is correct by construction — the staging buffers prevent the callee frame from overwriting argument data, since ConcretizeMemLocPass can pack callee allocas into the same memory as caller allocas when liveness allows.

The legacy codegen avoids the extra copies by writing arguments directly to the callee frame ($calloca/$palloca mapping) and having callers use the return buffer in-place. This bakes optimization into the lowering, which is harder to verify.

Proposed fix: escape analysis pass

Rather than complicating the codegen, add a venom pass that eliminates the redundant copies. The pass should run before ConcretizeMemLocPass (while allocas are still abstract) but after the main optimization pipeline.

Core idea

For each alloca used as an invoke return buffer:

  1. Check that the alloca is only used by exactly one invoke (as a memory write target) and one or more mcopy instructions (as a source)
  2. Check that the alloca does not escape — it is not passed to any other instruction that could observe it, stored to memory, etc.
  3. If the alloca has a single mcopy consumer: rewrite the invoke to use the mcopy destination directly, eliminate the mcopy, and let DCE remove the now-unused alloca
  4. If the alloca has multiple mcopy consumers: rewrite the invoke to use the first mcopy destination, rewrite remaining mcopy sources to read from the first destination instead
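As a standalone illustration of steps 3 and 4, here is a sketch over a toy tuple-based IR (hypothetical representation and helper name; the real pass operates on venom IRInstruction objects):

```python
# Toy IR: instructions are tuples. invoke: ("invoke", label, *ptrs);
# mcopy: ("mcopy", dst, src, size). Escape checks are assumed done.
def forward_ret_buf(insts, ret_buf):
    """Forward `ret_buf` out of its invoke into the first mcopy destination."""
    invoke_idx = next(
        i for i, ins in enumerate(insts)
        if ins[0] == "invoke" and ret_buf in ins
    )
    copies = [
        i for i, ins in enumerate(insts)
        if ins[0] == "mcopy" and ins[2] == ret_buf
    ]
    assert copies and all(i > invoke_idx for i in copies)

    first_dst = insts[copies[0]][1]
    out = []
    for i, ins in enumerate(insts):
        if i == invoke_idx:
            # invoke now writes directly into the first destination
            out.append(tuple(first_dst if op == ret_buf else op for op in ins))
        elif i == copies[0]:
            continue  # the first mcopy is now redundant
        elif i in copies:
            # remaining copies read from the first destination instead
            out.append((ins[0], ins[1], first_dst, ins[3]))
        else:
            out.append(ins)
    return out
```

With two consumers, `[invoke @fn %ret_buf; mcopy %dst1 %ret_buf; mcopy %dst2 %ret_buf]` becomes `[invoke @fn %dst1; mcopy %dst2 %dst1]`, matching step 4.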

The same analysis applies to argument staging buffers: if a staging alloca is only written by one mcopy/calldatacopy/codecopy and only read by one invoke, the intermediate buffer can be eliminated by passing the source directly.
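The symmetric argument-staging case can be sketched the same way (toy tuple IR with hypothetical operand layout; copy instructions here are `(opcode, dst, src, size)`):

```python
# Pattern:   mcopy %staging, %src, N ; invoke @fn, ..., %staging
# Rewrite:   invoke @fn, ..., %src   (staging alloca then removed by DCE)
def forward_arg_staging(insts, staging):
    writes = [i for i, ins in enumerate(insts)
              if ins[0] in ("mcopy", "calldatacopy", "codecopy")
              and ins[1] == staging]
    reads = [i for i, ins in enumerate(insts)
             if ins[0] == "invoke" and staging in ins[1:]]
    # bail unless exactly one copy writer and one invoke reader
    if len(writes) != 1 or len(reads) != 1:
        return insts
    src = insts[writes[0]][2]  # source of the single staging copy
    out = []
    for i, ins in enumerate(insts):
        if i == writes[0]:
            continue  # drop the staging copy
        elif i == reads[0]:
            # pass the original source pointer to the callee directly
            out.append(tuple(src if op == staging else op for op in ins))
        else:
            out.append(ins)
    return out
```

A real implementation would additionally need the escape check from the previous paragraph: no other instruction may read or write the source between the copy and the invoke.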

Why this is safe

The key insight is that this operates on abstract allocas, not concrete memory offsets. The ConcretizeMemLocPass memory allocator uses liveness analysis to pack allocas: when the pass forwards a destination alloca into an invoke operand, the allocator sees that alloca as live at the invoke point and will not overlap it with the callee's frame allocas. The cross-function liveness tracking in MemLiveness._handle_liveat already handles this correctly (abridged excerpt; `label` and `ptr` are resolved in the surrounding code):

if inst.opcode == "invoke":
    fn = self.function.ctx.get_function(label)
    live.addmany(self.mem_allocator.mems_used[fn])  # callee frame
    for op in inst.operands:
        live.add(ptr.base_alloca)  # invoke operands stay live
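A toy model of the allocator's decision makes the safety argument concrete (simplified Python, not the real MemLiveness/ConcretizeMemLocPass logic): an alloca may share memory with the callee frame only when it is not live at the invoke, and forwarding makes the destination an invoke operand, hence live.

```python
# Simplified model: an alloca is live at an invoke if it is one of the
# invoke's pointer operands or is read later in the function.
def live_at_invoke(invoke_operands, later_reads):
    return set(invoke_operands) | set(later_reads)

def may_pack_with_callee_frame(alloca, invoke_operands, later_reads):
    # the allocator may overlap `alloca` with callee-frame memory only
    # if `alloca` is dead at the invoke point
    return alloca not in live_at_invoke(invoke_operands, later_reads)
```

Before forwarding, `%dst` is only written after the invoke, so a naive allocator could overlap it with the callee frame; after forwarding, `%dst` appears as an invoke operand and stays disjoint. This is exactly why the rewrite must happen while allocas are still abstract.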

Validated data

On meta_implementation_v_700 at O3:

  • 31 invoke→mcopy patterns identified as candidates for return buffer forwarding (runtime function only)
  • The mcopy source alloca in each case has exactly 1 invoke writer and 1+ mcopy readers
  • MemoryCopyElisionPass cannot help here (validated experimentally — even with copy tracking never cleared, bytecode size is identical) because the copy source is written by the callee, not by a tracked copy instruction in the caller

What this won't fix

  • mcopy instructions inside internal functions (callee-side copies to the return buffer). These are a smaller contributor and would require a separate optimization.
  • Argument staging copies where the source alloca escapes (e.g., is read by other instructions between the copy and the invoke).

Prototype implementation

"""
Invoke return buffer forwarding pass.

Eliminates redundant mcopy instructions after invoke by forwarding
the mcopy destination directly into the invoke's return buffer operand.

The codegen allocates a temporary alloca for each invoke return buffer,
then copies the result to the final destination(s) with mcopy. This pass
detects when the temporary alloca is only used as a return buffer for one
invoke and as a source for mcopy instructions, and eliminates the
intermediate buffer.

Pattern:
    %ret_buf = alloca 64, N
    invoke @fn, %ret_buf, ...
    mcopy %dst, %ret_buf, 64           # only consumer

Transformed:
    invoke @fn, %dst, ...              # write directly to dst
    (alloca and mcopy removed by DCE)

When there are multiple mcopy consumers:
    invoke @fn, %ret_buf, ...
    mcopy %dst1, %ret_buf, 64
    mcopy %dst2, %ret_buf, 64

Transformed:
    invoke @fn, %dst1, ...             # write directly to first dst
    mcopy %dst2, %dst1, 64             # remaining copies read from dst1

Safety: operates on abstract allocas before ConcretizeMemLocPass.
The memory allocator's liveness analysis already tracks callee frame
allocations at invoke points, so forwarding a destination alloca into
an invoke operand correctly prevents the allocator from overlapping
it with the callee's frame.
"""

from vyper.venom.analysis import BasePtrAnalysis, DFGAnalysis
from vyper.venom.basicblock import IRInstruction, IRLiteral, IRVariable
from vyper.venom.passes.base_pass import IRPass
from vyper.venom.passes.machinery.inst_updater import InstUpdater


class InvokeForwardPass(IRPass):
    """
    Forward invoke return buffers to their mcopy destinations.
    """

    # Must run before alloca concretization
    required_successors = ("ConcretizeMemLocPass",)

    def run_pass(self):
        self.dfg = self.analyses_cache.request_analysis(DFGAnalysis)
        self.base_ptrs = self.analyses_cache.request_analysis(BasePtrAnalysis)
        self.updater = InstUpdater(self.dfg)

        changed = False
        for bb in self.function.get_basic_blocks():
            for inst in bb.instructions.copy():
                if inst.opcode == "alloca":
                    changed |= self._try_forward_alloca(inst)

        if changed:
            self.analyses_cache.invalidate_analysis(BasePtrAnalysis)
            self.analyses_cache.invalidate_analysis(DFGAnalysis)

    def _try_forward_alloca(self, alloca_inst: IRInstruction) -> bool:
        alloca_var = alloca_inst.output
        uses = self.dfg.get_uses(alloca_var)

        # Classify all uses of this alloca
        invoke_use = None
        mcopy_uses = []

        for use in uses:
            if use.opcode == "invoke":
                if invoke_use is not None:
                    return False  # multiple invoke uses — not a simple return buffer
                invoke_use = use
            elif use.opcode == "mcopy":
                # Check this alloca is the SOURCE of the mcopy (not the destination)
                # mcopy operands: [size, src, dst]
                _size, src, _dst = use.operands
                if not isinstance(src, IRVariable) or src.name != alloca_var.name:
                    return False  # alloca used as mcopy destination, not a return buffer
                mcopy_uses.append(use)
            else:
                return False  # alloca escapes to something other than invoke/mcopy

        if invoke_use is None or len(mcopy_uses) == 0:
            return False

        # Verify all mcopy sizes match the alloca size
        alloca_size = alloca_inst.operands[0]
        assert isinstance(alloca_size, IRLiteral)
        for mc in mcopy_uses:
            mc_size = mc.operands[0]
            if not isinstance(mc_size, IRLiteral):
                return False
            if mc_size.value != alloca_size.value:
                return False

        # Verify all mcopy destinations have a single known base pointer
        # at offset 0 (i.e., they are plain alloca pointers, not GEP'd)
        for mc in mcopy_uses:
            _size, _src, dst = mc.operands
            if not isinstance(dst, IRVariable):
                return False
            ptr = self.base_ptrs.ptr_from_op(dst)
            if ptr is None or ptr.offset != 0:
                return False

        # Verify all mcopys are in the same basic block as the invoke
        # and come after it.
        invoke_bb = invoke_use.parent
        try:
            invoke_idx = invoke_bb.instructions.index(invoke_use)
        except ValueError:
            return False  # pragma: nocover

        for mc in mcopy_uses:
            if mc.parent is not invoke_bb:
                return False
            try:
                mc_idx = invoke_bb.instructions.index(mc)
            except ValueError:
                return False  # pragma: nocover
            if mc_idx <= invoke_idx:
                return False

        # Pick the first mcopy (by position) as the forwarding target.
        # TODO: also verify that no instruction between the invoke and each
        # mcopy reads or writes the destination buffers; forwarding moves
        # the write to first_dst earlier in the block.
        mcopy_uses.sort(key=lambda mc: invoke_bb.instructions.index(mc))
        first_mc = mcopy_uses[0]
        _size, _src, first_dst = first_mc.operands

        # Rewrite the invoke: replace the alloca operand with first_dst
        new_invoke_operands = []
        for op in invoke_use.operands:
            if isinstance(op, IRVariable) and op.name == alloca_var.name:
                new_invoke_operands.append(first_dst)
            else:
                new_invoke_operands.append(op)
        self.updater.update(invoke_use, "invoke", new_invoke_operands)

        # Remove the first mcopy (invoke now writes directly to first_dst)
        self.updater.nop(first_mc)

        # Rewrite the remaining mcopys: change source from alloca to first_dst
        for mc in mcopy_uses[1:]:
            new_operands = [mc.operands[0], first_dst, mc.operands[2]]
            self.updater.update(mc, "mcopy", new_operands)

        return True
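Assuming a list-based pass schedule (hypothetical structure; the actual pipeline wiring may differ), the ordering constraint from the proposal would look like:

```python
# hypothetical pass ordering: InvokeForwardPass must see abstract allocas,
# so it runs after the main optimization passes but before concretization
PASSES = [
    # ... main optimization pipeline ...
    MemoryCopyElisionPass,
    InvokeForwardPass,     # forward return buffers (this issue)
    ConcretizeMemLocPass,  # packs allocas using liveness
    # ... stack scheduling / emission ...
]
```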

References

  • PR #4811 — feat[venom]: add direct venom pipeline
  • vyper/venom/passes/memory_copy_elision.py — existing copy elision pass (cannot handle this pattern)
  • vyper/venom/passes/concretize_mem_loc.py — alloca concretization with liveness-based packing
  • vyper/codegen_venom/expr.py:_lower_internal_call — staging buffer allocation
