@georgwiese (Collaborator) commented Dec 18, 2025

Builds on #3562

This PR generates the "optimistic constraints" (which I'd prefer to call "execution constraints") introduced in #3491 for optimistic precompiles. They are currently ignored; actually passing them to the execution engine is left for another PR.

At a high level, this is what happens:

  1. optimistic_literals() computes a map AlgebraicReference -> OptimisticLiteral. It works by finding memory accesses with compile-time addresses (essentially register accesses). The columns representing the data in the memory bus interaction correspond to limbs of register values at some point in time and therefore can be mapped to an execution literal.
  2. BlockEmpiricalConstraints::filtered is used to remove any constraints on columns that cannot be mapped to execution literals. As a result, all empirical constraints can be checked at execution time, but the resulting optimistic precompiles are less effective.
  3. ConstraintGenerator::generate_constraints turns empirical constraints into equality constraints, i.e., constraints of the form (number|algebraic_reference) = (number|algebraic_reference). These constraints can be converted to SymbolicConstraint (to be added to the solver) and to execution constraints via generate_execution_constraints (using the map computed in step 1).
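
For illustration, here is a self-contained sketch of those three steps with toy types (the real `AlgebraicReference`, `OptimisticLiteral`, and constraint types are richer; everything below besides the step structure is an assumption):

```rust
use std::collections::HashMap;

// Toy stand-ins for the real types.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct AlgebraicReference(usize);

#[derive(Clone, Copy, Debug)]
enum OptimisticLiteral {
    // A limb of a register value before a given instruction index.
    RegisterLimb { instruction_idx: usize, address: u32, limb: usize },
}

// An empirical constraint of the form `column = value`.
struct EqualityConstraint {
    column: AlgebraicReference,
    value: u64,
}

fn main() {
    // Step 1: map columns to execution literals (hard-coded here; the real
    // code derives this from memory bus interactions with compile-time
    // addresses).
    let literals: HashMap<AlgebraicReference, OptimisticLiteral> = HashMap::from([(
        AlgebraicReference(7),
        OptimisticLiteral::RegisterLimb { instruction_idx: 0, address: 44, limb: 0 },
    )]);

    // Step 2: drop empirical constraints on columns without a literal.
    let empirical = vec![
        EqualityConstraint { column: AlgebraicReference(7), value: 15 },
        EqualityConstraint { column: AlgebraicReference(9), value: 0 }, // dropped
    ];
    let filtered: Vec<_> = empirical
        .into_iter()
        .filter(|c| literals.contains_key(&c.column))
        .collect();

    // Step 3: each surviving constraint yields a symbolic constraint (for
    // the solver) and an execution constraint (via the map from step 1).
    for c in &filtered {
        let literal = literals[&c.column];
        println!("solver: {:?} = {}; execution: {:?} = {}", c.column, c.value, literal, c.value);
    }
}
```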

To test:
```
POWDR_RESTRICTED_OPTIMISTIC_PRECOMPILES=1 cargo run --bin powdr_openvm -r prove guest-keccak --input 100 --autoprecompiles 1 --apc-candidates-dir keccak100 --mock --optimistic-precompiles
```

Also see the evaluation on reth that I posted in #3366.

@georgwiese force-pushed the optimistic-execution-constraints branch from 53b5d6c to 558856c on December 19, 2025 15:53
@georgwiese (Collaborator, Author)

The code is pretty fleshed out at this point, but it doesn't work, because I'm trying to find the memory limbs from the unoptimized machine. At this point, the multiplicity still depends on is_valid (so I can't easily tell which interactions are sends and which are receives), even register addresses are still symbolic (coming from the dynamic PC lookup), and the values are often complex expressions. One example:

```
(id=1, mult=2013265920 * is_valid, args=[is_load * mem_as + (1 - is_load) * 1, is_load * (mem_ptr_limbs__0 + mem_ptr_limbs__1 * 65536) + (1 - is_load) * rd_rs2_ptr - ((0 + flags__0 * (0 + flags__0 + flags__1 + flags__2 + flags__3 - 2) * 2013265920) * 1 + (0 + flags__2 * (flags__2 - 1) * 1006632961 + flags__1 * (0 + flags__0 + flags__1 + flags__2 + flags__3 - 2) * 2013265920) * 2 + (0 + flags__2 * (0 + flags__0 + flags__1 + flags__2 + flags__3 - 2) * 2013265920) * 3), read_data__0, read_data__1, read_data__2, read_data__3, read_data_aux__base__prev_timestamp])
```

I think what I want instead is to run the full optimization, except that the optimizer should never remove a memory bus interaction (i.e., skip [this code]). Note that at this point, the solver might have already figured out concrete values for some of the memory limbs, or might have determined that some memory limbs are equal. The current algorithm should still work in that case, though.

@georgwiese force-pushed the optimistic-execution-constraints branch from a072147 to b1c42c0 on December 23, 2025 18:43
@georgwiese (Collaborator, Author) commented Dec 23, 2025

OK, things are working, but the effectiveness is significantly reduced.

Analysis of block 0x201ecc = 2105036:

  1. The "guaranteed" precompile has 131 columns.
  2. The optimistic precompile (only_memory_limbs = false, this is the state before this PR) has 83 columns.
  3. The filtered optimistic precompile (only_memory_limbs = true, only empirical constraints on memory pointer limbs are used) has 112 columns.

The 29 columns that are in (3) but not in (2) fall into different categories:

Category 1 (16 / 29): timestamp diff limbs

rs1_aux_cols__base__timestamp_lt_aux__lower_decomp__0_0
read_data_aux__base__timestamp_lt_aux__lower_decomp__0_0
write_base_aux__timestamp_lt_aux__lower_decomp__0_0
read_data_aux__base__timestamp_lt_aux__lower_decomp__0_1
write_base_aux__timestamp_lt_aux__lower_decomp__0_1
read_data_aux__base__timestamp_lt_aux__lower_decomp__0_2
write_base_aux__timestamp_lt_aux__lower_decomp__0_2
read_data_aux__base__timestamp_lt_aux__lower_decomp__0_3
write_base_aux__timestamp_lt_aux__lower_decomp__0_3
rs1_aux_cols__base__timestamp_lt_aux__lower_decomp__0_4
write_base_aux__timestamp_lt_aux__lower_decomp__0_4
write_base_aux__timestamp_lt_aux__lower_decomp__0_5
write_base_aux__timestamp_lt_aux__lower_decomp__0_6
write_base_aux__timestamp_lt_aux__lower_decomp__0_7
reads_aux__0__base__timestamp_lt_aux__lower_decomp__0_9
reads_aux__0__base__timestamp_lt_aux__lower_decomp__0_11

These can be removed because, typically, accessed memory cells have been accessed recently, so the most significant limb of the timestamp diff is 0. AFAIU, we don't have access to the current or previous timestamp (of a memory access) during execution, so this is an inherent limitation of the way we check optimistic precompiles at execution time.
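
As a toy illustration of why the high limbs are empirically zero (the limb width and count here are assumptions, not taken from the actual circuit):

```rust
fn main() {
    // Decompose a timestamp diff into 8-bit limbs, least significant first.
    let decompose = |diff: u32| (0..4).map(|i| (diff >> (8 * i)) & 0xff).collect::<Vec<u32>>();

    // A recently accessed cell has a small diff, so the high limbs are 0
    // and an empirical constraint `limb = 0` holds.
    assert_eq!(decompose(37), vec![37, 0, 0, 0]);

    // A cell that was last accessed long ago breaks that constraint.
    assert_eq!(decompose(1 << 20), vec![0, 0, 16, 0]);
}
```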

Note that all this data is based on very small samples, so it could be that in larger examples, the limbs are actually nonzero much more often, and the empirical constraint would also not be included in (2). (In other words, the 83 columns in (2) might be the result of overfitting to a small sample.)

Category 2 (5 / 29): Derivable columns

For these columns, I believe the solver should be able to derive them deterministically from the given data.

The example that happens here is related to diff markers in the final comparison (BLTU 44 48 -44 1 1):

diff_marker__0_11
diff_marker__1_11
diff_marker__2_11
diff_marker__3_11
diff_val_11

What happens here is that empirically, register 44 is always 15, and register 48 is always a byte. The difference can only be in the final byte, so at least 3 diff markers should be inferred to be 0 (and the 4th could be inferred to equal cmp_result, which is also in (2)).
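
A toy version of that inference (the byte decomposition is real, but the marker convention here is simplified: a marker just flags a differing byte):

```rust
fn main() {
    // BLTU compares rs1 and rs2; empirically rs1 (register 44) is always 15
    // and rs2 (register 48) always fits in a byte, so bytes 1..=3 of both
    // operands are 0 and can never differ.
    let (rs1, rs2): (u32, u32) = (15, 200);
    let bytes = |x: u32| (0..4).map(|i| (x >> (8 * i)) & 0xff).collect::<Vec<u32>>();
    let markers: Vec<u32> = bytes(rs1)
        .iter()
        .zip(bytes(rs2).iter())
        .map(|(a, b)| u32::from(a != b))
        .collect();
    // Only the lowest byte can differ; the other three markers must be 0.
    assert_eq!(markers, vec![1, 0, 0, 0]);
}
```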

Category 3 (8 / 29): Not derivable, but checkable

For these columns, the solver is correct not to infer them from the given data, but we could still formulate (complicated) execution constraints to detect these cases.

The example that happens here is the most significant memory pointer limbs:

mem_ptr_limbs__1_0
mem_ptr_limbs__1_1
mem_ptr_limbs__1_2
mem_ptr_limbs__1_3
mem_ptr_limbs__1_4
mem_ptr_limbs__1_5
mem_ptr_limbs__1_6
mem_ptr_limbs__1_7

These are empirically always the same value because (at least in this sample) the addresses to be copied from / to always start with 0x2000; only the least significant 16 bits vary. Note that even if they weren't equal to a constant on a larger sample, they would likely be equal to each other (the memory pointers are the results of the same base pointer plus some small compile-time offset, so the most significant 16 bits likely stay the same).

With the given data, it is not guaranteed that the most significant 16 bits of the memory pointer are always the same, so the solver is not to blame. But we could actually carry out the addition as part of the execution constraints to check whether this is the case.
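
A sketch of what such an execution constraint could compute (the function shape is hypothetical; the point is only that the addition is cheap to redo at execution time):

```rust
fn main() {
    // mem_ptr = base register + compile-time offset. At execution time the
    // base register value is known, so we can recompute the pointer and
    // check that its most significant 16-bit limb (mem_ptr_limbs__1) still
    // matches the empirically observed constant.
    let high_limb_matches = |base: u32, offset: u32, expected: u32| -> bool {
        (base.wrapping_add(offset) >> 16) == expected
    };

    // Empirically, all copied-from/to addresses start with 0x2000.
    assert!(high_limb_matches(0x2000_1000, 0x40, 0x2000));
    // A carry out of the low 16 bits would invalidate the constraint.
    assert!(!high_limb_matches(0x2000_fff0, 0x40, 0x2000));
}
```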

@leonardoalt (Member)

@georgwiese 3 has a typo I guess? only_memory_limbs = false -> only_memory_limbs = true

Base automatically changed from pgo-range-constraints to main December 26, 2025 20:07
@georgwiese force-pushed the optimistic-execution-constraints branch 2 times, most recently from f02f260 to 7284971 on December 30, 2025 17:07
@georgwiese changed the base branch from main to refactor-partition on December 30, 2025 17:08
@georgwiese force-pushed the optimistic-execution-constraints branch 2 times, most recently from 024039d to 7f65b56 on December 30, 2025 17:16
@georgwiese (Collaborator, Author)

I ran on reth:

Base automatically changed from refactor-partition to main December 31, 2025 09:33
@georgwiese force-pushed the optimistic-execution-constraints branch from 7f65b56 to 367085b on December 31, 2025 20:09
@georgwiese changed the base branch from main to configurable-execution-count-threshold on December 31, 2025 20:09
@georgwiese force-pushed the optimistic-execution-constraints branch 2 times, most recently from 944ac87 to eabf566 on December 31, 2025 21:38
@georgwiese force-pushed the optimistic-execution-constraints branch 4 times, most recently from eb0ed0f to 59c742c on December 31, 2025 23:19
Base automatically changed from configurable-execution-count-threshold to main January 2, 2026 14:40
github-merge-queue bot pushed a commit that referenced this pull request Jan 2, 2026
Cherry-picked from #3501

While running experiments for optimistic precompiles (#3366), I ran into
more memory allocation errors. This PR can be seen as a follow-up to
#3517.

The main idea here is that we no longer materialize the partition for
each block instance when detecting equivalence classes. Instead, the
iterator is passed on, so that new partitions are summarized as soon as
they arrive. See comments below for some nuance.
github-merge-queue bot pushed a commit that referenced this pull request Jan 2, 2026
Cherry-picked from #3501

When collecting empirical constraints, we used to materialize the entire
trace of a given PGO input. This turns out to be too memory-intensive.

We already have a way to combine empirical constraints that were
computed on different data sets, and use it to combine empirical
constraints from different PGO inputs. With this PR, we take a more granular approach: we keep at most 20 segments in memory (this is configurable), compute empirical constraints for every chunk of 20 segments, and combine across those chunks.

The result should be the same, except for some nuance in the range
constraints: Range constraints are computed as the 1st and 99th
percentile. When combining, we simply take the min of the minimums and
the max of the maximums. So, for example, it could be that within one chunk of 20 segments, a PC is executed only once and has an extreme value. That value would then widen the combined range, even though it might not influence the percentiles if they were computed globally.
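
A minimal model of that combination rule (the Range type is a stand-in):

```rust
fn main() {
    // Per-chunk range constraints: [1st percentile, 99th percentile].
    #[derive(Clone, Copy, Debug, PartialEq)]
    struct Range { min: u64, max: u64 }

    // Combining takes the min of the minimums and the max of the maximums.
    let combine = |a: Range, b: Range| Range { min: a.min.min(b.min), max: a.max.max(b.max) };

    let chunk_a = Range { min: 10, max: 20 };
    let chunk_b = Range { min: 10, max: 5_000 }; // one extreme execution
    // The extreme value widens the combined range, even though a global
    // percentile computation might have discarded it.
    assert_eq!(combine(chunk_a, chunk_b), Range { min: 10, max: 5_000 });
}
```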
@georgwiese force-pushed the optimistic-execution-constraints branch from 59c742c to e1ca70b on January 5, 2026 15:51
@georgwiese (Collaborator, Author)

Some analysis on the results presented in the evaluation section of #3366.

Copying the results: on 10 Ethereum blocks, we get:

  1. Guaranteed precompiles: 3.64 average effectiveness
  2. Optimistic precompiles (no restriction): 5.90 average effectiveness
  3. Execution-time checkable optimistic precompiles: 3.96 average effectiveness

Going from left to right in the effectiveness plot (guaranteed / restricted / optimistic), I looked into the columns that are in the optimistic but not the restricted precompile:

  • 0x4e80e0 (36 / 29 / 21): All 8 columns are timestamp or timestamp diff columns.
  • 0x303110 (132 / 110 / 85): Mostly timestamp (diff) columns, some memory pointer limbs and previous value columns (for heap accesses)
  • 0x4e80d4 (54 / 53 / 49): All timestamp (diff) columns
  • 0x200a1c (132 / 126 / 117): One diff marker, otherwise timestamp (diff) columns
  • 0x33d124 (1809 / 1775 / 183): Mostly flags and (heap) memory data. Timestamp-related columns too.
  • 0x25c3c8 (878 / 834 / 505): Mostly memory flags and memory data, but also 32 columns that seem like they should correspond to intermediate registers, like a_mul__2_110.

This is largely consistent with the analysis above. Timestamp-related columns account for a lot of the performance drop. Another big factor is that we filter out heap memory columns where maybe we don't have to: for example, the value of a memory access might empirically always be 0 while the address is unknown at compile time; the address can still be derived from (runtime) register values, though.

@georgwiese force-pushed the optimistic-execution-constraints branch from e1ca70b to febd77d on January 7, 2026 15:25
Base automatically changed from refactor-smybolic-machine-generator to main January 15, 2026 19:22
@Schaeff (Collaborator) left a comment


Partial review

pub enum LocalOptimisticLiteral<A> {
Register(A),
/// A limb of a register
// TODO: The code below ignores the limb index; support it properly

Comment on lines 50 to 52
// The optimizer might introduce new columns, but we'll discard them below.
// As a result, it is fine to clone here (and reuse column IDs across instructions).
column_allocator.clone(),
Collaborator

I don't like this but I don't see how to fix it without a major refactor.

Member

Also @chriseth said ColumnAllocator shouldn't be clonable the other day, don't remember why specifically. Can't this just be borrowed here?

Collaborator

Yes, that was following another review of mine.
I'm not sure how to fix this in general, but maybe this at least reuses an existing API.

Comment on lines +479 to +494
let empirical_constraints = empirical_constraints.filtered(
|block_cell| {
let algebraic_reference = algebraic_references
.get_algebraic_reference(block_cell)
.unwrap();
optimistic_literals.contains_key(algebraic_reference)
},
<A::Instruction as PcStep>::pc_step(),
);

let empirical_constraints =
ConstraintGenerator::<A>::new(empirical_constraints, algebraic_references, &block)
.generate_constraints();

let execution_constraints =
generate_execution_constraints(&empirical_constraints, &optimistic_literals);
Collaborator

We are doing the same thing in filtered and in generate_execution_constraints, the first time returning a filtered set and the second time panicking if something is not in the filtered set.
Also filtered is used only here.
I don't have a particular suggestion here but it seems like we could simplify this, maybe by going straight from BlockEmpiricalConstraints to (Vec<EqualityConstraint>, Vec<OptimisticConstraint>) instead of BlockEmpiricalConstraints -> Vec<EqualityConstraint> -> Vec<OptimisticConstraint>

Collaborator Author

I think I had it like this, with the ConstraintGenerator generating both the symbolic constraints and the execution constraints. With the BlockEmpiricalConstraints -> Vec<EqualityConstraint> -> Vec<OptimisticConstraint> pipeline, things are more decoupled, which I like, especially since we don't care about execution constraints unless restricted precompiles are enabled.

In a context where we already use maps for cases where we only have data
for some keys (pcs...) it seems better to also apply this to the range
constraints.
})
// Map each limb reference to an optimistic literal
.flat_map(|(instruction_idx, concrete_address, limbs)| {
// Borrow column allocator to avoid moving it into the closure
Collaborator Author

Or, as ChatGPT would say: "Nothing you’ve done is a hack — this is standard Rust ownership choreography."


// Generate constraints for optimistic precompiles.
let should_generate_execution_constraints =
optimistic_precompile_config().restrict_optimistic_precompiles;
Collaborator

Started here #3567

@@ -0,0 +1,32 @@
const DEFAULT_EXECUTION_COUNT_THRESHOLD: u64 = 100;
const DEFAULT_MAX_SEGMENTS: usize = 20;
Member

what kind of segments is this referring to?

Collaborator Author

Execution segments / shards; there is a description of the config field below.

@@ -0,0 +1,32 @@
const DEFAULT_EXECUTION_COUNT_THRESHOLD: u64 = 100;
Member

count of what?

Collaborator Author

See description of the config field below

}

pub fn optimistic_precompile_config() -> OptimisticPrecompileConfig {
let execution_count_threshold = std::env::var("POWDR_OP_EXECUTION_COUNT_THRESHOLD")
Member

I think we should avoid env vars for this.

Collaborator Author

In principle I agree, but I think it's fine while the feature is highly experimental. It seems pointless to pollute the CLI and to have to change reth constantly for a feature (optimistic precompiles) that is not yet working end-to-end. For this parameter, I expect that we can set it automatically in the future, but this was the easiest way to run some quick experiments. Also, note that this was an env var before this PR, so I think we can fix this separately.

.ok()
.and_then(|s| s.parse().ok())
.unwrap_or(DEFAULT_MAX_SEGMENTS);
let restricted_optimistic_precompiles =
Member

I think especially for this we should avoid env var.

Collaborator Author

Why especially here? In the end, all optimistic precompiles should always be restricted; the unrestricted version is only for us to know how good we'd be without restrictions. One could argue that for this reason, restricted optimistic APCs should be opt-out, not opt-in.

Member

I meant especially here because the limits have sane default constants



let instruction_idx = match bus_interaction.op() {
MemoryOp::GetPrevious => instruction_idx,
MemoryOp::SetNew => instruction_idx + 1,
Member

why + 1?

Collaborator Author

When we fetch a value at a given instruction index, the semantics is that it is the value before the instruction is executed. So for a memory bus receive (GetPrevious), we want to fetch the value at the current instruction index, and for a memory bus send (SetNew), we want to match it against whatever value we fetch at the next instruction index.
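
A toy timeline of that convention (the array-based register model is just for illustration):

```rust
fn main() {
    // values_before[i] = register value before instruction i executes.
    let values_before = [7u32, 7, 42, 42];
    let i = 1; // instruction 1 writes 42 to the register

    // Receive (GetPrevious): the value before instruction i.
    let previous = values_before[i];
    // Send (SetNew): the value written by instruction i, i.e. the value
    // observed before instruction i + 1; hence the `+ 1`.
    let new = values_before[i + 1];

    assert_eq!((previous, new), (7, 42));
}
```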

&vm_config.bus_map,
);

symbolic_machines
@leonardoalt (Member) Jan 16, 2026

I find this entire sequence kinda hard to read and visually polluted.
There's an iterator chain, with a nested iterator chain and multiple nested flat maps with several closures, and the whole thing is 100+ lines long. I find it hard to parse what's inside what, what's going where, and what is consumed by what.
If I go line by line I can understand what's happening, but I personally wouldn't necessarily prioritize iterator chain maximalism over human readability.

@leonardoalt (Member) Jan 16, 2026

To be more accurate, I think my issue is rather with closure maximalism.
If this block was

    symbolic_machines
        .into_iter()
        .enumerate()
        .flat_map(f)
        .flat_map(g)
        .flat_map(h)

with more readable function descriptions I think I would personally find it a lot more readable.

Collaborator Author

Like this? 729e997

Member

looks good!

@georgwiese changed the base branch from main to limb-access on January 16, 2026 17:45
Base automatically changed from limb-access to main January 16, 2026 18:28
@georgwiese force-pushed the optimistic-execution-constraints branch from 9b5eefb to ebce803 on January 16, 2026 18:55