Skip to content

Hybrid apc witgen #3482

@Schaeff

Description

@Schaeff

Context

Our current apc witgen relies on software witgen. Therefore, even if an apc provides a high reduction in columns and constraints, its cost in execution and tracegen is not reduced versus the software execution.

Witgen is split in two parts:

  • execution is sequential, and therefore optimised for speed. For each instruction, it computes and outputs the minimal amount of data required to derive a trace for this instruction call. We call this a record.
  • tracegen is parallel across all calls of a given instruction. For each record, it derives trace cells that prove that call.

We discussed ideas to alleviate this with

  • fully automatic tracegen
  • optimisations on top of software tracegen

This issue describes a way to make apc witgen (execution + tracegen) cheaper than the software equivalent using a hybrid approach, and ideally achieve similar savings to the ones we achieve in column count.

Current approach

Execution

Each time we call the apc, we generate one record for each original instruction in the block.

Tracegen

For each chip, we take all the records that point to it (in openvm, a record arena), and run tracegen for them into an intermediate table.
Then, we take these intermediate tables and copy over some of their columns to our final apc table.

Intuition

Tracegen

GPU

Following investigation by @qwang98, during gpu apc tracegen, 80% of the time is spent copying the records to the gpu, while the remaining 20% is spent actually running trace generation.

When it comes to the 20%, we can reduce that by 75% by removing the intermediate tables and running tracegen directly over the final apc table.

CPU

On CPU, we have not identified a way to reduce the cost of tracegen. We can assume that it is proportional to the number of records we have to process (the 20% of the GPU case).

Common

Whether on GPU or CPU, we pay a cost linear in the number of records to process: on GPU, because we have to send them to the device. On CPU, because we have to call tracegen for each record.

@qwang98 observed that it is possible that some original instruction has all of its columns removed during apc compilation. In this case, we could skip it entirely during trace generation.

Now, we could generalise this approach by introducing hints to derive some columns based on others. This is basically autotracegen, but we can do this in a partial way:

  • for example, we identify that instruction 3 in the block has all of its columns removed, except 2. During apc compilation, we can try to find a way to derive these two columns from other columns, (possibly limiting the search to instructions 1 and 2?), and add this to the derived columns which we already use for other purposes. As a result, we can skip sending one record per call for instruction 3 to the gpu (saving from the 80%) and instead rely on the symbolic derived column.
  • we identify that instruction 4 has 90% of its columns preserved in the apc. In this case, we can decide that it's fine to send over the record and run the software trace generation, instead of trying to find a way to derive the columns.

We end up with a hybrid trace generation strategy which can take advantage of both the software tracegen (which may be highly optimised for some instructions, and probably optimal when only a few columns get optimised away) and auto tracegen (which requires sending a fixed amount of data to the gpu).

The trade off between the two approaches is:

  • when running software, we need to send over more data (proportional to the number of calls of the apc), but we can use a more optimal trace generation, but some of the output will be discarded
  • when deriving, we need to send less data (constant in the number of calls of the apc) but we use a less optimal trace generation (as it's derived from the circuit), but none of the output will be discarded.

Execution

During execution, we currently generate records for all original instructions. However, if an original instruction's tracegen does not rely on a record, this is wasted effort, so in theory we should not generate a record at all.
In practice, this is hard, because the execution of the original instruction does not only create the record, it also updates the execution context, for example to keep track of the timestamp. We still need this latter work to happen. Therefore, the idea is to introduce a dummy record collector for each original table. During execution, if a given original instruction's record will not be used (because we want to rely on derived columns instead, or in the extreme case, all columns are removed), we can write it to this dummy record collector, which gets discarded after execution. As a result, the record collectors for each original table only contain records that are useful, and we minimize the amount of data to send to the gpu.

Open questions

  • For typical blocks, what is the distribution of the amount of removed columns during apc compilation? This would inform how much we can save in tracegen time.
  • How hard would it be to implement auto tracegen for an instruction which only has a few remaining columns?
  • How many times do we need to call the apc so that sending the derivation method (fixed cost) is cheaper than sending the records (variable cost in the number of calls)?

Sub-issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions