[wasm] Jiterpreter tracking issue #78428

Open
@kg

Pending work for the jiterpreter (#76477):

  • Introduce a Jiterpreter CI lane that sets all the tiering thresholds low so that we flush out any issues with obscure interp opcodes or cold code
  • Investigate integrating jit calls directly into compiled traces
  • Investigate integrating icalls directly into compiled traces
  • Run statistics on blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
  • Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths (the original issue illustrates this with an image)
    • Maintain a table of vtable slots containing interp_entry _in wrappers, then patch the vtables (design pending)
    • When generating dedicated do_jit_call routine, punch through the _out wrapper (model on the mini-generic-sharing.c code generator)
    • Optimize direct jit calls to turn the common ldloca sp + offset -> tnn.load pair into tnn.load offset
    • Optimize out passing of ftndesc arg to direct jit call wrappers, target and rgctx can be compiled in (not possible due to generic sharing)
  • Cache interpreter stack locals in wasm locals, then flush them back to the interpreter stack on exit
  • Cache non-volatile fields in wasm locals, then flush them back to the heap on exit
  • Threading support (incomplete draft to-do list)
    • Pre-grow function pointer table to a set size at startup in each thread
    • Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
    • When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
    • Thread-safe interpreter opcode patching
    • Thread-safe do_jit_call pointer/cache updates
  • Multi-trace optimizations
    • For traces with an offset other than 0 (large ones only?) attempt to reuse other existing traces?
    • Stop compiling traces when we encounter an already-compiled trace (likely function prologue -> loop body)
    • When we encounter an already compiled trace, call it directly from the current trace
  • Heuristic improvements
    • Don't put trace entry points too close together
    • If a trace is likely to conditionally abort early in its execution, don't insert an entry point (requires interpreter to mark blocks as unlikely if they contain a throw)
    • Add 'estimated cost' value for each opcode to mintops.def that estimates cost of running it in interp
    • Add estimated cost value for each jiterpreter opcode that estimates the quality of generated wasm code
    • Instead of using trace length heuristic, only keep traces where estimated jiterpreter cost <= interp cost
    • Ensure new system keeps short high value traces like Vector128.Add-with-SIMD
    • Improve estimated jiterpreter cost by factoring in the cost of entering a trace (measured on v8 and/or spidermonkey)
    • Factor in the lack of branch prediction when estimating cost of jiterpreter branches like null checks
    • Insert entry points periodically in very large basic blocks so that the jiterp can resume when a trace ends due to being too large
  • Control flow improvements
    • Basic backwards branch implementation
    • Implement CFG tracker that assembles module at the end
    • Eliminate branch block comparison(s) for forward branches
    • Eliminate branch block comparison(s) for backward branches
    • Don't generate dispatch table entries for branch targets that cannot be reached by backward branches
    • Don't generate a dispatch table if all back branches in a trace go to a single place
    • Identify cases where each back branch target is independent, and generate separate loops
    • Record each CALL_HANDLER target and use that to implement ENDFINALLY
    • When we emit an unconditional bailout, set a 'prune opcodes' flag and don't translate any unreachable opcodes after it until we hit a branch target block
    • Outline bailouts and exits to a shared return at the end of traces
    • Change all bailouts to be of the form if (cond) { br bailout_block } or br_if bailout_block
  • Monitoring phase improvements
    • Tune threshold
    • Generate a mapping table from return values (we know the possible set) to executed opcode or uop count
    • Set threshold in terms of opcodes or uops
    • Discard mapping table after monitoring phase
  • Store-to-load forwarding
    • If a series of opcodes read and then overwrite the same dreg, drop the store/load pair for the leading opcodes, e.g. a = b * 2; a = a + 1; (this turned out to make things slower in v8 for some reason, so the prototype won't land)
    • Use a wasm local instead of leave-on-stack
    • Fully optimize out stores and loads for cases where the dreg is only read once by leaving it on the wasm stack
    • Forward constants from their most recent store to load(s) that use them ([wasm] Add limited constant propagation to the jiterpreter for ldc.i4 and ldloca #99706)
  • Re-enable early trace abort with back branches active but only once a trace is long enough to justify it
  • Add typecheck-free version of stelem_ref (only possible for sealed types, must be generated in interp) ([mono] Add unchecked version of stelem_ref interpreter opcode #99829)
  • Update the msbuild targets to generate a single export arg to emcc instead of one per exported function
  • Ensure IEEE spec compliance for the f32 and f64 opcodes that rely on libc or wasm opcodes
  • Cache the this-reference (locals[0]) in a wasm local since it can't change
  • Zero region optimizations
    • Fuse null check and length check for arrays
    • Fuse null check and length check for strings
    • Fuse null check and length check for spans
    • Fuse null check and type check for MINT_CASTCLASS/MINT_ISINST
  • Interpreter integration
    • Move cpblk unrolling into interpreter superinsn pass as mint_cpblk_imm
    • Add new null-check-free versions of hot field opcodes
    • Add new information table tracking things like known not-null state per local that are exposed to jiterpreter
    • Consume information table from jiterpreter to do null check elimination
    • Optimize size of null check bitset as described in [wasm] Re-enable null check optimization for mid-method traces #84058 (comment)
    • Investigate migrating the trace generator into transform.c and doing it during the tiering process
    • If interpreter verbose is set for a method the jiterpreter should honor that
  • SIMD
  • Raise interpreter inlining limit to 30
    • Investigate raising it a bit further
  • Caching / PGO
    • Record a list of which methods are tiered in the interp so they can tier immediately on future runs
    • Record a list of which traces we compile so that we can compile them early on future runs
    • Cache jitted traces across page loads
    • Cache do_jit_call trampolines across page loads
    • Cache interp_entry wrappers across page loads
  • Make sure that call_handler/leave work correctly in the event that we bail out from a trace into the interp ([wasm] Jiterpreter implementation of CALL_HANDLER is incorrect #98577)
  • Cleanup
    • Remove most jiterp cprop once we can rely on the interpreter to do it, for correctness reasons
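
The cost-based items under "Heuristic improvements" (an estimated interp cost per opcode, an estimated jiterpreter cost per opcode, and a fixed cost for entering a trace) might combine roughly as in the sketch below. It is Python for brevity, not the runtime's own code. The opcode labels, every cost number, `TRACE_ENTRY_COST`, and the helper names are hypothetical placeholders, not measured values or existing APIs.

```python
# Hypothetical per-opcode cost tables. The numbers are placeholders, not measurements.
INTERP_COST = {"add": 4, "ldfld": 6, "null_check": 3, "simd_add": 20}
JITERP_COST = {"add": 1, "ldfld": 3, "null_check": 2, "simd_add": 2}

# Assumed fixed overhead of entering a compiled trace from the interpreter.
TRACE_ENTRY_COST = 8


def estimate_costs(opcodes):
    """Return (interp_cost, jiterp_cost) for a candidate trace."""
    interp = sum(INTERP_COST[op] for op in opcodes)
    jiterp = TRACE_ENTRY_COST + sum(JITERP_COST[op] for op in opcodes)
    return interp, jiterp


def should_keep_trace(opcodes):
    """Keep a trace only if its estimated compiled cost (including entry)
    is no higher than the estimated cost of interpreting it."""
    interp, jiterp = estimate_costs(opcodes)
    return jiterp <= interp


# A one-opcode SIMD trace: interp 20, jiterp 8 + 2 = 10, so it is kept.
print(should_keep_trace(["simd_add"]))
# Two cheap scalar adds: interp 8, jiterp 8 + 2 = 10, so it is rejected.
print(should_keep_trace(["add", "add"]))
```

With these made-up numbers a short but high-value trace survives where a pure trace-length threshold would have discarded it, which is the property the "keep short high value traces" item asks for.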

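The "prune opcodes" item under "Control flow improvements" can be pictured with the following minimal sketch. It assumes a trace is a flat list of `(offset, name)` pairs and that the set of branch target offsets is known up front; the `translate` function and the opcode labels are hypothetical and only model which opcodes would be translated.

```python
def translate(opcodes, branch_targets):
    """opcodes: list of (offset, name) pairs; branch_targets: set of offsets.
    Returns the names of the opcodes that would actually be translated."""
    emitted = []
    pruning = False
    for offset, name in opcodes:
        # A branch target can be reached from elsewhere, so translation resumes.
        if offset in branch_targets:
            pruning = False
        if pruning:
            continue
        emitted.append(name)
        # Nothing after an unconditional bailout is reachable by fall-through.
        if name == "bailout":
            pruning = True
    return emitted


trace = [(0, "ldloc"), (1, "bailout"), (2, "add"),
         (3, "stloc"), (4, "ldloc"), (5, "ret")]
print(translate(trace, branch_targets={4}))
# ['ldloc', 'bailout', 'ldloc', 'ret']
```

The opcodes at offsets 2 and 3 are skipped because they follow the bailout and no branch lands on them; offset 4 is a branch target, so translation picks up again there.
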
Archived items

  • Write a custom assembler and use it to generate and inline do-jit-call and simd detect modules (#81691)
  • Also unroll memcpy like memset
  • Investigate possible startup time regressions
  • Investigate possible .wasm size regressions
  • Update memmove unroller to ensure it does the correct thing for overlapping src/dest
  • Enable jiterpreter jitcall and interp_entry JITs by default
  • Enable jiterpreter traces by default
  • Don't bail out for safepoints
    • Do the 'is a safepoint needed' check inline in the trace instead of in the import
  • Inline strlen into traces
  • Inline getchr_ref into traces
  • Inline getitem_span into traces
  • Inline get_element_address_with_size_ref into traces
  • Optimize out the eip local and initialization for traces containing no branches
  • Generate import section after generating function body and omit unused imports
  • Do another pass over intrinsics and superinsns to add any missing ones (like the log2 used for vectorization)
  • Remove generated opcode info table and fetch opcode info from the interpreter's tables on demand to reduce file size
  • Don't discard known not-null / known constant information when crossing branches, only branch targets
  • Migrate configuration to options.h (requires improvements to the API)
  • Verify that no debugging scenarios regress
  • Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
  • Optimize out memory.fill for common sizes (it produces an expensive function call on x86 and x64)
  • Handle jiterpreter opcodes in non-wasm interp using the same path as other unreachable opcodes
  • Fix floating point compares in jiterpreter
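
The archived memmove item concerns overlapping source and destination ranges. The sketch below models linear memory as a Python `bytearray` to show why an unrolled copy that interleaves loads and stores can corrupt overlapping data, and one way to avoid it (performing all loads before any store). Both function names and the chunk size are hypothetical; this is not the jiterpreter's actual unroller.

```python
def unrolled_copy_naive(mem, dest, src, size, chunk=4):
    """Load and store one chunk at a time.
    Gives wrong results when the ranges overlap and dest > src."""
    for off in range(0, size, chunk):
        n = min(chunk, size - off)
        mem[dest + off:dest + off + n] = mem[src + off:src + off + n]


def unrolled_move(mem, dest, src, size, chunk=4):
    """Perform every load before any store, so overlap in either
    direction cannot clobber bytes that have not been read yet."""
    loaded = []
    for off in range(0, size, chunk):
        n = min(chunk, size - off)
        loaded.append((off, bytes(mem[src + off:src + off + n])))
    for off, data in loaded:
        mem[dest + off:dest + off + len(data)] = data


a = bytearray(b"abcdefgh____")
unrolled_copy_naive(a, dest=2, src=0, size=8)
print(a)  # bytearray(b'ababcdcdgh__'): bytes 4..7 were overwritten before being read

b = bytearray(b"abcdefgh____")
unrolled_move(b, dest=2, src=0, size=8)
print(b)  # bytearray(b'ababcdefgh__'): matches memmove semantics
```

Loading everything first only suits the small fixed sizes an unroller handles; larger moves would still go through the regular bulk-memory path.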
