The jiterpreter (#76477) still has the following pending work:
- Introduce a Jiterpreter CI lane that sets all the tiering thresholds low so that we flush out any issues with obscure interp opcodes or cold code
- Investigate integrating jit calls directly into compiled traces
- Investigate integrating icalls directly into compiled traces
- Run statistics on Blazor applications once the jiterpreter is integrated, to identify any instructions that need to be added
- Remove more unnecessary transition/wrapper glue from do_jit_call and interp_entry paths, as seen here:
- Maintain a table of vtable slots containing interp_entry _in wrappers, then patch the vtables (design pending)
- When generating a dedicated do_jit_call routine, punch through the _out wrapper (model on the mini-generic-sharing.c code generator)
- Optimize direct jit calls to turn the common ldloca sp + offset -> tnn.load pair into tnn_load offset
- Optimize out passing of ftndesc arg to direct jit call wrappers, target and rgctx can be compiled in (not possible due to generic sharing)
- Cache interpreter stack locals in wasm locals, then flush them back to the interpreter stack on exit
- Cache non-volatile fields in wasm locals, then flush them back to the heap on exit
- Threading support (incomplete draft to-do list)
- Pre-grow function pointer table to a set size at startup in each thread
- Ensure empty function pointer slots are filled with appropriate 'dummy' functions so that threads will not crash when calling them
- When jitting a new function, RPC the wasm blob or compiled module to threads so they can register the pointer
- Thread-safe interpreter opcode patching
- Thread-safe do_jit_call pointer/cache updates
- Multi-trace optimizations
- For traces with an offset other than 0 (large ones only?) attempt to reuse other existing traces?
- Stop compiling traces when we encounter an already-compiled trace (likely function prologue -> loop body)
- When we encounter an already compiled trace, call it directly from the current trace
- Heuristic improvements
- Don't put trace entry points too close together
- If a trace is likely to conditionally abort early in its execution, don't insert an entry point (requires interpreter to mark blocks as unlikely if they contain a throw)
- Add 'estimated cost' value for each opcode to mintops.def that estimates cost of running it in interp
- Add estimated cost value for each jiterpreter opcode that estimates the quality of generated wasm code
- Instead of using trace length heuristic, only keep traces where estimated jiterpreter cost <= interp cost
- Ensure new system keeps short high value traces like Vector128.Add-with-SIMD
- Improve estimated jiterpreter cost by factoring in the cost of entering a trace (measured on V8 and/or SpiderMonkey)
- Factor in the lack of branch prediction when estimating cost of jiterpreter branches like null checks
- Insert entry points periodically in very large basic blocks so that the jiterp can resume when a trace ends due to being too large
- Control flow improvements
- Basic backwards branch implementation
- Implement CFG tracker that assembles module at the end
- Eliminate branch block comparison(s) for forward branches
- Eliminate branch block comparison(s) for backward branches
- Don't generate dispatch table entries for branch targets that cannot be reached by backward branches
- Don't generate a dispatch table if all back branches in a trace go to a single place
- Identify cases where each back branch target is independent, and generate separate loops
- Record each CALL_HANDLER target and use that to implement ENDFINALLY
- When we emit an unconditional bailout, set a 'prune opcodes' flag and don't translate any unreachable opcodes after it until we hit a branch target block
- Outline bailouts and exits to a shared return at the end of traces
- Change all bailouts to be of the form if (cond) { br bailout_block } or br_if bailout_block
- Monitoring phase improvements
- Tune threshold
- Generate a mapping table from return values (we know the possible set) to executed opcode or uop count
- Set threshold in terms of opcodes or uops
- Discard mapping table after monitoring phase
- Store-to-load forwarding
- If a series of opcodes r/w overwrite a dreg, drop the store/load pair for the leading opcodes, e.g. a = b * 2; a = a + 1; (this turns out to make things slower in V8 for some reason, so prototype won't land)
- Use a wasm local instead of leave-on-stack
- Fully optimize out stores and loads for cases where the dreg is only read once by leaving it on the wasm stack
- Forward constants from their most recent store to load(s) that use them ([wasm] Add limited constant propagation to the jiterpreter for ldc.i4 and ldloca #99706)
- Re-enable early trace abort with back branches active but only once a trace is long enough to justify it
- Add typecheck-free version of stelem_ref (only possible for sealed types, must be generated in interp) ([mono] Add unchecked version of stelem_ref interpreter opcode #99829)
- Update the MSBuild targets to generate a single export arg to emcc instead of one per exported function
- Ensure IEEE spec compliance for the f32 and f64 opcodes that rely on libc or wasm opcodes
- Cache the this-reference (locals[0]) in a wasm local since it can't change
- Zero region optimizations
- Fuse null check and length check for arrays
- Fuse null check and length check for strings
- Fuse null check and length check for spans
- Fuse null check and type check for MINT_CASTCLASS/MINT_ISINST
- Interpreter integration
- Move cpblk unrolling into interpreter superinsn pass as mint_cpblk_imm
- Add new null-check-free versions of hot field opcodes
- Add new information table tracking things like known not-null state per local that are exposed to jiterpreter
- Consume information table from jiterpreter to do null check elimination
- Optimize size of null check bitset as described in [wasm] Re-enable null check optimization for mid-method traces #84058 (comment)
- Investigate migrating the trace generator into transform.c and doing it during the tiering process
- If interpreter verbose is set for a method the jiterpreter should honor that
- SIMD
- Raise interpreter inlining limit to 30
- Investigate raising it a bit further
- Caching / PGO
- Record a list of which methods are tiered in the interp so they can tier immediately on future runs
- Record a list of which traces we compile so that we can compile them early on future runs
- Cache jitted traces across page loads
- Cache do_jit_call trampolines across page loads
- Cache interp_entry wrappers across page loads
- Make sure that call_handler/leave work correctly in the event that we bail out from a trace into the interp ([wasm] Jiterpreter implementation of CALL_HANDLER is incorrect #98577)
- Cleanup
- Remove most jiterp cprop once we can rely on the interpreter to do it, for correctness reasons
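
The cost-based trace selection listed under "Heuristic improvements" (a per-opcode interp cost, a per-opcode jiterpreter cost, a trace entry cost, and keeping only traces where the jiterpreter estimate does not exceed the interp estimate) could look roughly like the sketch below. It is illustrative only: the opcode names are placeholders rather than real mintops.def entries, every cost number is invented, and `estimate_costs`, `should_keep_trace` and `TRACE_ENTRY_COST` are hypothetical names, not existing jiterpreter code.

```python
# Placeholder opcode names and invented cost numbers, for illustration only.
# In the proposal the interp costs would live alongside each opcode in mintops.def.
INTERP_COST = {"add_i4": 4, "ldfld_i4": 6, "br": 3, "simd_add": 40}
JITERP_COST = {"add_i4": 1, "ldfld_i4": 3, "br": 4, "simd_add": 2}

# Assumed fixed cost of entering a compiled trace (would be measured on a real engine).
TRACE_ENTRY_COST = 10


def estimate_costs(opcodes):
    """Return (interp_cost, jiterp_cost) for a candidate trace."""
    interp = sum(INTERP_COST[op] for op in opcodes)
    jiterp = TRACE_ENTRY_COST + sum(JITERP_COST[op] for op in opcodes)
    return interp, jiterp


def should_keep_trace(opcodes):
    """Keep the trace only if the estimated jiterpreter cost <= interp cost."""
    interp, jiterp = estimate_costs(opcodes)
    return jiterp <= interp
```

With these made-up numbers, a one-opcode trace containing only `simd_add` is kept (12 vs 40), which is the "short high value trace" case, while two `add_i4` opcodes are rejected (12 vs 8) because the entry cost dominates.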
Archived items
- Write a custom assembler and use it to generate and inline do-jit-call and simd detect modules (#81691)
- Also unroll memcpy like memset
- Investigate possible startup time regressions
- Investigate possible .wasm size regressions
- Update memmove unroller to ensure it does the correct thing for overlapping src/dest
- Enable jiterpreter jitcall and interp_entry JITs by default
- Enable jiterpreter traces by default
- Don't bail out for safepoints
- Do the 'is a safepoint needed' check inline in the trace instead of in the import
- Inline strlen into traces
- Inline getchr_ref into traces
- Inline getitem_span into traces
- Inline get_element_address_with_size_ref into traces
- Optimize out the eip local and initialization for traces containing no branches
- Generate import section after generating function body and omit unused imports
- Do another pass over intrinsics and superinsns to add any missing ones (like the log2 used for vectorization)
- Remove generated opcode info table and fetch opcode info from the interpreter's tables on demand to reduce file size
- Don't discard known not-null / known constant information when crossing branches, only branch targets
- Migrate configuration to options.h (requires improvements to the API)
- Verify that no debugging scenarios regress
- Better error handling for jiterpreter runtime failures (shut them off after a handful of JIT failures to avoid spamming the console and wasting CPU time)
- Optimize out memory.fill for common sizes (it produces an expensive function call on x86 and x64)
- Handle jiterpreter opcodes in non-wasm interp using the same path as other unreachable opcodes
- Fix floating point compares in jiterpreter
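
To illustrate why the archived memmove unroller item had to consider overlapping src/dest: an unrolled copy that loads and stores one chunk at a time, front to back, can overwrite source bytes before it reads them. The sketch below models linear memory as a `bytearray` and compares that naive form with one overlap-safe alternative (load every chunk first, then store). Both function names and the 4-byte chunk size are assumptions for illustration; this is not a claim about how the jiterpreter's unroller is actually implemented.

```python
def unrolled_copy_naive(mem, dest, src, count, chunk=4):
    """Copy chunk by chunk, front to back; wrong when dest overlaps src from above."""
    for off in range(0, count, chunk):
        n = min(chunk, count - off)
        mem[dest + off:dest + off + n] = mem[src + off:src + off + n]


def unrolled_copy_safe(mem, dest, src, count, chunk=4):
    """Load every chunk before storing any, so overlap cannot corrupt the source."""
    loaded = [
        bytes(mem[src + off:src + off + min(chunk, count - off)])
        for off in range(0, count, chunk)
    ]
    off = 0
    for piece in loaded:
        mem[dest + off:dest + off + len(piece)] = piece
        off += len(piece)


# Overlapping move of 8 bytes from offset 0 to offset 2.
expected = bytearray([0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 10, 11])

naive_mem = bytearray(range(12))
unrolled_copy_naive(naive_mem, 2, 0, 8)

safe_mem = bytearray(range(12))
unrolled_copy_safe(safe_mem, 2, 0, 8)
```

Here `safe_mem` matches `expected`, while `naive_mem` does not, because the second chunk reads bytes that the first chunk already overwrote.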