Enabling of the Performance model for NNPA / z17#3383
AlexandreEichenberger wants to merge 29 commits into onnx:main
Conversation
Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>
Disabled the default cost benefit model for the affected tests. @tungld, let me know if there are others. Thanks.
Also ran Granite 3 and Granite Embedding with and without the new placement heuristics and verified that we have no regressions. It made a small change for embedding (a few scalar ops were assigned to CPU instead of NNPA; the perf difference is negligible, but it is a better use of resources, i.e., not using the NNPA for a scalar).
I'm a bit hesitant about disabling the default cost model. The current default is simple, which makes it useful for debugging and pinpointing issues. Unless there is a clear benefit to switching to another cost model (e.g., measurable performance improvement or reduced memory usage), I would prefer to keep the current one as the default and suggest that users opt into more advanced models only when necessary. Regarding the scalar sqrt issue, I am investigating whether I can make it work with minimal changes to the current cost model.
Fair point... note that currently we have NO cost model... I could split this PR into two: one that adds the new support (which should not be controversial), and one that switches the default policy. While I agree that sending a scalar to the zAIU or not is not a performance issue (both paths are very fast because the data is small), I still see it as wasting hardware to send a scalar to the zAIU. So maybe we can augment the default policy to disable NNPA for operations with fewer than X (say 100) data points.
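To make the suggestion concrete, here is a minimal sketch of such a size-based qualification check. The function name, the shape representation, and the 100-element threshold are illustrative only (the threshold would presumably be tuned per machine generation); this is not actual onnx-mlir code.

```cpp
#include <cstdint>
#include <vector>

// Illustrative threshold only, taken from the "say 100" suggestion above.
constexpr int64_t kMinNNPAElements = 100;

// Hypothetical helper: returns true when an op's tensor is large enough to
// amortize the cost of shipping data to the NNPA; `shape` holds the static
// dimensions of the tensor.
bool qualifiesForNNPA(const std::vector<int64_t> &shape) {
  int64_t numElements = 1;
  for (int64_t d : shape)
    numElements *= d;
  return numElements >= kMinNNPAElements;
}
```

Under this sketch, a scalar (one element) would stay on the CPU, while a 10x10 tensor would still qualify for the NNPA.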
@AlexandreEichenberger I created PR #3393 to not send scalar ops to NNPA. |
@tungld FYI, I recall that we got a 10% improvement on a fraud model that had small vector sizes when enabling the cost models.
Thanks. That looks good: the model has only one stickification for the input and one unstickification for the output, but the cost model can still find a better one. Could you add a lit test for this model?
tungld left a comment:
LGTM!
Could you change all the lit tests to use your new default cost model instead of adding --nnpa-placement-heuristic=QualifyingOps? Since --nnpa-placement-heuristic=QualifyingOps is no longer the default with your PR, testing with it does not capture the real onnx-mlir command; this way we would have lit tests for the default setting.
Also, could you please add lit tests for the boundary, say: with size XXX it runs on CPU, and with size XXX+1 it runs on NNPA? I think the boundaries for z16 and z17 are perhaps different, and they likely also differ when using multiple threads versus not.
Sorry for many requests.
In Granite Embedding, I noticed that we are sending a scalar SQRT to the NNPA, because the current policy is to send any legal operation to the NNPA.
After adding support for the new operations missing in z17 (sqrt & leaky relu), I noticed that some patterns transforming ONNX ops to ZHigh ops were firing without consideration for placement (i.e., the pass that determines where ops should run, NNPA vs. CPU). So I added some additional conditions to the rules for the ONNX -> ZHigh transition.
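The idea of such a condition can be illustrated with a plain C++ stand-in (this is not actual MLIR pattern code; the struct and function names are hypothetical): the ONNX -> ZHigh rewrite should only fire when the placement pass has marked the op for the NNPA, otherwise the op stays in ONNX form for the CPU path.

```cpp
#include <string>

// Hypothetical stand-in for an op carrying a placement decision, e.g. the
// device attribute assigned by the device-placement pass.
struct OpStub {
  std::string device; // e.g. "nnpa" or "cpu"
};

// Guard to add to a rewrite rule: lower to ZHigh only for NNPA-placed ops.
bool shouldLowerToZHigh(const OpStub &op) {
  return op.device == "nnpa";
}
```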
I also slightly modified the handling of the expected values of dynamic dimensions when they are plugged into the CPU and NNPA performance models.
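A minimal sketch of what "expected values for dynamic dimensions" could mean in a cost model (the function name, the -1 encoding, and the default extent of 32 are all assumptions for illustration, not onnx-mlir's actual values): dynamic dimensions are replaced by an assumed extent so a cost estimate can still be computed at compile time.

```cpp
#include <cstdint>
#include <vector>

// Assumed encoding: -1 marks a dynamic (unknown at compile time) dimension.
constexpr int64_t kDynamicDim = -1;
// Illustrative default extent substituted for dynamic dimensions.
constexpr int64_t kAssumedDynamicExtent = 32;

// Hypothetical helper feeding the performance models: estimate the element
// count of a tensor, substituting the expected value for dynamic dims.
int64_t estimatedNumElements(const std::vector<int64_t> &shape) {
  int64_t n = 1;
  for (int64_t d : shape)
    n *= (d == kDynamicDim) ? kAssumedDynamicExtent : d;
  return n;
}
```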