Update llama_serving.md doc to reflect the latest changes #1386

pravg-amd · 2025-05-05T13:36:15Z

Update the documented shortfin response format, which now contains the entire JSON object instead of just the response text.
Update the export and compilation commands for sharded models.

Signed-off-by: Praveen G <[email protected]>

stbaione · 2025-05-05T14:51:04Z

docs/shortfin/llm/user/llama_serving.md

@@ -404,7 +409,11 @@ iree-compile /path/to/output/llama3.1-405b.mlir \
  --iree-hal-target-device=hip[5] \
  --iree-hal-target-device=hip[6] \
  --iree-hal-target-device=hip[7] \
-  --iree-hip-target=gfx942
+  --iree-hip-target=gfx942 \
+  --iree-opt-level=O3 \


Do we know why these extra flags are required now for the compile?

I don't know exactly why these optimization flags are needed. Without them, I was running into the following error:

ValueError: shortfin_iree-src/runtime/src/iree/vm/bytecode/verifier.c:344: RESOURCE_EXHAUSTED; register count overflow

These flags are used in the llama benchmark tests in sharktank. Maybe @AmosLewis can provide more insight into this?
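
For reference, the sharded compile command discussed above could be assembled roughly as follows. This is only a dry-run sketch: `build_compile_cmd` is a hypothetical helper, the device count is an assumed parameter (the hunk only shows `hip[5]` through `hip[7]`), and it includes only the flags visible in the hunk. The hunk is cut off after `--iree-opt-level=O3`, so any further flags and the output path are intentionally left out.

```shell
#!/bin/sh
# Dry-run sketch: prints an iree-compile command line; it does NOT run the compiler.
# build_compile_cmd is a hypothetical helper, not part of IREE or shortfin.
# Only flags visible in the review hunk are emitted; flags after
# --iree-opt-level=O3 are truncated in the diff and are not guessed here.

build_compile_cmd() {
  mlir_path=$1     # path to the exported sharded MLIR
  num_devices=$2   # assumed: one HIP device per shard

  cmd="iree-compile $mlir_path"

  # One --iree-hal-target-device=hip[N] flag per device, N = 0..num_devices-1.
  i=0
  while [ "$i" -lt "$num_devices" ]; do
    cmd="$cmd --iree-hal-target-device=hip[$i]"
    i=$((i + 1))
  done

  # Target architecture and optimization level as shown in the hunk.
  cmd="$cmd --iree-hip-target=gfx942 --iree-opt-level=O3"

  printf '%s\n' "$cmd"
}

# Example: print the command for an 8-device setup (8 is an assumption).
build_compile_cmd /path/to/output/llama3.1-405b.mlir 8
```

Printing the command instead of running it makes it easy to diff the flag set against the one used in the sharktank benchmark tests before committing to a long compile.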

Update llama_serving.md doc to reflect the latest changes

9562f1d

pravg-amd mentioned this pull request May 5, 2025

Shark May-25 release nod-ai/SHARK-ModelDev#951

Open

stbaione reviewed May 5, 2025

pravg-amd replied May 6, 2025
