
medgemma is not working on android; working as expected locally with optimum-executorch #185

@kamalkraj

Description

@kirklandsign

git clone https://github.com/huggingface/optimum-executorch.git
cd optimum-executorch
pip install .
optimum-cli export executorch \
  --model "google/medgemma-1.5-4b-it" \
  --task "multimodal-text-to-text" \
  --recipe "xnnpack" \
  --device cpu \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear 8da4w \
  --qlinear_group_size 32 \
  --qlinear_encoder "8da4w,8da8w" \
  --qlinear_encoder_group_size 32 \
  --qembedding "8w" \
  --qembedding_encoder "8w" \
  --output_dir="medgemma-1.5-4b-it-8da4w-executorch"

Local inference and verification

from optimum.executorch import ExecuTorchModelForMultiModalToText
from transformers import AutoProcessor

# Load the processor from the Hub and the exported ExecuTorch program from disk.
processor = AutoProcessor.from_pretrained("google/medgemma-1.5-4b-it")
model = ExecuTorchModelForMultiModalToText.from_pretrained("medgemma-1.5-4b-it-8da4w-executorch")

messages = [{"role": "user", "content": [{"type": "text", "text": "Hi"}]}]
# Token id 106 is <end_of_turn> in the Gemma tokenizer, so generation stops at the end of the turn.
model.eos_token_id = 106
out = model.text_generation(
    processor, processor.tokenizer, messages, echo=False, max_seq_len=1024
)

print("Generated Text:\n", out)

Output

W0127 18:23:21.123000 10266 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
<unused94>thought
Thinking Process:

1.  **Identify the user's input:** The user simply said "Hi".
2.  **Recognize the intent:** "Hi" is a greeting. It's a simple, common way to start a conversation or acknowledge someone's presence.
3.  **Determine the appropriate response:** A polite and friendly greeting is expected in return.
4.  **Formulate a suitable response:**
    *   Start with a reciprocal greeting: "Hi" or "Hello".
    *   Add a friendly follow-up: "How can I help you today?" or "What can I do for you?" or "How are you doing?" or "What's on your mind?"
5.  **Select the best response:** "Hi there!" is a friendly start. Adding a helpful prompt like "How can I help you today?" or "What can I do for you?" is standard for an AI assistant.
6.  **Combine and refine:** "Hi there! How can I help you today?" or "Hello! What can I do for you?" are both good options. "Hi there!" feels slightly more informal and friendly, matching the user's simple "Hi".
7.  **Final Answer:** "Hi there! How can I help you today?"<unused95>Hi there! How can I help you today?<end_of_turn>
⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 10, "generated_tokens": 288, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1769518409191, "token_encode_end_ms": 1769518409197, "model_execution_start_ms": 0, "model_execution_end_ms": 0, "inference_end_ms": 1769518433113, "prompt_eval_end_ms": 1769518411351, "first_token_ms": 1769518411429, "aggregate_sampling_time_ms": 23827, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
        Prompt Tokens: 10 Generated Tokens: 288
        Model Load Time:                0.000000 (seconds)
        Total inference time:           23.922000 (seconds)              Rate:  12.039127 (tokens/second)
                Prompt evaluation:      2.160000 (seconds)               Rate:  4.629630 (tokens/second)
                Generated 288 tokens:   21.762000 (seconds)              Rate:  13.234078 (tokens/second)
        Time to first generated token:  2.238000 (seconds)
        Sampling time over 298 tokens:  23.827000 (seconds)
Generated Text:
 thoughtThinking Process:1.**Identify the user's input:** The user simply said "Hi".2.**Recognize the intent:** "Hi" is a greeting. It's a simple, common way to start a conversation or acknowledge someone's presence.3.**Determine the appropriate response:** A polite and friendly greeting is expected in return.4.**Formulate a suitable response:***Start with a reciprocal greeting: "Hi" or "Hello".*Add a friendly follow-up: "How can I help you today?" or "What can I do for you?" or "How are you doing?" or "What's on your mind?"5.**Select the best response:** "Hi there!" is a friendly start. Adding a helpful prompt like "How can I help you today?" or "What can I do for you?" is standard for an AI assistant.6.**Combine and refine:** "Hi there! How can I help you today?" or "Hello! What can I do for you?" are both good options. "Hi there!" feels slightly more informal and friendly, matching the user's simple "Hi".7.**Final Answer:** "Hi there! How can I help you today?"Hi there! How can I help you today?
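
The printed rates follow directly from the timestamps in the PyTorchObserver line above; a small sketch recomputing them (field names and values copied verbatim from that JSON):

import json

# Timestamps (ms) copied from the PyTorchObserver line above.
obs = json.loads(
    '{"prompt_tokens": 10, "generated_tokens": 288,'
    ' "inference_start_ms": 1769518409191,'
    ' "prompt_eval_end_ms": 1769518411351,'
    ' "first_token_ms": 1769518411429,'
    ' "inference_end_ms": 1769518433113}'
)

total_s = (obs["inference_end_ms"] - obs["inference_start_ms"]) / 1000     # 23.922
prompt_s = (obs["prompt_eval_end_ms"] - obs["inference_start_ms"]) / 1000  # 2.160
decode_s = total_s - prompt_s                                              # 21.762
ttft_s = (obs["first_token_ms"] - obs["inference_start_ms"]) / 1000        # 2.238

print(f"prompt eval: {obs['prompt_tokens'] / prompt_s:.2f} tok/s")    # ~4.63
print(f"decode:      {obs['generated_tokens'] / decode_s:.2f} tok/s") # ~13.23
print(f"TTFT:        {ttft_s:.3f} s")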

Copy the model and tokenizer

hf download google/medgemma-1.5-4b-it --local-dir medgemma-1.5-4b-it
cp medgemma-1.5-4b-it/tokenizer.model medgemma-1.5-4b-it-8da4w-executorch/

adb shell mkdir -p /data/local/tmp/llama/
adb push medgemma-1.5-4b-it-8da4w-executorch/tokenizer.model /data/local/tmp/llama/
adb push medgemma-1.5-4b-it-8da4w-executorch/model.pte /data/local/tmp/llama/ 
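
Before launching the app, a quick check that both artifacts landed on the device (a sketch driving the same adb as above from Python):

import subprocess

# List the pushed files; raises if adb or the device is unavailable.
subprocess.run(["adb", "shell", "ls", "-l", "/data/local/tmp/llama/"], check=True)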

On-device generation is not working as expected. Following the same steps with gemma3-4b-it works as expected.

Screen_recording_20260127_182610.mp4
