Export the model:

```bash
git clone https://github.com/huggingface/optimum-executorch.git
cd optimum-executorch
pip install .

optimum-cli export executorch \
--model "google/medgemma-1.5-4b-it" \
--task "multimodal-text-to-text" \
--recipe "xnnpack" \
--device cpu \
--use_custom_sdpa \
--use_custom_kv_cache \
--qlinear 8da4w \
--qlinear_group_size 32 \
--qlinear_encoder "8da4w,8da8w" \
--qlinear_encoder_group_size 32 \
--qembedding "8w" \
--qembedding_encoder "8w" \
--output_dir="medgemma-1.5-4b-it-8da4w-executorch"local inference and verification
Local inference and verification:

```python
from optimum.executorch import ExecuTorchModelForMultiModalToText
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("google/medgemma-1.5-4b-it")
model = ExecuTorchModelForMultiModalToText.from_pretrained("medgemma-1.5-4b-it-8da4w-executorch")
messages = [{"role": "user", "content": [{"type": "text", "text": "Hi"}]}]
model.eos_token_id = 106  # <end_of_turn> for this tokenizer, so generation stops at the turn boundary
out = model.text_generation(
    processor, processor.tokenizer, messages, echo=False, max_seq_len=1024
)
print("Generated Text:\n", out)output
Output:

```text
W0127 18:23:21.123000 10266 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
<unused94>thought
Thinking Process:
1. **Identify the user's input:** The user simply said "Hi".
2. **Recognize the intent:** "Hi" is a greeting. It's a simple, common way to start a conversation or acknowledge someone's presence.
3. **Determine the appropriate response:** A polite and friendly greeting is expected in return.
4. **Formulate a suitable response:**
* Start with a reciprocal greeting: "Hi" or "Hello".
* Add a friendly follow-up: "How can I help you today?" or "What can I do for you?" or "How are you doing?" or "What's on your mind?"
5. **Select the best response:** "Hi there!" is a friendly start. Adding a helpful prompt like "How can I help you today?" or "What can I do for you?" is standard for an AI assistant.
6. **Combine and refine:** "Hi there! How can I help you today?" or "Hello! What can I do for you?" are both good options. "Hi there!" feels slightly more informal and friendly, matching the user's simple "Hi".
7. **Final Answer:** "Hi there! How can I help you today?"<unused95>Hi there! How can I help you today?<end_of_turn>
⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—e.g. SDPA vs. custom SDPA, FP16 vs. FP32—so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
• iOS: https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
• Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark
PyTorchObserver {"prompt_tokens": 10, "generated_tokens": 288, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1769518409191, "token_encode_end_ms": 1769518409197, "model_execution_start_ms": 0, "model_execution_end_ms": 0, "inference_end_ms": 1769518433113, "prompt_eval_end_ms": 1769518411351, "first_token_ms": 1769518411429, "aggregate_sampling_time_ms": 23827, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
Prompt Tokens: 10 Generated Tokens: 288
Model Load Time: 0.000000 (seconds)
Total inference time: 23.922000 (seconds) Rate: 12.039127 (tokens/second)
Prompt evaluation: 2.160000 (seconds) Rate: 4.629630 (tokens/second)
Generated 288 tokens: 21.762000 (seconds) Rate: 13.234078 (tokens/second)
Time to first generated token: 2.238000 (seconds)
Sampling time over 298 tokens: 23.827000 (seconds)
Generated Text:
thoughtThinking Process:1.**Identify the user's input:** The user simply said "Hi".2.**Recognize the intent:** "Hi" is a greeting. It's a simple, common way to start a conversation or acknowledge someone's presence.3.**Determine the appropriate response:** A polite and friendly greeting is expected in return.4.**Formulate a suitable response:***Start with a reciprocal greeting: "Hi" or "Hello".*Add a friendly follow-up: "How can I help you today?" or "What can I do for you?" or "How are you doing?" or "What's on your mind?"5.**Select the best response:** "Hi there!" is a friendly start. Adding a helpful prompt like "How can I help you today?" or "What can I do for you?" is standard for an AI assistant.6.**Combine and refine:** "Hi there! How can I help you today?" or "Hello! What can I do for you?" are both good options. "Hi there!" feels slightly more informal and friendly, matching the user's simple "Hi".7.**Final Answer:** "Hi there! How can I help you today?"Hi there! How can I help you today?
```
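The throughput numbers above can be recomputed straight from the `PyTorchObserver` JSON in the log (field names and values copied from it):

```python
# Recompute generation throughput from the PyTorchObserver line above.
import json

obs = json.loads(
    '{"prompt_tokens": 10, "generated_tokens": 288,'
    ' "prompt_eval_end_ms": 1769518411351, "inference_end_ms": 1769518433113}'
)
gen_s = (obs["inference_end_ms"] - obs["prompt_eval_end_ms"]) / 1000  # 21.762 s
print(f'{obs["generated_tokens"] / gen_s:.2f} tokens/s')  # ~13.23, matching the log
```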
Copy the model and tokenizer to the device:

```bash
hf download google/medgemma-1.5-4b-it --local-dir medgemma-1.5-4b-it
cp medgemma-1.5-4b-it/tokenizer.model medgemma-1.5-4b-it-8da4w-executorch/
adb shell mkdir -p /data/local/tmp/llama/
adb push medgemma-1.5-4b-it-8da4w-executorch/tokenizer.model /data/local/tmp/llama/
adb push medgemma-1.5-4b-it-8da4w-executorch/model.pte /data/local/tmp/llama/
```
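Before launching the app, a quick check that the artifacts actually landed on the device (a minimal sketch; assumes `adb` is on PATH and a device is connected):

```python
# List the pushed files on the device (paths from the adb push commands above).
import subprocess

listing = subprocess.run(
    ["adb", "shell", "ls", "-lh", "/data/local/tmp/llama/"],
    capture_output=True, text=True, check=True,
).stdout
print(listing)  # expect model.pte and tokenizer.model
```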
On-device generation is not working as expected. Following the same steps for gemma-3-4b-it works as expected.

Screen_recording_20260127_182610.mp4