Export the model:

```bash
git clone https://github.com/huggingface/optimum-executorch.git
cd optimum-executorch
pip install .

optimum-cli export executorch \
--model "google/medgemma-1.5-4b-it" \
--task "multimodal-text-to-text" \
--recipe "xnnpack" \
--device cpu \
--use_custom_sdpa \
--use_custom_kv_cache \
--qlinear 8da4w \
--qlinear_group_size 32 \
--qlinear_encoder "8da4w,8da8w" \
--qlinear_encoder_group_size 32 \
--qembedding "8w" \
--qembedding_encoder "8w" \
--output_dir="medgemma-1.5-4b-it-8da4w-executorch"local inference and verification
Local inference and verification:

```python
from optimum.executorch import ExecuTorchModelForMultiModalToText
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("google/medgemma-1.5-4b-it")
model = ExecuTorchModelForMultiModalToText.from_pretrained("medgemma-1.5-4b-it-8da4w-executorch")
messages = [{"role": "user", "content": [{"type": "text", "text": "Hi"}]}]
model.eos_token_id = 106  # <end_of_turn> for this tokenizer, so generation stops at the turn boundary
out = model.text_generation(
    processor, processor.tokenizer, messages, echo=False, max_seq_len=1024
)
print("Generated Text:\n", out)output
Output:

```text
W0127 18:23:21.123000 10266 .venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
<unused94>thought
Thinking Process:
1. **Identify the user's input:** The user simply said "Hi".
2. **Recognize the intent:** "Hi" is a greeting. It's a simple, common way to start a conversation or acknowledge someone's presence.
3. **Determine the appropriate response:** A polite and friendly greeting is expected in return.
4. **Formulate a suitable response:**
* Start with a reciprocal greeting: "Hi" or "Hello".
* Add a friendly follow-up: "How can I help you today?" or "What can I do for you?" or "How are you doing?" or "What's on your mind?"
5. **Select the best response:** "Hi there!" is a friendly start. Adding a helpful prompt like "How can I help you today?" or "What can I do for you?" is standard for an AI assistant.
6. **Combine and refine:** "Hi there! How can I help you today?" or "Hello! What can I do for you?" are both good options. "Hi there!" feels slightly more informal and friendly, matching the user's simple "Hi".
7. **Final Answer:** "Hi there! How can I help you today?"<unused95>Hi there! How can I help you today?<end_of_turn>
⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—e.g. SDPA vs. custom SDPA, FP16 vs. FP32—so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
• iOS: https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
• Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark
PyTorchObserver {"prompt_tokens": 10, "generated_tokens": 288, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1769518409191, "token_encode_end_ms": 1769518409197, "model_execution_start_ms": 0, "model_execution_end_ms": 0, "inference_end_ms": 1769518433113, "prompt_eval_end_ms": 1769518411351, "first_token_ms": 1769518411429, "aggregate_sampling_time_ms": 23827, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
Prompt Tokens: 10 Generated Tokens: 288
Model Load Time: 0.000000 (seconds)
Total inference time: 23.922000 (seconds) Rate: 12.039127 (tokens/second)
Prompt evaluation: 2.160000 (seconds) Rate: 4.629630 (tokens/second)
Generated 288 tokens: 21.762000 (seconds) Rate: 13.234078 (tokens/second)
Time to first generated token: 2.238000 (seconds)
Sampling time over 298 tokens: 23.827000 (seconds)
Generated Text:
thoughtThinking Process:1.**Identify the user's input:** The user simply said "Hi".2.**Recognize the intent:** "Hi" is a greeting. It's a simple, common way to start a conversation or acknowledge someone's presence.3.**Determine the appropriate response:** A polite and friendly greeting is expected in return.4.**Formulate a suitable response:***Start with a reciprocal greeting: "Hi" or "Hello".*Add a friendly follow-up: "How can I help you today?" or "What can I do for you?" or "How are you doing?" or "What's on your mind?"5.**Select the best response:** "Hi there!" is a friendly start. Adding a helpful prompt like "How can I help you today?" or "What can I do for you?" is standard for an AI assistant.6.**Combine and refine:** "Hi there! How can I help you today?" or "Hello! What can I do for you?" are both good options. "Hi there!" feels slightly more informal and friendly, matching the user's simple "Hi".7.**Final Answer:** "Hi there! How can I help you today?"Hi there! How can I help you today?
```
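The throughput numbers above can be recomputed straight from the `PyTorchObserver` JSON in the log (field names and values copied from it):

```python
# Recompute generation throughput from the PyTorchObserver line above.
import json

obs = json.loads(
    '{"prompt_tokens": 10, "generated_tokens": 288,'
    ' "prompt_eval_end_ms": 1769518411351, "inference_end_ms": 1769518433113}'
)
gen_s = (obs["inference_end_ms"] - obs["prompt_eval_end_ms"]) / 1000  # 21.762 s
print(f'{obs["generated_tokens"] / gen_s:.2f} tokens/s')  # ~13.23, matching the log
```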
Copy the model and tokenizer to the device:

```bash
hf download google/medgemma-1.5-4b-it --local-dir medgemma-1.5-4b-it
cp medgemma-1.5-4b-it/tokenizer.model medgemma-1.5-4b-it-8da4w-executorch/
adb shell mkdir -p /data/local/tmp/llama/
adb push medgemma-1.5-4b-it-8da4w-executorch/tokenizer.model /data/local/tmp/llama/
adb push medgemma-1.5-4b-it-8da4w-executorch/model.pte /data/local/tmp/llama/
```
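Before launching the app, a quick check that the artifacts actually landed on the device (a minimal sketch; assumes `adb` is on PATH and a device is connected):

```python
# List the pushed files on the device (paths from the adb push commands above).
import subprocess

listing = subprocess.run(
    ["adb", "shell", "ls", "-lh", "/data/local/tmp/llama/"],
    capture_output=True, text=True, check=True,
).stdout
print(listing)  # expect model.pte and tokenizer.model
```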
On-device generation is not working as expected. Following the same steps for gemma-3-4b-it works as expected.

Screen_recording_20260127_182610.mp4