# Google QA
* How do we supply models to the device, and how do we discover which models are available on it?
* What is the status of tool-calling support in the models?
* Why do we need a delay before closing a session, and how long should it be? (`delay(settings.closeSessionDelay)` — see the sketch after this list)
* How do we check whether the GPU is busy; how many models can run in parallel (and what tooling exists for this); how do we query a model's GPU availability and resources; and can model state be shared across apps?
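For context on the delay question, the code it refers to has roughly the following shape. This is a hypothetical reconstruction: `Settings` is an assumed holder type and the millisecond unit is an assumption; `LlmInferenceSession` is MediaPipe's real session class.

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession
import kotlinx.coroutines.delay

// Hypothetical reconstruction of the code the question refers to.
// `Settings` is an assumed type; milliseconds is an assumption too.
data class Settings(val closeSessionDelay: Long)

// The open question for Google: why must we wait before closing the
// session, and what is the minimum safe value for this delay?
suspend fun closeSessionSafely(session: LlmInferenceSession, settings: Settings) {
    delay(settings.closeSessionDelay)
    session.close()
}
```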
# Local models
LiteRT (formerly TensorFlow Lite) is Google's framework for on-device model inference: https://github.com/google-ai-edge/LiteRT (a C++ runtime library).
To use the framework, a model must first be converted to the LiteRT format.
The litert-community organization on Hugging Face hosts already-converted models: https://huggingface.co/litert-community
The problem is that the only model there that supports tool calling is Hammer: https://huggingface.co/litert-community/Hammer2.1-1.5b

Android has a library, MediaPipe, for working with LiteRT from Java (via JNI):
* https://github.com/google-ai-edge/mediapipe
* https://github.com/google-ai-edge/mediapipe/tree/master/mediapipe/tasks/java/com/google/mediapipe/tasks/genai/llminference
And the documentation on how to use it:
* https://ai.google.dev/edge/mediapipe/solutions/guide
* https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/android
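Going by those docs, basic usage from Kotlin looks roughly like this. A minimal sketch: the model path and file name are assumptions (any converted `.task` bundle already on the device works), and all options beyond `setModelPath`/`setMaxTokens` are omitted.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch of the MediaPipe LLM Inference API.
// The model path is an assumption: any LiteRT .task bundle already
// on the device (e.g. pushed with adb) can be used here.
fun askLocalModel(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/Gemma3-1B-IT_q8_ekv2048.task")
        .setMaxTokens(512) // combined prompt + response token budget
        .build()

    // Creating the engine loads the model and is expensive;
    // a real app should create it once and reuse it.
    val llm = LlmInference.createFromOptions(context, options)
    try {
        return llm.generateResponse(prompt)
    } finally {
        llm.close()
    }
}
```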
For tool calling, the AI Edge local agents extension is used:
* https://github.com/google-ai-edge/ai-edge-apis/tree/main/local_agents
* https://ai.google.dev/edge/mediapipe/solutions/genai/function_calling/android
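The function-calling SDK wraps the same LlmInference engine and adds tool declarations plus a model-specific prompt formatter. The sketch below follows the shape of the Android docs linked above, but the class and method names are reproduced from memory and should be treated as assumptions to verify against those docs.

```kotlin
import com.google.ai.edge.localagents.core.proto.FunctionDeclaration
import com.google.ai.edge.localagents.core.proto.Schema
import com.google.ai.edge.localagents.core.proto.Tool
import com.google.ai.edge.localagents.core.proto.Type
import com.google.ai.edge.localagents.fc.GenerativeModel
import com.google.ai.edge.localagents.fc.HammerFormatter
import com.google.ai.edge.localagents.fc.LlmInferenceBackend
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch only — check every name here against the function-calling docs above.
fun demoFunctionCalling(llmInference: LlmInference) {
    // Declare one function the model is allowed to call.
    val getWeather = FunctionDeclaration.newBuilder()
        .setName("get_weather")
        .setDescription("Returns the current weather for a city.")
        .setParameters(
            Schema.newBuilder()
                .setType(Type.OBJECT)
                .putProperties("city", Schema.newBuilder().setType(Type.STRING).build()))
        .build()
    val weatherTool = Tool.newBuilder().addFunctionDeclarations(getWeather).build()

    // Wrap the MediaPipe engine (created as in the previous snippet, but with
    // the Hammer2.1-1.5b model) in a backend using the matching Hammer
    // prompt formatter, then chat with it.
    val backend = LlmInferenceBackend(llmInference, HammerFormatter())
    val model = GenerativeModel(backend, listOf(weatherTool))
    val chat = model.startChat()

    // The response carries either plain text or a structured function call
    // that the app is expected to execute and feed back into the chat.
    val response = chat.sendMessage("What is the weather in London?")
    println(response)
}
```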
# Model
Gemma is Google's on-device model family: https://huggingface.co/litert-community/Gemma3-1B-IT
The files follow this naming pattern:
`Gemma3-1B-(IT)_(multi-prefill-seq)_(seq128)_(f32|q4|q8)_(ekv1280|ekv2048|ekv4096)_(block32|block128).task`
* Gemma → model family (Gemma).
* 3 → model version (Gemma 3).
* 1B → parameter count (~1 billion parameters).
* IT → Instruction-Tuned (fine-tuned to follow instructions).
* multi-prefill-seq → likely optimized to handle multiple prefill sequences in parallel.
* seq128 → maximum sequence length (128 tokens).
* Quantization:
  * f32 → 32-bit floating point (full precision).
  * q4 → quantized to 4-bit integers.
  * q8 → quantized to 8-bit integers.
  * block32, block128 → quantization block size.
* ekv1280, ekv2048, ekv4096 → likely the effective KV-cache size (the number of key/value slots in the attention cache).
* .task → MediaPipe task bundle (the packaged model file the inference pipeline loads).
* .litertlm → LiteRT language-model format (a model package optimized for the LiteRT runtime; smaller and faster).
* Chipset identifiers:
  * mtXXXX → MediaTek processors.
  * smXXXX → Qualcomm Snapdragon processors.
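To make the convention concrete, here is a small helper that pulls these fields out of a file name. Purely illustrative — this parser and its types are hypothetical, not part of LiteRT or MediaPipe.

```kotlin
// Hypothetical helper, not part of any SDK: extracts the fields
// described above from a litert-community model file name.
data class ModelVariant(
    val family: String,        // e.g. "Gemma3"
    val params: String?,       // e.g. "1B"
    val quantization: String?, // f32 | q4 | q8
    val kvCacheSize: Int?,     // from ekvNNNN
    val blockSize: Int?,       // from blockNN
)

fun parseModelFileName(fileName: String): ModelVariant {
    val base = fileName.removeSuffix(".task").removeSuffix(".litertlm")
    val parts = base.split("_", "-")
    return ModelVariant(
        family = parts.firstOrNull() ?: base,
        params = parts.firstOrNull { it.matches(Regex("""\d+B""")) },
        quantization = parts.firstOrNull { it in setOf("f32", "q4", "q8") },
        kvCacheSize = parts.firstOrNull { it.startsWith("ekv") }?.removePrefix("ekv")?.toIntOrNull(),
        blockSize = parts.firstOrNull { it.startsWith("block") }?.removePrefix("block")?.toIntOrNull(),
    )
}

// parseModelFileName("Gemma3-1B-IT_multi-prefill-seq_q8_ekv2048.task")
// → ModelVariant(family="Gemma3", params="1B", quantization="q8",
//                kvCacheSize=2048, blockSize=null)
```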