
perf(android): replace flutter_gemma with llamadart — 150 MB → 42 MB APK #3

Merged
priyavratuniyal merged 8 commits into main from feature/apk-size-reduction on Mar 27, 2026
Conversation

priyavratuniyal (Owner) commented Mar 27, 2026

Summary

  • Replace flutter_gemma (~133 MB MediaPipe/LiteRT native libs) with llamadart (~5.3 MB, llama.cpp via Dart Native Assets)
  • Add two build flavors: full (all AI features) and lite (no LLM, no embeddings)
  • Switch vector search from flutter_gemma's SQLite store to ObjectBox native HNSW — eliminating the separate koshika_vectors.db (entity sketched below)
  • Users pick from 4 curated GGUF models (SmolLM2 360M → Gemma 3 1B) or any custom URL; no HuggingFace token required
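
As a reference for the ObjectBox change above, here is a minimal sketch of the HNSW-backed entity; only the @HnswIndex(dimensions: 384) annotation on embedding comes from this PR, the other fields are illustrative:

```dart
import 'package:objectbox/objectbox.dart';

@Entity()
class BiomarkerResult {
  @Id()
  int id = 0;

  String name; // illustrative field

  // ObjectBox builds a native HNSW index over this 384-dim vector
  // (bge-small-en-v1.5), replacing the old koshika_vectors.db store.
  @HnswIndex(dimensions: 384)
  @Property(type: PropertyType.floatVector)
  List<double>? embedding;

  BiomarkerResult(this.name);
}
```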

APK Sizes (arm64)

| Flavor | Before | After   |
|--------|--------|---------|
| full   | 150 MB | 41.8 MB |
| lite   | 150 MB | 35.7 MB |

Commits

  1. refactor(models): @HnswIndex(dimensions: 384), app-owned RetrievalResult, LlmModelRegistry with 4 curated models, ObjectBox bindings regenerated
  2. feat(services): ModelDownloader with HTTP resume, progress callbacks, no token needed
  3. refactor(ai): LlmService + LlmEmbeddingService replacing GemmaService + EmbeddingService; VectorStoreService rewritten for ObjectBox HNSW; bge-small-en-v1.5 at 384-dim (was 768)
  4. refactor(app): kAiEnabled flag, appMain(), main_full.dart / main_lite.dart entry points, splash migration for 768→384 re-embedding, settings AI section redesign
  5. feat(android): Gradle productFlavors (lite/full), androidComponents.onVariants for per-variant jniLibs exclusion, llamadart hooks config (cpu_profile: compact, backends: [cpu]), CI builds both flavors

Key Technical Notes

Why llamadart, not llama_cpp_dart: llama_cpp_dart ships no flutter: plugin: declaration in its pubspec, so its native libs never compile on Android. llamadart uses Dart's Native Assets system to auto-download pre-compiled binaries during flutter build.

Why androidComponents.onVariants, not productFlavors { packaging }: flavor-level packaging.jniLibs.excludes in Gradle applies globally across all variants. The Variant API (androidComponents.onVariants) correctly scopes exclusions per variant, as sketched below.
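
A sketch of what this looks like in build.gradle.kts; the excluded lib list matches the lite flavor described in this PR, while the exact block wiring is an assumption:

```kotlin
androidComponents {
    onVariants { variant ->
        // Only the lite flavor drops the llama.cpp libs; full keeps them.
        if (variant.flavorName == "lite") {
            variant.packaging.jniLibs.excludes.addAll(
                "**/libllama.so",
                "**/libggml*.so",
                "**/libmtmd.so",
                "**/libllamadart.so",
            )
        }
    }
}
```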

llamadart native lib reduction: the default bundle is 99 MB (7 CPU variants + Vulkan). Setting cpu_profile: compact (one baseline variant) and backends: [cpu] (no Vulkan) in the pubspec.yaml hooks brings this down to 5.3 MB:
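
The hooks block, sketched after llamadart's documented llamadart_native_backends structure (treat the exact nesting as illustrative):

```yaml
hooks:
  user_defines:
    llamadart:
      llamadart_native_backends:
        platforms:
          android-arm64:
            backends: [cpu]       # no Vulkan
            cpu_profile: compact  # one baseline CPU variant instead of 7
          android-x64:
            backends: [cpu]
            cpu_profile: compact  # added during review (see below)
```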

Test Plan

  • flutter analyze — 0 issues
  • Full flavor: download model → load → streaming chat response
  • Full flavor: import PDF → semantic search → citations in response
  • Full flavor: switch model — old file deleted, new model downloads
  • Full flavor: custom GGUF URL field in Settings
  • Lite flavor: no Chat tab, no AI Models section in Settings
  • Lite flavor: PDF import, dashboard, charts, FHIR export all work
  • Verify lite APK has no libllama.so / libggml*.so
  • Embedding migration: existing 768-dim embeddings re-computed on first launch

Summary by CodeRabbit

  • New Features

    • App now offers two build variants: Lite (AI off) and Full (AI enabled)
    • Select, download, load, and cancel on-device models; add custom models via GGUF URL
    • AI can be disabled at startup
  • Improvements

    • Redesigned model picker and active-model controls with progress/status and cancel action
    • Automatic embedding migration and faster local vector search for biomarker retrieval

…ions

- Split release APK per ABI (arm64 + x64 only), dropping universal fat build
- Enable R8 minification and resource shrinking for release builds
- Exclude Vulkan debug validation layer from release packaging
- Exclude unused MediaPipe image generation libs from release packaging

Add @HnswIndex(dimensions: 384) to BiomarkerResult.embedding so
ObjectBox can run native nearest-neighbor search — eliminating the
separate koshika_vectors.db SQLite store that flutter_gemma required.

Introduce RetrievalResult as an app-owned type so the RAG pipeline
(ChatContextBuilder, CitationExtractor) no longer imports from
flutter_gemma. Add LlmModelConfig and LlmModelRegistry with 4 curated
public GGUF models (SmolLM2 360M, Qwen3 0.6B default, Llama 3.2 1B,
Gemma 3 1B) and a custom-URL escape hatch.

Regenerate ObjectBox bindings to pick up the HNSW annotation.
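
Sketched shapes of the types this commit introduces; field names and the placeholder URL are assumptions beyond what the message above states:

```dart
/// App-owned retrieval DTO so RAG code no longer imports flutter_gemma.
class RetrievalResult {
  final String text;
  final double score;
  const RetrievalResult({required this.text, required this.score});
}

class LlmModelConfig {
  final String id;
  final String displayName;
  final String downloadUrl; // public GGUF URL, no HF token required
  final String filename;
  const LlmModelConfig({
    required this.id,
    required this.displayName,
    required this.downloadUrl,
    required this.filename,
  });
}

class LlmModelRegistry {
  static const curated = <LlmModelConfig>[
    LlmModelConfig(
      id: 'qwen3-0.6b', // the default model
      displayName: 'Qwen3 0.6B',
      downloadUrl: 'https://example.com/qwen3-0.6b.gguf', // placeholder
      filename: 'qwen3-0.6b.gguf',
    ),
    // SmolLM2 360M, Llama 3.2 1B, and Gemma 3 1B follow the same shape.
  ];
}
```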
Add ModelDownloader to handle GGUF file downloads from arbitrary URLs
(no HuggingFace token needed — all curated models are public).

Supports progress callbacks for UI, cancellation via a flag checked
between chunks, and download resume via HTTP Range headers using a
.part temp file. Static helpers expose the models directory path so
services and the splash migration can share one download location.
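
A condensed sketch of that resume flow with dart:io (cancellation flag and progress callbacks omitted; the status guard reflects review feedback further down):

```dart
import 'dart:io';

Future<void> download(Uri url, File target) async {
  final part = File('${target.path}.part');
  final startByte = await part.exists() ? await part.length() : 0;

  final client = HttpClient();
  try {
    final request = await client.getUrl(url);
    if (startByte > 0) {
      // Resume from where the .part file left off.
      request.headers.set(HttpHeaders.rangeHeader, 'bytes=$startByte-');
    }
    final response = await request.close();

    // Fail fast on anything but 200 (full) or 206 (partial) so error
    // pages are never cached as .gguf files.
    if (response.statusCode != HttpStatus.ok &&
        response.statusCode != HttpStatus.partialContent) {
      throw HttpException('HTTP ${response.statusCode}', uri: url);
    }

    // 206 means the server honored the Range header; otherwise restart.
    final resumed = response.statusCode == HttpStatus.partialContent;
    final sink =
        part.openWrite(mode: resumed ? FileMode.append : FileMode.write);
    await response.pipe(sink); // pipe() closes the sink on completion
    await part.rename(target.path);
  } finally {
    client.close();
  }
}
```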
Swap the entire inference layer from flutter_gemma (MediaPipe/LiteRT,
~133 MB native libs) to llamadart (llama.cpp via Dart Native Assets,
~5.3 MB compact CPU build).

LlmService replaces GemmaService: model-agnostic, takes an
LlmModelConfig, uses ChatML prompt format compatible with Qwen3,
SmolLM2, and Llama-3. No HuggingFace token needed — all GGUF models
are public.
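
ChatML frames each turn with <|im_start|>/<|im_end|> markers; a minimal prompt builder (the helper name is ours) looks like:

```dart
String buildChatMlPrompt(String system, String user) {
  return '<|im_start|>system\n$system<|im_end|>\n'
      '<|im_start|>user\n$user<|im_end|>\n'
      '<|im_start|>assistant\n';
}
```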

LlmEmbeddingService replaces EmbeddingService: bge-small-en-v1.5 at
384 dimensions (down from 768). Embeddings are now async (Future) not
sync, so all callers in VectorStoreService are updated with await.

VectorStoreService is fully rewritten to use ObjectBox HNSW queries
instead of a separate SQLite vector DB. No koshika_vectors.db.
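
A sketch of the rewritten search path using ObjectBox's vector query API (service wiring and RetrievalResult fields are assumptions):

```dart
Future<List<RetrievalResult>> search(String query, {int topK = 5}) async {
  final queryVec = await embeddingService.embed(query); // 384 floats

  final q = store
      .box<BiomarkerResult>()
      .query(BiomarkerResult_.embedding.nearestNeighborsF32(queryVec, topK))
      .build();
  try {
    return q
        .findWithScores()
        .map((r) => RetrievalResult(
              text: r.object.name,
              score: r.score, // HNSW distance; smaller means closer
            ))
        .toList();
  } finally {
    q.close();
  }
}
```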
…y points

Introduce kAiEnabled global flag (set by entry point) to gate all AI
features: Chat tab visibility, AI Models section in Settings, and
vector indexing after PDF import.

Add main_full.dart (aiEnabled: true) and main_lite.dart (aiEnabled:
false) as separate Gradle flavor entry points. SplashScreen initialises
LlmService and LlmEmbeddingService only when kAiEnabled, and runs a
one-time migration to re-embed any stale 768-dim vectors from the old
EmbeddingGemma model.
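
The migration pass, sketched (box access and field names assumed):

```dart
Future<void> _migrateEmbeddingsIfNeeded() async {
  final box = store.box<BiomarkerResult>();
  final stale = box
      .getAll()
      .where((r) => r.embedding == null || r.embedding!.length != 384)
      .toList();
  for (final r in stale) {
    // Old EmbeddingGemma vectors were 768-dim; recompute at 384.
    r.embedding = await embeddingService.embed(r.name);
  }
  if (stale.isNotEmpty) box.putMany(stale);
}
```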

Settings AI section is fully redesigned: users can pick from 4 curated
GGUF models or paste a custom URL, with inline download/load controls
and progress. No HuggingFace token prompt — all models are public.

Add two Gradle product flavors sharing the appType dimension:
- full: includes all llama.cpp native libs (~5.3 MB, arm64)
- lite: strips libllama.so, libggml*.so, libmtmd.so, libllamadart.so

Uses androidComponents.onVariants for per-variant jniLibs exclusion —
flavor-level packaging.jniLibs.excludes applies globally in Gradle and
would strip AI libs from both variants.

Configure llamadart's Native Assets hook in pubspec.yaml to use
cpu_profile: compact (1 baseline CPU variant instead of 7) and
backends: [cpu] (no Vulkan), reducing native lib overhead from 99 MB
to 5.3 MB. Final APK sizes: full arm64 = 41.8 MB, lite = 35.7 MB
(down from 150 MB on the pre-Phase-2 branch).

CI builds both flavors and uploads all four APKs (arm64/x64 × lite/full)
to the GitHub release.
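
The per-flavor builds likely reduce to invocations along these lines (flags inferred from the flavor/entry-point setup; the actual steps live in .github/workflows/release.yml):

```sh
flutter build apk --release --flavor full -t lib/main_full.dart --split-per-abi
flutter build apk --release --flavor lite -t lib/main_lite.dart --split-per-abi
```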
…ction

# Conflicts:
#	.github/workflows/release.yml
#	android/app/build.gradle.kts
coderabbitai (bot) commented Mar 27, 2026

📝 Walkthrough

The PR replaces flutter_gemma with llamadart for on-device LLM and embedding models, introduces Android product flavors (lite/full) and flavor-specific release APKs, adds new model/download/embedding services and model registry, converts ObjectBox embeddings to an HNSW index, and gates AI initialization and UI by a new kAiEnabled flag.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI/CD & Android build (.github/workflows/release.yml, android/app/build.gradle.kts, android/app/proguard-rules.pro) | Release workflow now builds two flavor-specific APKs (lite/full). Gradle adds appType flavors and excludes llama native libs for lite. Removed two MediaPipe -dontwarn entries. |
| Dependency & pubspec (pubspec.yaml) | Removed flutter_gemma, added llamadart:^0.6.8 and native-backend hooks for arm64/x64 CPU. |
| Removed services, flutter_gemma (lib/services/gemma_service.dart, lib/services/embedding_service.dart) | Deleted Gemma and old Embedding service implementations and their public APIs (download/load/generate/embed/streams). |
| New LLM & embedding services (lib/services/llm_service.dart, lib/services/llm_embedding_service.dart, lib/services/model_downloader.dart) | Added LlmService (GGUF model lifecycle, download, load, streaming generation), LlmEmbeddingService (GGUF embedding lifecycle + embed/embedBatch), and ModelDownloader (resumable HTTP downloads, progress, cancellation). |
| Vector store & search (lib/services/vector_store_service.dart) | Replaced external VectorStore with ObjectBox HNSW usage: indexResults() computes and persists embeddings in-place; search() uses ObjectBox nearestNeighborsF32 and returns RetrievalResult. Removed explicit initialize/isInitialized semantics. |
| ObjectBox & models (lib/models/biomarker_result.dart, lib/objectbox-model.json, lib/objectbox.g.dart) | Annotated BiomarkerResult.embedding with @HnswIndex(dimensions: 384). Updated objectbox model metadata and generated code to include HNSW indexId/params. |
| Model config & registry (lib/models/llm_model_config.dart) | Added LlmModelConfig, LlmModelRegistry, GGUF filename/url parsing helpers, curated models, and custom-model validation/creation. |
| Core model additions (lib/models/retrieval_result.dart, lib/models/model_info.dart) | Added RetrievalResult DTO. Minor doc update to ModelInfo.downloadProgress. |
| Bootstrap & entry points (lib/main.dart, lib/main_full.dart, lib/main_lite.dart, lib/constants/ai_prompts.dart) | Extracted appMain({aiEnabled}), added global kAiEnabled, replaced Gemma globals with LlmService/LlmEmbeddingService, created main_full/main_lite entry points, renamed system prompt constant. |
| Screens & UI wiring (lib/screens/splash_screen.dart, lib/screens/chat_screen.dart, lib/screens/settings_screen.dart, lib/screens/reports_screen.dart) | Splash initializes ObjectBox and conditionally initializes LLM/embedding services when kAiEnabled. Chat/settings rewired from Gemma/embedding to LLM services, removed HF token flow, added model picker/custom-model dialog, and gated chat tab/indexing by kAiEnabled. Reports indexing now requires kAiEnabled. |
| Utilities & imports (lib/services/chat_context_builder.dart, lib/services/citation_extractor.dart) | Removed transitive flutter_gemma imports and added explicit RetrievalResult imports. |

Sequence Diagram

sequenceDiagram
    actor User
    participant ChatScreen
    participant LlmService
    participant ModelDownloader
    participant LlamaEngine
    participant ObjectBox

    rect rgba(76, 175, 80, 0.5)
    Note over ChatScreen,LlmService: Download model flow
    User->>ChatScreen: Tap "Download Model"
    ChatScreen->>LlmService: downloadModel()
    LlmService->>ModelDownloader: download(url, filename, onProgress)
    ModelDownloader->>ModelDownloader: resume/stream HTTP chunks
    ModelDownloader-->>LlmService: return local path (on success)
    LlmService->>LlmService: emit status=ready
    end

    rect rgba(33, 150, 243, 0.5)
    Note over ChatScreen,LlamaEngine: Load & generate
    User->>ChatScreen: Tap "Load Model"
    ChatScreen->>LlmService: loadModel()
    LlmService->>LlamaEngine: loadModelFromPath(path)
    LlmService->>LlmService: emit status=loaded
    User->>ChatScreen: Send message
    ChatScreen->>LlmService: generateResponse(message, context?)
    LlmService->>LlamaEngine: generate(prompt)
    LlamaEngine-->>LlmService: stream tokens
    LlmService-->>ChatScreen: stream<String> tokens
    ChatScreen->>ChatScreen: render streaming response
    end

    rect rgba(156, 39, 176, 0.5)
    Note over ChatScreen,ObjectBox: Semantic search via ObjectBox HNSW
    ChatScreen->>LlmService: embed(query)
    LlmService->>LlmService: return embedding
    ChatScreen->>ObjectBox: BiomarkerResult_.embedding.nearestNeighborsF32(embedding, topK)
    ObjectBox-->>ChatScreen: List<RetrievalResult>
    ChatScreen->>ChatScreen: append sources to response
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 I nibble code and hop with glee,
Two flavors sprout — lite and fully free,
Llamadart hums, downloads resume,
HNSW hops through vector bloom,
A rabbit's cheer for features new! 🥕

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and specifically summarizes the main change: replacing flutter_gemma with llamadart and the APK size reduction from 150 MB to 42 MB, which directly reflects the PR objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |


coderabbitai (bot) left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/screens/chat_screen.dart (1)

369-377: ⚠️ Potential issue | 🟡 Minor

Missing await on stopGeneration() call.

_startNewChat() calls llmService.stopGeneration() without await, but the same method is awaited in _stopGeneration() (line 349) and _loadSession() (line 217). This inconsistency could cause a race condition if generation is still running when the state is cleared.

🔧 Proposed fix
-  void _startNewChat() {
+  Future<void> _startNewChat() async {
     _generationSubscription?.cancel();
     _generationSubscription = null;
-    llmService.stopGeneration();
+    await llmService.stopGeneration();
     setState(() {
       _currentSession = null;
       _messages.clear();
     });
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/screens/chat_screen.dart` around lines 369 - 377, The _startNewChat()
method calls llmService.stopGeneration() without awaiting it, risking a race
condition; make _startNewChat() async and await llmService.stopGeneration() just
like in _stopGeneration() and _loadSession(), then proceed to cancel
_generationSubscription, clear _currentSession and _messages inside setState;
ensure you preserve existing subscription cancellation and nulling of
_generationSubscription and handle any errors consistently with the other
callers.
🧹 Nitpick comments (1)
android/app/build.gradle.kts (1)

30-36: Release build uses debug signing — intentional for development?

The release build type is configured with signingConfig = signingConfigs.getByName("debug"). This is acceptable for testing and CI artifact generation, but production releases will require a proper release keystore configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@android/app/build.gradle.kts` around lines 30 - 36, The release build
currently sets signingConfig = signingConfigs.getByName("debug") inside the
buildTypes { release { ... } } block, which uses the debug keystore; replace
this with a proper release signing configuration by creating and referencing a
release SigningConfig (e.g., signingConfigs.create("release") or
signingConfigs.getByName("release")) and point the release buildType to that
release SigningConfig, or add logic to load the release keystore properties
(storeFile, storePassword, keyAlias, keyPassword) and assign them to the release
SigningConfig so production builds are signed with the correct keystore instead
of debug.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/models/llm_model_config.dart`:
- Around line 85-97: LlmModelConfig.custom currently accepts any downloadUrl;
add fail-fast validation inside the factory method LlmModelConfig.custom to
parse the URL (use Uri.parse) and ensure it uses the https scheme and the path
ends with ".gguf" (and is non-empty); if validation fails, throw an
ArgumentError with a clear message. Also trim/normalize the input string before
parsing and keep the rest of the returned LlmModelConfig fields unchanged so
invalid/insecure URLs never reach the downloader flow.

In `@lib/screens/settings_screen.dart`:
- Around line 950-977: The URL validation in _submit currently checks the raw
string with url.toLowerCase().endsWith('.gguf'), which allows query-only URLs
and misses cases where the path segment is what matters; update _submit to parse
Uri.parse(url) and validate that uri.pathSegments is non-empty and
uri.pathSegments.last.toLowerCase().endsWith('.gguf') (or, if pathSegments is
empty, check query parameters for a value ending in .gguf) and set _urlError if
not; also update _filenameFromUrl to derive the filename from
uri.pathSegments.last when present, otherwise try extracting a .gguf filename
from query parameter values before falling back to the 'Custom Model' default so
displayName and LlmModelRegistry.custom(name: ...) are always valid.

In `@lib/screens/splash_screen.dart`:
- Around line 70-95: The migration currently only runs when the embedder is
ready at splash; fix by ensuring _migrateEmbeddingsIfNeeded() is invoked after
every successful embedding model load by chaining it to
LlmEmbeddingService.loadModel() wherever that method is called (not just in
splash_screen.dart), or centralize the behavior by modifying
LlmEmbeddingService.loadModel() to call a post-load hook/event that triggers
_migrateEmbeddingsIfNeeded() on success; update all callers (e.g., settings code
that triggers embeddingService.loadModel()) to rely on that hook or to
explicitly .then((_) => _migrateEmbeddingsIfNeeded()) so stale 768-dim rows are
re-embedded immediately after any later embedder download/load.

In `@lib/services/llm_embedding_service.dart`:
- Around line 147-150: unloadModel() currently calls await
_engine!.unloadModel() but leaves the old LlamaEngine instance in _engine,
leaking native resources; update unloadModel() to fully dispose and clear the
engine by awaiting _engine?.dispose() (or dispose/unload as appropriate), then
set _engine = null, mirroring the error-path cleanup used in loadModel(); apply
the same dispose-and-null pattern to the other unload/error paths referenced
around the 187–195 region so every code path disposes the LlamaEngine and nulls
_engine.

In `@lib/services/llm_service.dart`:
- Around line 103-129: switchModel currently proceeds while an existing
downloadModel may still be streaming the previous file, causing stale callbacks
to update state and risking partial-file deletion; before tearing down the
current model or calling ModelDownloader.deleteModel(_config.filename), detect
any in-flight download (e.g. a tracked Future/Task or StreamSubscription like
_currentDownloadTask/_currentDownloadSubscription started by downloadModel),
request cancellation on it, await its completion, and remove its
progress/completion listeners so callbacks cannot update _modelInfo/_config
after the switch; only after the in-flight download has fully settled, call
unloadModel, delete the file, set _config = newConfig, persist prefs, and call
_updateStatus. Also ensure download completion callbacks verify the model id
(compare against _config.id) before updating state.
- Around line 219-223: unloadModel() currently calls _engine!.unloadModel() but
never disposes or nulls _engine, leaking native resources when switchModel()
loads a new engine; update unloadModel() to follow the pattern used in
loadModel()’s error handler by calling _engine!.dispose() after unloadModel(),
then set _engine = null and update the model status (e.g. via
_updateStatus(_modelInfo.copyWith(status: ModelStatus.unloaded))) so the engine
is fully cleaned up before a new model is loaded.

In `@lib/services/model_downloader.dart`:
- Around line 90-103: Before streaming/writing the response body, check
response.statusCode after request.close() and reject any non-success codes
(accept 206 Partial Content or any 2xx full responses); if the status is not
2xx/206 (e.g., 404/500/416) throw an error (or return a failure) and do not
proceed to set startByte/total or write/rename the .part file. Concretely, in
the download method where request.close() returns response (symbols:
request.close(), response, startByte, total, response.contentLength), add an
early guard that reads/consumes any error body if needed, logs/throws an
HttpException with response.statusCode and reasonPhrase, and aborts before any
file write/rename operations so failed HTTP responses never corrupt the local
cache.
- Around line 77-90: The download can be force-closed by cancel() which calls
_client.close(force: true) but exceptions thrown before the chunk loop (e.g.,
during request.close() or reading the response) are not normalized to
DownloadCancelledException; wrap the existing try block that creates _client,
calls request.close(), and iterates the response stream in a try { ... } on
Object catch (e, st) { if (_cancelled) throw DownloadCancelledException();
rethrow; } so any transport errors caused by a forced close are re-thrown as
DownloadCancelledException while other errors continue to propagate; reference
_client, cancel(), request.close(), the response stream iteration, and
DownloadCancelledException when making the change.

In `@pubspec.yaml`:
- Around line 52-54: android-x64 is missing the cpu_profile setting which causes
x64 builds to default to full; add cpu_profile: compact under the android-x64
platform entry to match android-arm64. Locate the android-x64 platform block
(label "android-x64") in pubspec.yaml and add the line cpu_profile: compact
alongside backends: [cpu] so both platforms use the same compact CPU profile.

---

Outside diff comments:
In `@lib/screens/chat_screen.dart`:
- Around line 369-377: The _startNewChat() method calls
llmService.stopGeneration() without awaiting it, risking a race condition; make
_startNewChat() async and await llmService.stopGeneration() just like in
_stopGeneration() and _loadSession(), then proceed to cancel
_generationSubscription, clear _currentSession and _messages inside setState;
ensure you preserve existing subscription cancellation and nulling of
_generationSubscription and handle any errors consistently with the other
callers.

---

Nitpick comments:
In `@android/app/build.gradle.kts`:
- Around line 30-36: The release build currently sets signingConfig =
signingConfigs.getByName("debug") inside the buildTypes { release { ... } }
block, which uses the debug keystore; replace this with a proper release signing
configuration by creating and referencing a release SigningConfig (e.g.,
signingConfigs.create("release") or signingConfigs.getByName("release")) and
point the release buildType to that release SigningConfig, or add logic to load
the release keystore properties (storeFile, storePassword, keyAlias,
keyPassword) and assign them to the release SigningConfig so production builds
are signed with the correct keystore instead of debug.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8da3eb0c-fb2f-4d9f-a781-6272298ef7f2

📥 Commits

Reviewing files that changed from the base of the PR and between 74029b4 and da98430.

⛔ Files ignored due to path filters (1)
  • pubspec.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • .github/workflows/release.yml
  • android/app/build.gradle.kts
  • android/app/proguard-rules.pro
  • lib/constants/ai_prompts.dart
  • lib/main.dart
  • lib/main_full.dart
  • lib/main_lite.dart
  • lib/models/biomarker_result.dart
  • lib/models/llm_model_config.dart
  • lib/models/model_info.dart
  • lib/models/retrieval_result.dart
  • lib/objectbox-model.json
  • lib/objectbox.g.dart
  • lib/screens/chat_screen.dart
  • lib/screens/reports_screen.dart
  • lib/screens/settings_screen.dart
  • lib/screens/splash_screen.dart
  • lib/services/chat_context_builder.dart
  • lib/services/citation_extractor.dart
  • lib/services/embedding_service.dart
  • lib/services/gemma_service.dart
  • lib/services/llm_embedding_service.dart
  • lib/services/llm_service.dart
  • lib/services/model_downloader.dart
  • lib/services/vector_store_service.dart
  • pubspec.yaml
💤 Files with no reviewable changes (3)
  • android/app/proguard-rules.pro
  • lib/services/gemma_service.dart
  • lib/services/embedding_service.dart

Collapsed comment threads: lib/models/llm_model_config.dart, lib/screens/settings_screen.dart

Comment thread lib/screens/splash_screen.dart
Comment on lines +70 to 95

    // AI services — only when enabled (full flavor)
    if (app_main.kAiEnabled) {
      app_main.llmService = LlmService(LlmModelRegistry.defaultModel);
      await app_main.llmService.initialize();

      app_main.embeddingService = LlmEmbeddingService();
      await app_main.embeddingService.initialize();

      app_main.vectorStoreService = VectorStoreService(
        app_main.embeddingService,
      );

      // Auto-load models if already downloaded
      if (app_main.llmService.currentModelInfo.status == ModelStatus.ready) {
        // ignore: unawaited_futures
        app_main.llmService.loadModel();
      }
      if (app_main.embeddingService.currentModelInfo.status ==
          ModelStatus.ready) {
        // ignore: unawaited_futures
        app_main.embeddingService.loadModel().then((_) {
          // Re-embed any results that have stale or missing embeddings
          _migrateEmbeddingsIfNeeded();
        });
      }
    }

⚠️ Potential issue | 🟠 Major

Run the 768→384 migration on the first successful embedder load, not only during splash.

Lines 87-94 only schedule _migrateEmbeddingsIfNeeded() when the embedding model is already ready at startup. After an upgrade, users who download/load the new embedder later in Settings keep stale 768-dim rows until the next app restart, so semantic search/chat retrieval is wrong for that whole session.
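
One way to centralize this, per the prompt below: give the service a post-load hook (the onLoaded callback is an assumption; only loadModel() and the migration exist in the PR):

```dart
class LlmEmbeddingService {
  /// Invoked after every successful model load, wherever it is triggered.
  Future<void> Function()? onLoaded;

  Future<void> loadModel() async {
    // ... existing engine setup and GGUF load ...
    await onLoaded?.call(); // splash wires this to _migrateEmbeddingsIfNeeded
  }
}
```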

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/screens/splash_screen.dart` around lines 70 - 95, The migration currently
only runs when the embedder is ready at splash; fix by ensuring
_migrateEmbeddingsIfNeeded() is invoked after every successful embedding model
load by chaining it to LlmEmbeddingService.loadModel() wherever that method is
called (not just in splash_screen.dart), or centralize the behavior by modifying
LlmEmbeddingService.loadModel() to call a post-load hook/event that triggers
_migrateEmbeddingsIfNeeded() on success; update all callers (e.g., settings code
that triggers embeddingService.loadModel()) to rely on that hook or to
explicitly .then((_) => _migrateEmbeddingsIfNeeded()) so stale 768-dim rows are
re-embedded immediately after any later embedder download/load.

Collapsed comment threads: lib/services/llm_embedding_service.dart, lib/services/llm_service.dart (×2), lib/services/model_downloader.dart (×2)

Comment thread pubspec.yaml
Comment on lines +52 to 54
          android-x64:
            backends: [cpu]


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query: What are the valid cpu_profile options for llamadart native backends configuration?

💡 Result: The valid cpu_profile options for the llamadart native-backends configuration (llamadart_native_backends in pubspec.yaml, for Android ARM platforms) are "full" (the default; bundles all Android ARM CPU variants for the best runtime-optimization coverage) and "compact" (bundles only the baseline CPU variant for a smaller package). An advanced cpu_variants key can override cpu_profile with an exact list of canonical variants such as "android_armv8.2_2", but it is not itself a cpu_profile value.

🌐 Web query: llamadart android-x64 cpu_profile configuration native backends

💡 Result: llamadart is a Dart/Flutter plugin for running llama.cpp GGUF models locally, and supports android-x64 (primarily emulators) among other platforms. For android-x64, native backends default to CPU and Vulkan, configurable in pubspec.yaml under the llamadart_native_backends hook:

    hooks:
      user_defines:
        llamadart:
          llamadart_native_backends:
            platforms:
              android-x64:
                backends: [cpu, vulkan, opencl]  # optional; defaults: cpu, vulkan
                cpu_profile: full                # or 'compact' for a smaller bundle

cpu_profile controls which CPU optimization variants are bundled: full (default) includes all variants for broad compatibility, compact includes only the baseline variant for a smaller package, and the advanced cpu_variants key takes an exact list such as ['android_armv8.0_1'] (ARM-focused; x64 uses x86_64 optimizations). CPU is always included as a fallback, Vulkan requires the Vulkan loader, OpenCL is opt-in, and Apple platforms are non-configurable (Metal). On first run, llamadart auto-downloads/builds native assets via Pure Native Assets with zero manual configuration.
Add missing cpu_profile: compact for android-x64 platform.

The android-arm64 platform specifies both backends: [cpu] and cpu_profile: compact, but android-x64 only has backends: [cpu]. Without an explicit cpu_profile, x64 builds will default to full (all CPU optimization variants), resulting in a larger bundle than arm64. This asymmetry undermines the APK size reduction goals of this PR.

🔧 Proposed fix
           android-x64:
             backends: [cpu]
+            cpu_profile: compact
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    # before
    android-x64:
      backends: [cpu]

    # after
    android-x64:
      backends: [cpu]
      cpu_profile: compact
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pubspec.yaml` around lines 52 - 54, android-x64 is missing the cpu_profile
setting which causes x64 builds to default to full; add cpu_profile: compact
under the android-x64 platform entry to match android-arm64. Locate the
android-x64 platform block (label "android-x64") in pubspec.yaml and add the
line cpu_profile: compact alongside backends: [cpu] so both platforms use the
same compact CPU profile.

priyavratuniyal (Owner, Author) commented:

@codex review

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da98430ce9


Comment on lines +90 to +94

    final response = await request.close();

    // Determine total size
    int total;
    if (response.statusCode == HttpStatus.partialContent) {

P1: Validate HTTP status before treating model download as successful

This download path starts consuming and persisting the response body without first rejecting non-success HTTP statuses, so 403/404/500 responses (often HTML or JSON error payloads) can be saved as .gguf files and then marked as successfully downloaded. In that scenario the UI transitions to ready, but model load fails later with a misleading error and the bad file remains on disk. Fail fast on any status other than 200/206 before writing the temp file.


Comment thread lib/services/llm_service.dart Outdated
Comment on lines +289 to +292

    if (_isGenerating) await stopGeneration();
    if (_engine != null) {
      await _engine!.unloadModel();
    }

P2: Dispose and clear LlamaEngine when unloading chat model

loadModel() allocates a new LlamaEngine, but unloadModel() only calls unloadModel() on the existing engine and leaves _engine set. On repeated load/unload cycles (including model switches), this can leave prior engine instances undisposed while new ones are created, which risks leaking native llama resources and increasing memory pressure over time. The unload path should dispose and null _engine before subsequent loads.
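
A sketch of the dispose-and-null pattern being asked for (assuming the dispose() call referenced elsewhere in this review):

```dart
Future<void> unloadModel() async {
  if (_isGenerating) await stopGeneration();
  if (_engine != null) {
    await _engine!.unloadModel();
    await _engine!.dispose(); // release native llama resources
    _engine = null;           // allow a fresh engine on the next load
  }
}
```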


Comment thread lib/services/llm_embedding_service.dart Outdated
Comment on lines +187 to +190

    Future<void> unloadModel() async {
      if (_engine != null) {
        await _engine!.unloadModel();
      }

P2: Dispose and clear LlamaEngine when unloading embedder

The embedder service mirrors the same lifecycle issue: loadModel() creates a new LlamaEngine, but unloadModel() never disposes or clears _engine. Repeatedly loading/unloading embeddings can therefore accumulate undisposed native engine state, which is especially risky on memory-constrained devices. Dispose and set _engine to null in the unload flow.


Follow-up commit addressing the review feedback:

- cancel and await in-flight model downloads before switching configs (sketched below)
- ignore stale download callbacks so old transfers cannot mutate the new model state
- dispose LlamaEngine instances on unload and error paths in both chat and embedding services
- reject non-success HTTP responses before caching model files and normalize forced-cancel errors
- share GGUF URL validation and filename derivation between settings and model config creation
- move embedding migration into a reusable post-load flow used by splash and settings
- keep flutter analyze and both lite/full release APK builds passing
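
The download-settling guard from the first bullet, sketched (the _activeDownload handle and its done future are illustrative; deleteModel and the prefs step come from the review thread above):

```dart
Future<void> switchModel(LlmModelConfig newConfig) async {
  // Settle any in-flight transfer before touching files or state,
  // so stale callbacks cannot mutate the new model's status.
  final inflight = _activeDownload;
  if (inflight != null) {
    inflight.cancel();
    await inflight.done.catchError((_) {}); // swallow the cancel error
  }

  await unloadModel();
  await ModelDownloader.deleteModel(_config.filename);
  _config = newConfig;
  await _persistSelectedModel(); // persist prefs; helper name assumed
  _updateStatus();
}
```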

coderabbitai (bot) left a comment

🧹 Nitpick comments (1)
lib/main.dart (1)

219-236: Consider using an enum or constants for navigation indices.

The hardcoded index logic (kAiEnabled ? 3 : 2) works but is fragile if more screens are added. A minor improvement would be to derive indices from the _screens list directly, though this is acceptable for the current scope.

♻️ Optional refactor to derive indices
 _NavItem(
-  index: kAiEnabled ? 3 : 2,
+  index: _screens.length - 1,
   currentIndex: _currentIndex,
   icon: Icons.settings_outlined,
   activeIcon: Icons.settings,
   label: 'Settings',
-  onTap: () =>
-      setState(() => _currentIndex = kAiEnabled ? 3 : 2),
+  onTap: () =>
+      setState(() => _currentIndex = _screens.length - 1),
 ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/main.dart` around lines 219 - 236, The code uses hardcoded navigation
indices (e.g., kAiEnabled ? 3 : 2) in the _NavItem constructors and onTap
handlers which is brittle; change the logic to derive indices from the _screens
list (or use named constants/enum) instead of literal numbers: compute the index
for the Chat and Settings items from _screens.indexOf(...) or from enum values
and replace occurrences in _NavItem(index: ...), currentIndex comparison, and
setState assignments that reference _currentIndex so indices remain correct when
screens change (refer to _NavItem, _currentIndex, kAiEnabled and the _screens
list).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@lib/main.dart`:
- Around line 219-236: The code uses hardcoded navigation indices (e.g.,
kAiEnabled ? 3 : 2) in the _NavItem constructors and onTap handlers which is
brittle; change the logic to derive indices from the _screens list (or use named
constants/enum) instead of literal numbers: compute the index for the Chat and
Settings items from _screens.indexOf(...) or from enum values and replace
occurrences in _NavItem(index: ...), currentIndex comparison, and setState
assignments that reference _currentIndex so indices remain correct when screens
change (refer to _NavItem, _currentIndex, kAiEnabled and the _screens list).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3f1ec2b-8519-4aa3-aa6d-ca390d6bbc60

📥 Commits

Reviewing files that changed from the base of the PR and between da98430 and 6e329d0.

📒 Files selected for processing (7)
  • lib/main.dart
  • lib/models/llm_model_config.dart
  • lib/screens/settings_screen.dart
  • lib/screens/splash_screen.dart
  • lib/services/llm_embedding_service.dart
  • lib/services/llm_service.dart
  • lib/services/model_downloader.dart
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/screens/settings_screen.dart

priyavratuniyal merged commit bac5f9d into main on Mar 27, 2026
2 checks passed