
perf(android): replace flutter_gemma with llamadart — 150 MB → 42 MB APK #3

Merged
priyavratuniyal merged 8 commits into main from feature/apk-size-reduction on Mar 27, 2026
Conversation

priyavratuniyal (Owner) commented Mar 27, 2026

Summary

  • Replace flutter_gemma (~133 MB MediaPipe/LiteRT native libs) with llamadart (~5.3 MB, llama.cpp via Dart Native Assets)
  • Add two build flavors: full (all AI features) and lite (no LLM, no embeddings)
  • Switch vector search from flutter_gemma's SQLite store to ObjectBox native HNSW — eliminating the separate koshika_vectors.db (entity sketched below)
  • Users pick from 4 curated GGUF models (SmolLM2 360M → Gemma 3 1B) or any custom URL; no HuggingFace token required
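
As a reference for the ObjectBox change above, here is a minimal sketch of the HNSW-backed entity; only the @HnswIndex(dimensions: 384) annotation on embedding comes from this PR, the other fields are illustrative:

```dart
import 'package:objectbox/objectbox.dart';

@Entity()
class BiomarkerResult {
  @Id()
  int id = 0;

  String name; // illustrative field

  // ObjectBox builds a native HNSW index over this 384-dim vector
  // (bge-small-en-v1.5), replacing the old koshika_vectors.db store.
  @HnswIndex(dimensions: 384)
  @Property(type: PropertyType.floatVector)
  List<double>? embedding;

  BiomarkerResult(this.name);
}
```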

APK Sizes (arm64)

| Flavor | Before | After   |
|--------|--------|---------|
| full   | 150 MB | 41.8 MB |
| lite   | 150 MB | 35.7 MB |

Commits

  1. refactor(models): @HnswIndex(dimensions: 384), app-owned RetrievalResult, LlmModelRegistry with 4 curated models, ObjectBox bindings regenerated
  2. feat(services): ModelDownloader with HTTP resume, progress callbacks, no token needed
  3. refactor(ai): LlmService + LlmEmbeddingService replacing GemmaService + EmbeddingService; VectorStoreService rewritten for ObjectBox HNSW; bge-small-en-v1.5 at 384-dim (was 768)
  4. refactor(app): kAiEnabled flag, appMain(), main_full.dart / main_lite.dart entry points, splash migration for 768→384 re-embedding, settings AI section redesign
  5. feat(android): Gradle productFlavors (lite/full), androidComponents.onVariants for per-variant jniLibs exclusion, llamadart hooks config (cpu_profile: compact, backends: [cpu]), CI builds both flavors

Key Technical Notes

Why llamadart, not llama_cpp_dart: llama_cpp_dart ships no flutter: plugin: declaration in its pubspec, so its native libs never compile on Android. llamadart uses Dart's Native Assets system to auto-download pre-compiled binaries during flutter build.

Why androidComponents.onVariants, not productFlavors { packaging }: flavor-level packaging.jniLibs.excludes in Gradle applies globally across all variants. The Variant API (androidComponents.onVariants) correctly scopes exclusions per variant, as sketched below.
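
A sketch of what this looks like in build.gradle.kts; the excluded lib list matches the lite flavor described in this PR, while the exact block wiring is an assumption:

```kotlin
androidComponents {
    onVariants { variant ->
        // Only the lite flavor drops the llama.cpp libs; full keeps them.
        if (variant.flavorName == "lite") {
            variant.packaging.jniLibs.excludes.addAll(
                "**/libllama.so",
                "**/libggml*.so",
                "**/libmtmd.so",
                "**/libllamadart.so",
            )
        }
    }
}
```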

llamadart native lib reduction: the default bundle is 99 MB (7 CPU variants + Vulkan). Setting cpu_profile: compact (one baseline variant) and backends: [cpu] (no Vulkan) in the pubspec.yaml hooks brings this down to 5.3 MB:
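
The hooks block, sketched after llamadart's documented llamadart_native_backends structure (treat the exact nesting as illustrative):

```yaml
hooks:
  user_defines:
    llamadart:
      llamadart_native_backends:
        platforms:
          android-arm64:
            backends: [cpu]       # no Vulkan
            cpu_profile: compact  # one baseline CPU variant instead of 7
          android-x64:
            backends: [cpu]
            cpu_profile: compact  # added during review (see below)
```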

Test Plan

  • flutter analyze — 0 issues
  • Full flavor: download model → load → streaming chat response
  • Full flavor: import PDF → semantic search → citations in response
  • Full flavor: switch model — old file deleted, new model downloads
  • Full flavor: custom GGUF URL field in Settings
  • Lite flavor: no Chat tab, no AI Models section in Settings
  • Lite flavor: PDF import, dashboard, charts, FHIR export all work
  • Verify lite APK has no libllama.so / libggml*.so
  • Embedding migration: existing 768-dim embeddings re-computed on first launch

Summary by CodeRabbit

  • New Features

    • App now offers two build variants: Lite (AI off) and Full (AI enabled)
    • Select, download, load, and cancel on-device models; add custom models via GGUF URL
    • AI can be disabled at startup
  • Improvements

    • Redesigned model picker and active-model controls with progress/status and cancel action
    • Automatic embedding migration and faster local vector search for biomarker retrieval

…ions

- Split release APK per ABI (arm64 + x64 only), dropping universal fat build
- Enable R8 minification and resource shrinking for release builds
- Exclude Vulkan debug validation layer from release packaging
- Exclude unused MediaPipe image generation libs from release packaging

Add @HnswIndex(dimensions: 384) to BiomarkerResult.embedding so
ObjectBox can run native nearest-neighbor search — eliminating the
separate koshika_vectors.db SQLite store that flutter_gemma required.

Introduce RetrievalResult as an app-owned type so the RAG pipeline
(ChatContextBuilder, CitationExtractor) no longer imports from
flutter_gemma. Add LlmModelConfig and LlmModelRegistry with 4 curated
public GGUF models (SmolLM2 360M, Qwen3 0.6B default, Llama 3.2 1B,
Gemma 3 1B) and a custom-URL escape hatch.

Regenerate ObjectBox bindings to pick up the HNSW annotation.
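
Sketched shapes of the types this commit introduces; field names and the placeholder URL are assumptions beyond what the message above states:

```dart
/// App-owned retrieval DTO so RAG code no longer imports flutter_gemma.
class RetrievalResult {
  final String text;
  final double score;
  const RetrievalResult({required this.text, required this.score});
}

class LlmModelConfig {
  final String id;
  final String displayName;
  final String downloadUrl; // public GGUF URL, no HF token required
  final String filename;
  const LlmModelConfig({
    required this.id,
    required this.displayName,
    required this.downloadUrl,
    required this.filename,
  });
}

class LlmModelRegistry {
  static const curated = <LlmModelConfig>[
    LlmModelConfig(
      id: 'qwen3-0.6b', // the default model
      displayName: 'Qwen3 0.6B',
      downloadUrl: 'https://example.com/qwen3-0.6b.gguf', // placeholder
      filename: 'qwen3-0.6b.gguf',
    ),
    // SmolLM2 360M, Llama 3.2 1B, and Gemma 3 1B follow the same shape.
  ];
}
```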
Add ModelDownloader to handle GGUF file downloads from arbitrary URLs
(no HuggingFace token needed — all curated models are public).

Supports progress callbacks for UI, cancellation via a flag checked
between chunks, and download resume via HTTP Range headers using a
.part temp file. Static helpers expose the models directory path so
services and the splash migration can share one download location.
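
A condensed sketch of that resume flow with dart:io (cancellation flag and progress callbacks omitted; the status guard reflects review feedback further down):

```dart
import 'dart:io';

Future<void> download(Uri url, File target) async {
  final part = File('${target.path}.part');
  final startByte = await part.exists() ? await part.length() : 0;

  final client = HttpClient();
  try {
    final request = await client.getUrl(url);
    if (startByte > 0) {
      // Resume from where the .part file left off.
      request.headers.set(HttpHeaders.rangeHeader, 'bytes=$startByte-');
    }
    final response = await request.close();

    // Fail fast on anything but 200 (full) or 206 (partial) so error
    // pages are never cached as .gguf files.
    if (response.statusCode != HttpStatus.ok &&
        response.statusCode != HttpStatus.partialContent) {
      throw HttpException('HTTP ${response.statusCode}', uri: url);
    }

    // 206 means the server honored the Range header; otherwise restart.
    final resumed = response.statusCode == HttpStatus.partialContent;
    final sink =
        part.openWrite(mode: resumed ? FileMode.append : FileMode.write);
    await response.pipe(sink); // pipe() closes the sink on completion
    await part.rename(target.path);
  } finally {
    client.close();
  }
}
```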
Swap the entire inference layer from flutter_gemma (MediaPipe/LiteRT,
~133 MB native libs) to llamadart (llama.cpp via Dart Native Assets,
~5.3 MB compact CPU build).

LlmService replaces GemmaService: model-agnostic, takes an
LlmModelConfig, uses ChatML prompt format compatible with Qwen3,
SmolLM2, and Llama-3. No HuggingFace token needed — all GGUF models
are public.
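
ChatML frames each turn with <|im_start|>/<|im_end|> markers; a minimal prompt builder (the helper name is ours) looks like:

```dart
String buildChatMlPrompt(String system, String user) {
  return '<|im_start|>system\n$system<|im_end|>\n'
      '<|im_start|>user\n$user<|im_end|>\n'
      '<|im_start|>assistant\n';
}
```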

LlmEmbeddingService replaces EmbeddingService: bge-small-en-v1.5 at
384 dimensions (down from 768). Embeddings are now async (Future) not
sync, so all callers in VectorStoreService are updated with await.

VectorStoreService is fully rewritten to use ObjectBox HNSW queries
instead of a separate SQLite vector DB. No koshika_vectors.db.
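
A sketch of the rewritten search path using ObjectBox's vector query API (service wiring and RetrievalResult fields are assumptions):

```dart
Future<List<RetrievalResult>> search(String query, {int topK = 5}) async {
  final queryVec = await embeddingService.embed(query); // 384 floats

  final q = store
      .box<BiomarkerResult>()
      .query(BiomarkerResult_.embedding.nearestNeighborsF32(queryVec, topK))
      .build();
  try {
    return q
        .findWithScores()
        .map((r) => RetrievalResult(
              text: r.object.name,
              score: r.score, // HNSW distance; smaller means closer
            ))
        .toList();
  } finally {
    q.close();
  }
}
```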
…y points

Introduce kAiEnabled global flag (set by entry point) to gate all AI
features: Chat tab visibility, AI Models section in Settings, and
vector indexing after PDF import.

Add main_full.dart (aiEnabled: true) and main_lite.dart (aiEnabled:
false) as separate Gradle flavor entry points. SplashScreen initialises
LlmService and LlmEmbeddingService only when kAiEnabled, and runs a
one-time migration to re-embed any stale 768-dim vectors from the old
EmbeddingGemma model.
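
The migration pass, sketched (box access and field names assumed):

```dart
Future<void> _migrateEmbeddingsIfNeeded() async {
  final box = store.box<BiomarkerResult>();
  final stale = box
      .getAll()
      .where((r) => r.embedding == null || r.embedding!.length != 384)
      .toList();
  for (final r in stale) {
    // Old EmbeddingGemma vectors were 768-dim; recompute at 384.
    r.embedding = await embeddingService.embed(r.name);
  }
  if (stale.isNotEmpty) box.putMany(stale);
}
```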

Settings AI section is fully redesigned: users can pick from 4 curated
GGUF models or paste a custom URL, with inline download/load controls
and progress. No HuggingFace token prompt — all models are public.

Add two Gradle product flavors sharing the appType dimension:
- full: includes all llama.cpp native libs (~5.3 MB, arm64)
- lite: strips libllama.so, libggml*.so, libmtmd.so, libllamadart.so

Uses androidComponents.onVariants for per-variant jniLibs exclusion —
flavor-level packaging.jniLibs.excludes applies globally in Gradle and
would strip AI libs from both variants.

Configure llamadart's Native Assets hook in pubspec.yaml to use
cpu_profile: compact (1 baseline CPU variant instead of 7) and
backends: [cpu] (no Vulkan), reducing native lib overhead from 99 MB
to 5.3 MB. Final APK sizes: full arm64 = 41.8 MB, lite = 35.7 MB
(down from 150 MB on the pre-Phase-2 branch).

CI builds both flavors and uploads all four APKs (arm64/x64 × lite/full)
to the GitHub release.
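
The per-flavor builds likely reduce to invocations along these lines (flags inferred from the flavor/entry-point setup; the actual steps live in .github/workflows/release.yml):

```sh
flutter build apk --release --flavor full -t lib/main_full.dart --split-per-abi
flutter build apk --release --flavor lite -t lib/main_lite.dart --split-per-abi
```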
…ction

# Conflicts:
#	.github/workflows/release.yml
#	android/app/build.gradle.kts
coderabbitai (bot) commented Mar 27, 2026

📝 Walkthrough

The PR replaces flutter_gemma with llamadart for on-device LLM and embedding models, introduces Android product flavors (lite/full) and flavor-specific release APKs, adds new model/download/embedding services and model registry, converts ObjectBox embeddings to an HNSW index, and gates AI initialization and UI by a new kAiEnabled flag.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI/CD & Android build (.github/workflows/release.yml, android/app/build.gradle.kts, android/app/proguard-rules.pro) | Release workflow now builds two flavor-specific APKs (lite/full). Gradle adds appType flavors and excludes llama native libs for lite. Removed two MediaPipe -dontwarn entries. |
| Dependency & pubspec (pubspec.yaml) | Removed flutter_gemma, added llamadart:^0.6.8 and native-backend hooks for arm64/x64 CPU. |
| Removed services, flutter_gemma (lib/services/gemma_service.dart, lib/services/embedding_service.dart) | Deleted Gemma and old Embedding service implementations and their public APIs (download/load/generate/embed/streams). |
| New LLM & embedding services (lib/services/llm_service.dart, lib/services/llm_embedding_service.dart, lib/services/model_downloader.dart) | Added LlmService (GGUF model lifecycle, download, load, streaming generation), LlmEmbeddingService (GGUF embedding lifecycle + embed/embedBatch), and ModelDownloader (resumable HTTP downloads, progress, cancellation). |
| Vector store & search (lib/services/vector_store_service.dart) | Replaced external VectorStore with ObjectBox HNSW usage: indexResults() computes and persists embeddings in-place; search() uses ObjectBox nearestNeighborsF32 and returns RetrievalResult. Removed explicit initialize/isInitialized semantics. |
| ObjectBox & models (lib/models/biomarker_result.dart, lib/objectbox-model.json, lib/objectbox.g.dart) | Annotated BiomarkerResult.embedding with @HnswIndex(dimensions: 384). Updated objectbox model metadata and generated code to include HNSW indexId/params. |
| Model config & registry (lib/models/llm_model_config.dart) | Added LlmModelConfig, LlmModelRegistry, GGUF filename/url parsing helpers, curated models, and custom-model validation/creation. |
| Core model additions (lib/models/retrieval_result.dart, lib/models/model_info.dart) | Added RetrievalResult DTO. Minor doc update to ModelInfo.downloadProgress. |
| Bootstrap & entry points (lib/main.dart, lib/main_full.dart, lib/main_lite.dart, lib/constants/ai_prompts.dart) | Extracted appMain({aiEnabled}), added global kAiEnabled, replaced Gemma globals with LlmService/LlmEmbeddingService, created main_full/main_lite entry points, renamed system prompt constant. |
| Screens & UI wiring (lib/screens/splash_screen.dart, lib/screens/chat_screen.dart, lib/screens/settings_screen.dart, lib/screens/reports_screen.dart) | Splash initializes ObjectBox and conditionally initializes LLM/embedding services when kAiEnabled. Chat/settings rewired from Gemma/embedding to LLM services, removed HF token flow, added model picker/custom-model dialog, and gated chat tab/indexing by kAiEnabled. Reports indexing now requires kAiEnabled. |
| Utilities & imports (lib/services/chat_context_builder.dart, lib/services/citation_extractor.dart) | Removed transitive flutter_gemma imports and added explicit RetrievalResult imports. |

Sequence Diagram

sequenceDiagram
    actor User
    participant ChatScreen
    participant LlmService
    participant ModelDownloader
    participant LlamaEngine
    participant ObjectBox

    rect rgba(76, 175, 80, 0.5)
    Note over ChatScreen,LlmService: Download model flow
    User->>ChatScreen: Tap "Download Model"
    ChatScreen->>LlmService: downloadModel()
    LlmService->>ModelDownloader: download(url, filename, onProgress)
    ModelDownloader->>ModelDownloader: resume/stream HTTP chunks
    ModelDownloader-->>LlmService: return local path (on success)
    LlmService->>LlmService: emit status=ready
    end

    rect rgba(33, 150, 243, 0.5)
    Note over ChatScreen,LlamaEngine: Load & generate
    User->>ChatScreen: Tap "Load Model"
    ChatScreen->>LlmService: loadModel()
    LlmService->>LlamaEngine: loadModelFromPath(path)
    LlmService->>LlmService: emit status=loaded
    User->>ChatScreen: Send message
    ChatScreen->>LlmService: generateResponse(message, context?)
    LlmService->>LlamaEngine: generate(prompt)
    LlamaEngine-->>LlmService: stream tokens
    LlmService-->>ChatScreen: stream<String> tokens
    ChatScreen->>ChatScreen: render streaming response
    end

    rect rgba(156, 39, 176, 0.5)
    Note over ChatScreen,ObjectBox: Semantic search via ObjectBox HNSW
    ChatScreen->>LlmService: embed(query)
    LlmService->>LlmService: return embedding
    ChatScreen->>ObjectBox: BiomarkerResult_.embedding.nearestNeighborsF32(embedding, topK)
    ObjectBox-->>ChatScreen: List<RetrievalResult>
    ChatScreen->>ChatScreen: append sources to response
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 I nibble code and hop with glee,
Two flavors sprout — lite and fully free,
Llamadart hums, downloads resume,
HNSW hops through vector bloom,
A rabbit's cheer for features new! 🥕

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title clearly and specifically summarizes the main change: replacing flutter_gemma with llamadart and the APK size reduction from 150 MB to 42 MB, which directly reflects the PR objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage; check skipped. |


coderabbitai (bot) left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/screens/chat_screen.dart (1)

369-377: ⚠️ Potential issue | 🟡 Minor

Missing await on stopGeneration() call.

_startNewChat() calls llmService.stopGeneration() without await, but the same method is awaited in _stopGeneration() (line 349) and _loadSession() (line 217). This inconsistency could cause a race condition if generation is still running when the state is cleared.

🔧 Proposed fix
-  void _startNewChat() {
+  Future<void> _startNewChat() async {
     _generationSubscription?.cancel();
     _generationSubscription = null;
-    llmService.stopGeneration();
+    await llmService.stopGeneration();
     setState(() {
       _currentSession = null;
       _messages.clear();
     });
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/screens/chat_screen.dart` around lines 369 - 377, The _startNewChat()
method calls llmService.stopGeneration() without awaiting it, risking a race
condition; make _startNewChat() async and await llmService.stopGeneration() just
like in _stopGeneration() and _loadSession(), then proceed to cancel
_generationSubscription, clear _currentSession and _messages inside setState;
ensure you preserve existing subscription cancellation and nulling of
_generationSubscription and handle any errors consistently with the other
callers.
🧹 Nitpick comments (1)
android/app/build.gradle.kts (1)

30-36: Release build uses debug signing — intentional for development?

The release build type is configured with signingConfig = signingConfigs.getByName("debug"). This is acceptable for testing and CI artifact generation, but production releases will require a proper release keystore configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@android/app/build.gradle.kts` around lines 30 - 36, The release build
currently sets signingConfig = signingConfigs.getByName("debug") inside the
buildTypes { release { ... } } block, which uses the debug keystore; replace
this with a proper release signing configuration by creating and referencing a
release SigningConfig (e.g., signingConfigs.create("release") or
signingConfigs.getByName("release")) and point the release buildType to that
release SigningConfig, or add logic to load the release keystore properties
(storeFile, storePassword, keyAlias, keyPassword) and assign them to the release
SigningConfig so production builds are signed with the correct keystore instead
of debug.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/models/llm_model_config.dart`:
- Around line 85-97: LlmModelConfig.custom currently accepts any downloadUrl;
add fail-fast validation inside the factory method LlmModelConfig.custom to
parse the URL (use Uri.parse) and ensure it uses the https scheme and the path
ends with ".gguf" (and is non-empty); if validation fails, throw an
ArgumentError with a clear message. Also trim/normalize the input string before
parsing and keep the rest of the returned LlmModelConfig fields unchanged so
invalid/insecure URLs never reach the downloader flow.

In `@lib/screens/settings_screen.dart`:
- Around line 950-977: The URL validation in _submit currently checks the raw
string with url.toLowerCase().endsWith('.gguf'), which allows query-only URLs
and misses cases where the path segment is what matters; update _submit to parse
Uri.parse(url) and validate that uri.pathSegments is non-empty and
uri.pathSegments.last.toLowerCase().endsWith('.gguf') (or, if pathSegments is
empty, check query parameters for a value ending in .gguf) and set _urlError if
not; also update _filenameFromUrl to derive the filename from
uri.pathSegments.last when present, otherwise try extracting a .gguf filename
from query parameter values before falling back to the 'Custom Model' default so
displayName and LlmModelRegistry.custom(name: ...) are always valid.

In `@lib/screens/splash_screen.dart`:
- Around line 70-95: The migration currently only runs when the embedder is
ready at splash; fix by ensuring _migrateEmbeddingsIfNeeded() is invoked after
every successful embedding model load by chaining it to
LlmEmbeddingService.loadModel() wherever that method is called (not just in
splash_screen.dart), or centralize the behavior by modifying
LlmEmbeddingService.loadModel() to call a post-load hook/event that triggers
_migrateEmbeddingsIfNeeded() on success; update all callers (e.g., settings code
that triggers embeddingService.loadModel()) to rely on that hook or to
explicitly .then((_) => _migrateEmbeddingsIfNeeded()) so stale 768-dim rows are
re-embedded immediately after any later embedder download/load.

In `@lib/services/llm_embedding_service.dart`:
- Around line 147-150: unloadModel() currently calls await
_engine!.unloadModel() but leaves the old LlamaEngine instance in _engine,
leaking native resources; update unloadModel() to fully dispose and clear the
engine by awaiting _engine?.dispose() (or dispose/unload as appropriate), then
set _engine = null, mirroring the error-path cleanup used in loadModel(); apply
the same dispose-and-null pattern to the other unload/error paths referenced
around the 187–195 region so every code path disposes the LlamaEngine and nulls
_engine.

In `@lib/services/llm_service.dart`:
- Around line 103-129: switchModel currently proceeds while an existing
downloadModel may still be streaming the previous file, causing stale callbacks
to update state and risking partial-file deletion; before tearing down the
current model or calling ModelDownloader.deleteModel(_config.filename), detect
any in-flight download (e.g. a tracked Future/Task or StreamSubscription like
_currentDownloadTask/_currentDownloadSubscription started by downloadModel),
request cancellation on it, await its completion, and remove its
progress/completion listeners so callbacks cannot update _modelInfo/_config
after the switch; only after the in-flight download has fully settled, call
unloadModel, delete the file, set _config = newConfig, persist prefs, and call
_updateStatus. Also ensure download completion callbacks verify the model id
(compare against _config.id) before updating state.
- Around line 219-223: unloadModel() currently calls _engine!.unloadModel() but
never disposes or nulls _engine, leaking native resources when switchModel()
loads a new engine; update unloadModel() to follow the pattern used in
loadModel()’s error handler by calling _engine!.dispose() after unloadModel(),
then set _engine = null and update the model status (e.g. via
_updateStatus(_modelInfo.copyWith(status: ModelStatus.unloaded))) so the engine
is fully cleaned up before a new model is loaded.

In `@lib/services/model_downloader.dart`:
- Around line 90-103: Before streaming/writing the response body, check
response.statusCode after request.close() and reject any non-success codes
(accept 206 Partial Content or any 2xx full responses); if the status is not
2xx/206 (e.g., 404/500/416) throw an error (or return a failure) and do not
proceed to set startByte/total or write/rename the .part file. Concretely, in
the download method where request.close() returns response (symbols:
request.close(), response, startByte, total, response.contentLength), add an
early guard that reads/consumes any error body if needed, logs/throws an
HttpException with response.statusCode and reasonPhrase, and aborts before any
file write/rename operations so failed HTTP responses never corrupt the local
cache.
- Around line 77-90: The download can be force-closed by cancel() which calls
_client.close(force: true) but exceptions thrown before the chunk loop (e.g.,
during request.close() or reading the response) are not normalized to
DownloadCancelledException; wrap the existing try block that creates _client,
calls request.close(), and iterates the response stream in a try { ... } on
Object catch (e, st) { if (_cancelled) throw DownloadCancelledException();
rethrow; } so any transport errors caused by a forced close are re-thrown as
DownloadCancelledException while other errors continue to propagate; reference
_client, cancel(), request.close(), the response stream iteration, and
DownloadCancelledException when making the change.

In `@pubspec.yaml`:
- Around line 52-54: android-x64 is missing the cpu_profile setting which causes
x64 builds to default to full; add cpu_profile: compact under the android-x64
platform entry to match android-arm64. Locate the android-x64 platform block
(label "android-x64") in pubspec.yaml and add the line cpu_profile: compact
alongside backends: [cpu] so both platforms use the same compact CPU profile.

---

Outside diff comments:
In `@lib/screens/chat_screen.dart`:
- Around line 369-377: The _startNewChat() method calls
llmService.stopGeneration() without awaiting it, risking a race condition; make
_startNewChat() async and await llmService.stopGeneration() just like in
_stopGeneration() and _loadSession(), then proceed to cancel
_generationSubscription, clear _currentSession and _messages inside setState;
ensure you preserve existing subscription cancellation and nulling of
_generationSubscription and handle any errors consistently with the other
callers.

---

Nitpick comments:
In `@android/app/build.gradle.kts`:
- Around line 30-36: The release build currently sets signingConfig =
signingConfigs.getByName("debug") inside the buildTypes { release { ... } }
block, which uses the debug keystore; replace this with a proper release signing
configuration by creating and referencing a release SigningConfig (e.g.,
signingConfigs.create("release") or signingConfigs.getByName("release")) and
point the release buildType to that release SigningConfig, or add logic to load
the release keystore properties (storeFile, storePassword, keyAlias,
keyPassword) and assign them to the release SigningConfig so production builds
are signed with the correct keystore instead of debug.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8da3eb0c-fb2f-4d9f-a781-6272298ef7f2

📥 Commits

Reviewing files that changed from the base of the PR and between 74029b4 and da98430.

⛔ Files ignored due to path filters (1)
  • pubspec.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • .github/workflows/release.yml
  • android/app/build.gradle.kts
  • android/app/proguard-rules.pro
  • lib/constants/ai_prompts.dart
  • lib/main.dart
  • lib/main_full.dart
  • lib/main_lite.dart
  • lib/models/biomarker_result.dart
  • lib/models/llm_model_config.dart
  • lib/models/model_info.dart
  • lib/models/retrieval_result.dart
  • lib/objectbox-model.json
  • lib/objectbox.g.dart
  • lib/screens/chat_screen.dart
  • lib/screens/reports_screen.dart
  • lib/screens/settings_screen.dart
  • lib/screens/splash_screen.dart
  • lib/services/chat_context_builder.dart
  • lib/services/citation_extractor.dart
  • lib/services/embedding_service.dart
  • lib/services/gemma_service.dart
  • lib/services/llm_embedding_service.dart
  • lib/services/llm_service.dart
  • lib/services/model_downloader.dart
  • lib/services/vector_store_service.dart
  • pubspec.yaml
💤 Files with no reviewable changes (3)
  • android/app/proguard-rules.pro
  • lib/services/gemma_service.dart
  • lib/services/embedding_service.dart

Collapsed comment threads: lib/models/llm_model_config.dart, lib/screens/settings_screen.dart

Comment thread lib/screens/splash_screen.dart
Comment on lines +70 to 95

    // AI services — only when enabled (full flavor)
    if (app_main.kAiEnabled) {
      app_main.llmService = LlmService(LlmModelRegistry.defaultModel);
      await app_main.llmService.initialize();

      app_main.embeddingService = LlmEmbeddingService();
      await app_main.embeddingService.initialize();

      app_main.vectorStoreService = VectorStoreService(
        app_main.embeddingService,
      );

      // Auto-load models if already downloaded
      if (app_main.llmService.currentModelInfo.status == ModelStatus.ready) {
        // ignore: unawaited_futures
        app_main.llmService.loadModel();
      }
      if (app_main.embeddingService.currentModelInfo.status ==
          ModelStatus.ready) {
        // ignore: unawaited_futures
        app_main.embeddingService.loadModel().then((_) {
          // Re-embed any results that have stale or missing embeddings
          _migrateEmbeddingsIfNeeded();
        });
      }
    }

⚠️ Potential issue | 🟠 Major

Run the 768→384 migration on the first successful embedder load, not only during splash.

Lines 87-94 only schedule _migrateEmbeddingsIfNeeded() when the embedding model is already ready at startup. After an upgrade, users who download/load the new embedder later in Settings keep stale 768-dim rows until the next app restart, so semantic search/chat retrieval is wrong for that whole session.
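
One way to centralize this, per the prompt below: give the service a post-load hook (the onLoaded callback is an assumption; only loadModel() and the migration exist in the PR):

```dart
class LlmEmbeddingService {
  /// Invoked after every successful model load, wherever it is triggered.
  Future<void> Function()? onLoaded;

  Future<void> loadModel() async {
    // ... existing engine setup and GGUF load ...
    await onLoaded?.call(); // splash wires this to _migrateEmbeddingsIfNeeded
  }
}
```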

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/screens/splash_screen.dart` around lines 70 - 95, The migration currently
only runs when the embedder is ready at splash; fix by ensuring
_migrateEmbeddingsIfNeeded() is invoked after every successful embedding model
load by chaining it to LlmEmbeddingService.loadModel() wherever that method is
called (not just in splash_screen.dart), or centralize the behavior by modifying
LlmEmbeddingService.loadModel() to call a post-load hook/event that triggers
_migrateEmbeddingsIfNeeded() on success; update all callers (e.g., settings code
that triggers embeddingService.loadModel()) to rely on that hook or to
explicitly .then((_) => _migrateEmbeddingsIfNeeded()) so stale 768-dim rows are
re-embedded immediately after any later embedder download/load.

Collapsed comment threads: lib/services/llm_embedding_service.dart, lib/services/llm_service.dart (×2), lib/services/model_downloader.dart (×2)

Comment thread pubspec.yaml
Comment on lines +52 to 54
          android-x64:
            backends: [cpu]


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query: What are the valid cpu_profile options for llamadart native backends configuration?

💡 Result: The valid cpu_profile options for the llamadart native-backends configuration (llamadart_native_backends in pubspec.yaml, for Android ARM platforms) are "full" (the default; bundles all Android ARM CPU variants for the best runtime-optimization coverage) and "compact" (bundles only the baseline CPU variant for a smaller package). An advanced cpu_variants key can override cpu_profile with an exact list of canonical variants such as "android_armv8.2_2", but it is not itself a cpu_profile value.

🌐 Web query: llamadart android-x64 cpu_profile configuration native backends

💡 Result: llamadart is a Dart/Flutter plugin for running llama.cpp GGUF models locally, and supports android-x64 (primarily emulators) among other platforms. For android-x64, native backends default to CPU and Vulkan, configurable in pubspec.yaml under the llamadart_native_backends hook:

    hooks:
      user_defines:
        llamadart:
          llamadart_native_backends:
            platforms:
              android-x64:
                backends: [cpu, vulkan, opencl]  # optional; defaults: cpu, vulkan
                cpu_profile: full                # or 'compact' for a smaller bundle

cpu_profile controls which CPU optimization variants are bundled: full (default) includes all variants for broad compatibility, compact includes only the baseline variant for a smaller package, and the advanced cpu_variants key takes an exact list such as ['android_armv8.0_1'] (ARM-focused; x64 uses x86_64 optimizations). CPU is always included as a fallback, Vulkan requires the Vulkan loader, OpenCL is opt-in, and Apple platforms are non-configurable (Metal). On first run, llamadart auto-downloads/builds native assets via Pure Native Assets with zero manual configuration.
Add missing cpu_profile: compact for android-x64 platform.

The android-arm64 platform specifies both backends: [cpu] and cpu_profile: compact, but android-x64 only has backends: [cpu]. Without an explicit cpu_profile, x64 builds will default to full (all CPU optimization variants), resulting in a larger bundle than arm64. This asymmetry undermines the APK size reduction goals of this PR.

🔧 Proposed fix
           android-x64:
             backends: [cpu]
+            cpu_profile: compact
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    # before
    android-x64:
      backends: [cpu]

    # after
    android-x64:
      backends: [cpu]
      cpu_profile: compact
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pubspec.yaml` around lines 52 - 54, android-x64 is missing the cpu_profile
setting which causes x64 builds to default to full; add cpu_profile: compact
under the android-x64 platform entry to match android-arm64. Locate the
android-x64 platform block (label "android-x64") in pubspec.yaml and add the
line cpu_profile: compact alongside backends: [cpu] so both platforms use the
same compact CPU profile.

priyavratuniyal (Owner, Author) commented:

@codex review

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: da98430ce9


Comment on lines +90 to +94

    final response = await request.close();

    // Determine total size
    int total;
    if (response.statusCode == HttpStatus.partialContent) {

P1: Validate HTTP status before treating model download as successful

This download path starts consuming and persisting the response body without first rejecting non-success HTTP statuses, so 403/404/500 responses (often HTML or JSON error payloads) can be saved as .gguf files and then marked as successfully downloaded. In that scenario the UI transitions to ready, but model load fails later with a misleading error and the bad file remains on disk. Fail fast on any status other than 200/206 before writing the temp file.


Comment thread lib/services/llm_service.dart Outdated
Comment on lines +289 to +292

    if (_isGenerating) await stopGeneration();
    if (_engine != null) {
      await _engine!.unloadModel();
    }

P2: Dispose and clear LlamaEngine when unloading chat model

loadModel() allocates a new LlamaEngine, but unloadModel() only calls unloadModel() on the existing engine and leaves _engine set. On repeated load/unload cycles (including model switches), this can leave prior engine instances undisposed while new ones are created, which risks leaking native llama resources and increasing memory pressure over time. The unload path should dispose and null _engine before subsequent loads.
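
A sketch of the dispose-and-null pattern being asked for (assuming the dispose() call referenced elsewhere in this review):

```dart
Future<void> unloadModel() async {
  if (_isGenerating) await stopGeneration();
  if (_engine != null) {
    await _engine!.unloadModel();
    await _engine!.dispose(); // release native llama resources
    _engine = null;           // allow a fresh engine on the next load
  }
}
```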


Comment thread lib/services/llm_embedding_service.dart Outdated
Comment on lines +187 to +190

    Future<void> unloadModel() async {
      if (_engine != null) {
        await _engine!.unloadModel();
      }

P2: Dispose and clear LlamaEngine when unloading embedder

The embedder service mirrors the same lifecycle issue: loadModel() creates a new LlamaEngine, but unloadModel() never disposes or clears _engine. Repeatedly loading/unloading embeddings can therefore accumulate undisposed native engine state, which is especially risky on memory-constrained devices. Dispose and set _engine to null in the unload flow.


Follow-up commit addressing the review feedback:

- cancel and await in-flight model downloads before switching configs (sketched below)
- ignore stale download callbacks so old transfers cannot mutate the new model state
- dispose LlamaEngine instances on unload and error paths in both chat and embedding services
- reject non-success HTTP responses before caching model files and normalize forced-cancel errors
- share GGUF URL validation and filename derivation between settings and model config creation
- move embedding migration into a reusable post-load flow used by splash and settings
- keep flutter analyze and both lite/full release APK builds passing
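
The download-settling guard from the first bullet, sketched (the _activeDownload handle and its done future are illustrative; deleteModel and the prefs step come from the review thread above):

```dart
Future<void> switchModel(LlmModelConfig newConfig) async {
  // Settle any in-flight transfer before touching files or state,
  // so stale callbacks cannot mutate the new model's status.
  final inflight = _activeDownload;
  if (inflight != null) {
    inflight.cancel();
    await inflight.done.catchError((_) {}); // swallow the cancel error
  }

  await unloadModel();
  await ModelDownloader.deleteModel(_config.filename);
  _config = newConfig;
  await _persistSelectedModel(); // persist prefs; helper name assumed
  _updateStatus();
}
```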

coderabbitai (bot) left a comment

🧹 Nitpick comments (1)
lib/main.dart (1)

219-236: Consider using an enum or constants for navigation indices.

The hardcoded index logic (kAiEnabled ? 3 : 2) works but is fragile if more screens are added. A minor improvement would be to derive indices from the _screens list directly, though this is acceptable for the current scope.

♻️ Optional refactor to derive indices
 _NavItem(
-  index: kAiEnabled ? 3 : 2,
+  index: _screens.length - 1,
   currentIndex: _currentIndex,
   icon: Icons.settings_outlined,
   activeIcon: Icons.settings,
   label: 'Settings',
-  onTap: () =>
-      setState(() => _currentIndex = kAiEnabled ? 3 : 2),
+  onTap: () =>
+      setState(() => _currentIndex = _screens.length - 1),
 ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/main.dart` around lines 219 - 236, The code uses hardcoded navigation
indices (e.g., kAiEnabled ? 3 : 2) in the _NavItem constructors and onTap
handlers which is brittle; change the logic to derive indices from the _screens
list (or use named constants/enum) instead of literal numbers: compute the index
for the Chat and Settings items from _screens.indexOf(...) or from enum values
and replace occurrences in _NavItem(index: ...), currentIndex comparison, and
setState assignments that reference _currentIndex so indices remain correct when
screens change (refer to _NavItem, _currentIndex, kAiEnabled and the _screens
list).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@lib/main.dart`:
- Around line 219-236: The code uses hardcoded navigation indices (e.g.,
kAiEnabled ? 3 : 2) in the _NavItem constructors and onTap handlers which is
brittle; change the logic to derive indices from the _screens list (or use named
constants/enum) instead of literal numbers: compute the index for the Chat and
Settings items from _screens.indexOf(...) or from enum values and replace
occurrences in _NavItem(index: ...), currentIndex comparison, and setState
assignments that reference _currentIndex so indices remain correct when screens
change (refer to _NavItem, _currentIndex, kAiEnabled and the _screens list).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3f1ec2b-8519-4aa3-aa6d-ca390d6bbc60

📥 Commits

Reviewing files that changed from the base of the PR and between da98430 and 6e329d0.

📒 Files selected for processing (7)
  • lib/main.dart
  • lib/models/llm_model_config.dart
  • lib/screens/settings_screen.dart
  • lib/screens/splash_screen.dart
  • lib/services/llm_embedding_service.dart
  • lib/services/llm_service.dart
  • lib/services/model_downloader.dart
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/screens/settings_screen.dart

priyavratuniyal merged commit bac5f9d into main on Mar 27, 2026
2 checks passed