
chore[skiplog|notask]: backmerge release-sdk-0.13.1 — bare-fetch ^3.0.1, decoder-audio ^0.4.0, decoder.ts sync run() #2579

Merged
NamelsKing merged 3 commits into main from backmerge/release-sdk-0.13.1 on Jun 15, 2026

Conversation

@lauripiisang

Contributor

What this PR does

Backmerges the @qvac/sdk + @qvac/bare-sdk 0.13.1 release onto main per gitflow "Keep main aligned". chore[skiplog].

Companion release PR

Files / delta vs main

  • packages/sdk/package.json + packages/bare-sdk/package.json — version 0.13.0 -> 0.13.1; bare-fetch ^2.9.1 -> ^3.0.1; @qvac/decoder-audio ^0.3.7 -> ^0.4.0; sdk dev bare-subprocess ^5.2.3 -> ^6.1.0.
  • packages/sdk/server/utils/audio/decoder.ts — drop the await on decoder.run() (decoder-audio@0.4.0 returns QvacResponse synchronously).
  • packages/sdk/CHANGELOG.md + packages/sdk/changelog/0.13.1/ — changelog.
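
A minimal sketch of the decoder.ts change from the second bullet. `FakeDecoder`, `FakeResponse` and `startDecode` are hypothetical stand-ins introduced only for illustration; the real `@qvac/decoder-audio` and `QvacResponse` surfaces are not shown in this PR and may differ.

```typescript
// Hypothetical stand-ins for illustration only; not the real @qvac/decoder-audio API.
class FakeResponse {
  private readonly chunks: Uint8Array[] = [];

  push(chunk: Uint8Array): void {
    this.chunks.push(chunk);
  }

  // Number of chunks produced so far.
  get size(): number {
    return this.chunks.length;
  }
}

class FakeDecoder {
  // 0.4.0-style (per the PR text): run() returns the response object directly.
  run(input: Uint8Array): FakeResponse {
    const response = new FakeResponse();
    response.push(input);
    return response;
  }
}

// Before (assumed 0.3.x shape): const response = await decoder.run(input);
// After: the call is synchronous, so the caller does not await it.
function startDecode(decoder: FakeDecoder, input: Uint8Array): FakeResponse {
  const response = decoder.run(input);
  return response;
}
```

With this shape the caller holds the response object as soon as `run()` returns, and anything asynchronous happens on the response rather than on the call.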

Cherry-picked cleanly onto main (no conflicts). This is what lands decoder-audio ^0.4.0 + the decoder.ts fix on main (the 0.12.3 backmerge #2565 that would otherwise have carried the decoder-audio bump is being closed).

Sequencing: merge after the 0.13.1 release (#2578) has published.

… dev bare-subprocess ^6.1.0 (sdk + bare-sdk 0.13.1)

decoder-audio@0.4.0 drops the deprecated @qvac/response (consolidated into
@qvac/infer-base) and returns QvacResponse synchronously from run(), so
server/utils/audio/decoder.ts no longer awaits decoder.run().

(cherry picked from commit ca8b494)
@lauripiisang lauripiisang requested review from a team as code owners June 12, 2026 19:28
@lauripiisang lauripiisang changed the title from "QVAC-17357 chore[skiplog|notask]: backmerge release-sdk-0.13.1 — bare-fetch ^3.0.1, decoder-audio ^0.4.0, decoder.ts sync run()" to "chore[skiplog|notask]: backmerge release-sdk-0.13.1 — bare-fetch ^3.0.1, decoder-audio ^0.4.0, decoder.ts sync run()" Jun 12, 2026
@github-actions

github-actions Bot commented Jun 15, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@NamelsKing

Contributor

/review

@NamelsKing NamelsKing merged commit 3002eea into main Jun 15, 2026
25 checks passed
@NamelsKing NamelsKing deleted the backmerge/release-sdk-0.13.1 branch June 15, 2026 10:00
tobi-legan pushed a commit that referenced this pull request Jun 15, 2026
….1, decoder-audio ^0.4.0, decoder.ts sync run() (#2579)

* fix[notask]: adopt @qvac/decoder-audio 0.4 + bump bare-fetch ^3.0.1 / dev bare-subprocess ^6.1.0 (sdk + bare-sdk 0.13.1)

decoder-audio@0.4.0 drops the deprecated @qvac/response (consolidated into
@qvac/infer-base) and returns QvacResponse synchronously from run(), so
server/utils/audio/decoder.ts no longer awaits decoder.run().

(cherry picked from commit ca8b494)

* chore[notask]: sdk + bare-sdk 0.13.1 changelog

(cherry picked from commit 3f6ac86)

---------

Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>
tobi-legan added a commit that referenced this pull request Jun 16, 2026
)

* QVAC-18929 test: add teardown/lifecycle coverage for llm-llamacpp

Adds integration coverage for the addon teardown contract that was
previously untested:
- unload() during active inference must not crash and the model must be
  reusable after a reload (AddonJs.hpp documents a use-after-free risk here)
- run() after unload() must surface a clean error, not segfault
- cancel() then immediate unload() must not race into a use-after-free

These run on desktop (on-pr-llm-llamacpp) and the mobile Device Farm pools
(scheduled via test-groups.json). Assertions are non-empty / type /
clean-error only, never exact generated text.

Also documents why a JS multi-cycle "RSS leak tripwire" was intentionally
NOT added to model-loading.test.js: a 6-cycle load/unload test already
exists (multi-instance.test.js) and the native ASan/LSAN job
(cpp-tests-llm.yml) is the precise leak detector. The addon exposes no
backend/state observable, so an NMT-style post-unload assertion is not
expressible here.

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-18929 chore: trim verbose comments in llm teardown tests

Remove the long rationale block in model-loading.test.js (the "why we didn't
add an RSS tripwire" explanation lives in the PR description, not the test
file) and tighten the comments in api-behavior.test.js to the essential
non-obvious intent. No behavior change — api-behavior.test.js still 8/8.

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-18929 test: parse and verify tool_call structure in tools-compact

Strengthens the two behavioral tests that already check for <tool_call>
presence. Adds a parseToolCalls helper that extracts and JSON-parses the
blocks, then verifies:
- the tool_call has a non-empty name
- the name matches a declared tool
- required argument keys are present

Structural/behavioral checks only - no model-quality assertions.
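
A sketch of what such a helper could look like, not the code from the test file. It assumes each `<tool_call>` block holds a JSON object of the form `{"name": ..., "arguments": {...}}`; `ToolSpec` and `validateToolCall` are hypothetical names used only here.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Hypothetical description of a declared tool: its name and required argument keys.
interface ToolSpec {
  name: string;
  required: string[];
}

// Extract every <tool_call>...</tool_call> block and JSON-parse its body.
function parseToolCalls(text: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const pattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // JSON.parse throws on malformed JSON, so a bad block is not silently dropped.
    calls.push(JSON.parse(match[1]) as ToolCall);
  }
  return calls;
}

// Return a list of structural problems; an empty list means the call looks valid.
function validateToolCall(call: ToolCall, declared: ToolSpec[]): string[] {
  if (typeof call.name !== "string" || call.name.length === 0) {
    return ["empty name"];
  }
  const spec = declared.filter((tool) => tool.name === call.name)[0];
  if (!spec) {
    return ["undeclared tool: " + call.name];
  }
  const args = call.arguments || {};
  const problems: string[] = [];
  for (const key of spec.required) {
    if (!(key in args)) problems.push("missing argument: " + key);
  }
  return problems;
}
```

The checks mirror the three bullets above: non-empty name, name in the declared set, required keys present. Nothing here inspects the argument values, which keeps it a structural check.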

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat[api]: DocTR depthwise convs via direct Metal CONV_2D_DW kernel (+ all-consumer fabric overlay) (#2536)

* feat[api]: fold DocTR detection BatchNorm into conv weights

The DBNet detection graph applied BatchNorm as a runtime scale/shift after
every conv: `conv + add(bias_br) + mul(scale) + add(shift) + act`, where
bias_br is a zero tensor for the (bias=False) BN convs. That is three
full-tensor elementwise passes per conv on top of the conv itself.

Fold the per-channel BN scale into the F16 conv weights and combine the conv
bias and BN shift into a single bias at load time:

    out = scale*(W*x + bias) + shift = (scale*W)*x + (scale*bias + shift)

so the runtime graph collapses to `conv + add(combined) + act` (one pass).
This removes ~60 elementwise passes from the detection graph, which matters
most on bandwidth/dispatch-constrained mobile GPUs where detection is ~55% of
the DocTR pipeline.
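
The fold identity above can be checked on a toy case. This is a sketch under simplifying assumptions: the conv is reduced to a per-output-channel dot product over plain numbers, and `foldBatchNorm`, `applyLinear` and `applyUnfolded` are hypothetical helpers, not functions from the addon.

```typescript
// Fold per-output-channel BN scale/shift into weights and bias:
//   W' = scale * W,  b' = scale * bias + shift
function foldBatchNorm(
  weights: number[][], // [outputChannel][inputIndex]
  bias: number[],
  scale: number[],
  shift: number[]
): { weights: number[][]; bias: number[] } {
  return {
    weights: weights.map((row, oc) => row.map((w) => w * scale[oc])),
    bias: bias.map((b, oc) => scale[oc] * b + shift[oc]),
  };
}

// Stand-in for the conv: one dot product per output channel plus bias.
function applyLinear(weights: number[][], bias: number[], x: number[]): number[] {
  return weights.map((row, oc) =>
    row.reduce((acc, w, i) => acc + w * x[i], bias[oc])
  );
}

// Original runtime graph: conv + bias, then BN scale and shift as separate passes.
function applyUnfolded(
  weights: number[][],
  bias: number[],
  scale: number[],
  shift: number[],
  x: number[]
): number[] {
  return applyLinear(weights, bias, x).map((v, oc) => scale[oc] * v + shift[oc]);
}
```

For example, with weights `[[1, 2], [3, 4]]`, bias `[0, 0.5]`, scale `[2, 0.5]`, shift `[1, -1]` and input `[1, 1]`, both paths give `[7, 2.75]`; the folded path just gets there in a single pass.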

foldScaleIntoConv() folds scale into the preceding conv weight (per output
channel, ne[3]) and absorbs bias_br into the shift tensor; it handles both
normal BN (running stats present) and the offline-folded identity path. The
sub-pixel transposed conv in the prob head is left as a runtime scale/shift
since its weight is reshaped at graph build.

Numerically exact: region counts unchanged (365/197/187/197) and all DocTR
integration quality tests pass on Apple Metal (M4); ct_scan 1173->1157ms,
clinical 858->841ms.

* feat[api]: fold DocTR recognizer BatchNorm into conv weights

Apply the same BN-into-weights fold to the CRNN MobileNetV3-small feature
extractor: fold each BN's per-channel scale into the preceding conv's F16
weights at load time so applyBn drops the runtime multiply and keeps only the
shift add. The feature-extractor graph runs once per recognition batch (dozens
of times on a dense page), so removing a full-tensor multiply per conv is
amplified across the page.

Numerically exact: region counts unchanged and all DocTR integration quality
tests pass on Apple Metal (M4). Combined with the detection fold: ct_scan
1173->1134ms, clinical 858->829ms.

* feat[api]: use direct depthwise kernel for DocTR conv2d-dw on GPU

The DBNet detector and CRNN recognizer feature extractors are dominated by
depthwise convolutions. ggml's `ggml_conv_2d_dw` lowers each depthwise conv to
im2col + a per-channel batched matmul (C tiny matmuls), which is pathologically
slow on Metal — a skip-test (replacing every depthwise with a cheap same-shape
op) showed recognition was ~entirely depthwise (rec 0.7s -> ~0 on ct_scan, M4).

Switch both feature extractors to `ggml_conv_2d_dw_direct` (GGML_OP_CONV_2D_DW),
which runs as a single fused kernel (one read, one write, no im2col buffer). This
requires the companion Metal kernel for GGML_OP_CONV_2D_DW in the ggml fork
(qvac-ext-ggml); CPU and Vulkan already implement it.
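
To make "direct" concrete, here is a reference-style sketch of a depthwise convolution computed in one pass, with no im2col buffer. It is plain TypeScript for illustration only, assuming stride 1, no padding and square kernels; `depthwiseConv2d` is a hypothetical name and says nothing about how the ggml kernels are written.

```typescript
// input:   [channel][row][col]
// kernels: [channel][kRow][kCol], one kernel per channel (depthwise)
// Assumptions for this sketch: stride 1, no padding, square kernels.
function depthwiseConv2d(input: number[][][], kernels: number[][][]): number[][][] {
  return input.map((plane, c) => {
    const kernel = kernels[c];
    const k = kernel.length;
    const outH = plane.length - k + 1;
    const outW = plane[0].length - k + 1;
    const out: number[][] = [];
    for (let y = 0; y < outH; y++) {
      const row: number[] = [];
      for (let x = 0; x < outW; x++) {
        // Each output value reads only its own channel's K*K window.
        let acc = 0;
        for (let ky = 0; ky < k; ky++) {
          for (let kx = 0; kx < k; kx++) {
            acc += plane[y + ky][x + kx] * kernel[ky][kx];
          }
        }
        row.push(acc);
      }
      out.push(row);
    }
    return out;
  });
}
```

Because channels never mix, there is no matmul to batch; the im2col lowering pays for a buffer and C tiny matmuls to compute what this loop does directly.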

Depthwise weights ([KW,KH,1,C], KW>1) are promoted to F32 at load so the op runs
on every backend (CPU's conv_2d_dw_direct requires F32; Metal/Vulkan accept F16
too but CPU does not). F32 is perf-neutral on the GPU — the per-channel K*K
weights are register-resident and the F32 activations dominate bandwidth either
way (measured identical). The load-time BN-scale fold now folds into F16 or F32
weights, and the recognizer weight upload converts F16 GGUF tensors to F32 for
depthwise.

Result on Apple M4 Metal (warm, vs the BN-fold baseline -> with both detection
and recognizer depthwise kernels): clinical 858->584ms, ct_scan 1173->754ms,
lab 838->579ms, liver 841->569ms (~31-36%). All DocTR integration quality tests
pass on Metal AND forced-CPU (region counts identical, keyword asserts intact).

* test[notask]: overlay qvac-fabric at the Metal CONV_2D_DW merge commit (all consumers)

Add a vcpkg overlay port to every qvac-fabric consumer (classification-ggml,
embed-llamacpp, llm-llamacpp, ocr-ggml, translation-nmtcpp, vla-ggml) that
builds fabric from the temp-8828 merge commit of the depthwise-conv kernel
(qvac-fabric-llm.cpp#148, commit 7bcd140f, version 8828.1.1). This validates the
new fabric across all consumers — and gives ocr-ggml the kernel its
ggml_conv_2d_dw_direct path needs — before fabric is tagged 8828.1.1 and a
registry port is cut. Revert these overlays + bump to the tagged port before merge.

* style: clang-format DocTR depthwise + BatchNorm fold

* fix[api]: clearer BN-fold dtype errors + comment on F16 scale round-trip

Address review feedback on the DocTR BatchNorm fold:
- The unsupported-dtype guard now names the actual ggml type and states
  that quantized conv weights are not supported by the fold (was a
  generic "unexpected conv weight dtype").
- output-channel mismatch now reports conv oc vs BN scale size.
- Comment the F16 weight scale: decode->scale->re-encode is required
  because F16 has no arithmetic; it is not an f16->f16 copy.

No behavior change; messages/comments only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports

Scope this PR to the all-consumer qvac-fabric overlay-port bump (which
validates that the new fabric does not regress any consumer). The DocTR
depthwise kernel switch (conv_2d_dw -> conv_2d_dw_direct + F32 depthwise
weight promotion + BN-fold dtype handling) will land separately.

Restores both files under packages/ocr-ggml/addon/src/model-interface/doctr/
to the base branch state.

* Revert "revert[api]: drop DocTR depthwise code changes, keep only fabric overlay ports"

This reverts commit 8c753e8.

* chore[notask]: migrate qvac-fabric 8828.1.1 from overlay to registry

8828.1.1 (Metal CONV_2D_DW kernel) is now published in qvac-registry-vcpkg
(tetherto/qvac-registry-vcpkg#189). Bump each consumer's qvac-fabric
version>= 8828.1.0 -> 8828.1.1 and drop the temporary vcpkg-overlay-ports
that pinned the unreleased temp-8828 commit.

The default-registry baseline is intentionally unchanged: vcpkg resolves
version>= against the registry HEAD, so a fixed baseline still picks up the
new tagged version.

Consumers: classification-ggml, embed-llamacpp, llm-llamacpp, ocr-ggml,
translation-nmtcpp, vla-ggml.

* chore[notask]: bump addon versions for qvac-fabric 8828.1.1

Minor-bump each consumer (major for ocr-ggml) and add a CHANGELOG entry for
the qvac-fabric 8828.1.1 dependency (direct Metal CONV_2D_DW depthwise kernel).
Minor bumps keep these out of the SDK 0.13.0 caret ranges so they are not
auto-picked.

- classification-ggml 0.3.1 -> 0.4.0
- embed-llamacpp      0.19.1 -> 0.20.0
- llm-llamacpp        0.24.0 -> 0.25.0
- ocr-ggml            0.1.1  -> 1.0.0   (major; cuts the prior Unreleased section)
- translation-nmtcpp  5.0.1  -> 5.1.0
- vla-ggml            0.3.2  -> 0.4.0

* style: clang-format DocTR recognizer BN-fold tensor get/set

Wrap the two over-long ggml_backend_tensor_get/set calls in the F16
BN-fold branch to satisfy the lint-cpp clang-format config
(AlignAfterOpenBracket: AlwaysBreak). No behavior change.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* chore: drop bare-runtime and bare-pack from bare-sdk deps (#2585)

bare-runtime is only reachable via node-rpc-client (the Node host path that spawns bare); bare-sdk pins #rpc to bare-client, so it is never imported on bare. Removing it also avoids pulling ~80MB of per-platform bare prebuilds at install time.

bare-pack is only used by the Node-side bundle command, lazily resolved with a graceful BarePackNotInstalledError. Neither is reachable on bare. Both stay in @qvac/sdk and are added to SDK_ONLY_PACKAGES so check-deps-vs-sdk allows the intentional divergence.

* chore[notask]: bump bare-fetch to ^3.0.1 in ocr-ggml and translation-nmtcpp (#2584)

Aligns both addons with the latest bare-fetch major already used by rag
(0.6.3) and ocr-onnx, removing the duplicate older bare-fetch major from
the dependency tree.

- @qvac/ocr-ggml: 0.2.0 -> 0.2.1
- @qvac/translation-nmtcpp: 6.0.0 -> 6.0.1

* doc[notask]: release docs v0.13.0 (minor) (#2573)

Co-authored-by: NamelsKing <18405840+NamelsKing@users.noreply.github.com>
Co-authored-by: Bruno Campana <7632562+BrunoCampana@users.noreply.github.com>

* chore[skiplog|notask]: backmerge release-sdk-0.13.1 — bare-fetch ^3.0.1, decoder-audio ^0.4.0, decoder.ts sync run() (#2579)

* fix[notask]: adopt @qvac/decoder-audio 0.4 + bump bare-fetch ^3.0.1 / dev bare-subprocess ^6.1.0 (sdk + bare-sdk 0.13.1)

decoder-audio@0.4.0 drops the deprecated @qvac/response (consolidated into
@qvac/infer-base) and returns QvacResponse synchronously from run(), so
server/utils/audio/decoder.ts no longer awaits decoder.run().

(cherry picked from commit ca8b494)

* chore[notask]: sdk + bare-sdk 0.13.1 changelog

(cherry picked from commit 3f6ac86)

---------

Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>

* QVAC-19368 infra: rebalance Android Device Farm shards + faster mobile CI for LLM (#2466)

* perf(ci): split Android Device Farm shards to match iOS heavy/light pattern

Android groupB (11 tests, 49 min) and groupImagesPerf (3 VLM tests,
69 min) were serialising heavy tests on a single device — hitting the
2h job timeout on Pixel. Mirror the iOS strategy: isolate each heavy
test into its own group (heavy1–heavy10) and bundle fast tests into
lightA/lightB (12 groups total). Longest single shard drops from
~69 min to ~23 min; pool recycles devices across groups dynamically.

Co-authored-by: Cursor <cursoragent@cursor.com>

* perf(ci): reduce Android shards from 12 to 6 to avoid PENDING_CONCURRENCY

The 12-group mirror of iOS overwhelmed the Device Farm account
concurrency limit (24 total runs: 12 iOS + 12 Android). Groups
queued up to 12.5 min on Android and 28 min on iOS waiting for a
slot, making the monitor step slower than the original 3-group layout.

Revised to 6 Android groups (18 total with iOS):
  - heavyA/heavyB: split the old groupB heavy tests into 2 balanced shards
  - imagePerfA/imagePerfB: split VLM tests 2+1 to avoid the 69-min single-group bottleneck
  - lightA/lightB: fast tests bundled

Expected critical path: ~40-50 min (vs 69 min old, 87 min with 12 groups).

Co-authored-by: Cursor <cursoragent@cursor.com>

* perf(ci): parallelize Device Farm log downloads across runs

With 6 Android groups × 3 devices each = 18 device-jobs, the serial
log download took 52 min (each device-job ~3-7 min of API calls +
artifact downloads). Process each run's logs in parallel (up to 4
concurrent), so the total is bounded by the slowest single run (~18 min)
rather than the sum of all runs.

Combined with the 6-group monitor improvement (57 min vs old 69 min),
the estimated total Android job time drops to ~86 min — well within
the 120 min timeout.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(ci): descriptive group names + test legend in Device Farm monitor

Rename test groups to be self-documenting:
  - iOS: heavy1..heavy10 → finetuning, toolCalling, reasoning, etc.
  - Android: heavyA → heavyA-finetune-reason-ocr, imagePerfB → imagePerf-fruitPlate, etc.

Add test-specs passthrough to the monitor step so it can print:
  - A "Run → tests" legend at the start (which tests are in each run)
  - Test names in the final results section next to each run link

Now when a run fails you can immediately see which test(s) it contained
without cross-referencing test-groups.json.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(ci): wire test-specs to monitor step for all addons

Pass test-specs from upload-to-devicefarm through to the monitor step
in all 12 addon integration workflows. Gives every addon the run-to-tests
legend and test names in final results — not just LLM.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ci): extend VLM pre-warmup to Android + add heavyC group

Two issues from the Android shard split:

1. Image test instability: the fruit-plate test relied on elephant
   running first in the same group to warm up the VLM model. With
   split groups each image test cold-starts alone, causing crashes
   on Android. Extended the iosWarmupImage pre-warmup to all mobile
   platforms (isMobile) so fruit-plate gets the elephant pre-warmup
   on Android too.

2. Heavy group imbalance: heavyA (4 tests, ~44 min) and heavyB
   (3 tests, ~45 min) were both too slow. Split into 3 balanced
   groups of 2-3 tests each:
   - heavyA-finetune-reasoning (2 tests)
   - heavyB-toolCall-gemma (2 tests)
   - heavyC-ocr-sliding (3 tests)

Android now has 7 groups (19 total with iOS 12).

Co-authored-by: Cursor <cursoragent@cursor.com>

* perf(ci): skip Setup/Teardown suite artifacts + raise parallel limit

Two optimizations for Device Farm log collection:

1. Skip 'Setup Test' and 'Teardown Test' suites — they only contain
   framework bookkeeping (home screen screenshots, install logs), not
   test output. Saves 2 list-artifacts API calls + downloads per
   device-job (21 Android device-jobs × 2 = 42 fewer API round-trips).

2. Raise MAX_PARALLEL from 4 to 8 so all runs (up to 7 Android + 12
   iOS) download simultaneously instead of in waves. AWS Device Farm
   API handles this fine — the bottleneck was I/O wait, not CPU.

Target: Android log collection from 25 min → ~12-15 min.

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 perf(ci): skip VLM aurora image on normal on-PR runs

The 3-image VLM perf (gemma4 + qwen3-5) made the Android on-PR leg run too
long. aurora is the heaviest image, so skip it when QVAC_PERF_RUNS is at
the on-PR default (<=1); the benchmark (QVAC_PERF_RUNS>1) still runs all 3.
On-PR now covers elephant + fruit-plate, keeping the Android run under ~1h.

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 feat(ci): show test list in monitor for non-grouped addons

Default single-spec mode (addons without test-groups.json, e.g. NMT) runs
with an empty grep, so the monitor's "Run → tests" legend showed nothing.
Enumerate the addon's generated mobile runners (integration.auto.cjs) and
emit them as a display-only `tests` field on each spec; the monitor prefers
`tests` and falls back to `grep`. Grep stays empty so run behaviour is
unchanged — this only enriches the legend.

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 feat(ci): per-test-case results summary JSON + run-summary link

Consolidate the per-device *_test-results.json into one
test-results-summary.json (each test case with status + duration per
device, gate-skips surfaced as 'skipped'), ship it inside the console-logs
artifact, and write a compact ✅/❌/⏭️ table + an artifact link to the
GitHub Step Summary. Makes it easy to see whether each case ran, passed,
failed, or was skipped (e.g. the VLM aurora on-PR gate). Mobile only for
now; desktop to follow.

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 perf(ci): consolidate Android shards by measured runtime (10 -> 6)

The Android job was hitting the 120-min cap because ~15 Device Farm runs
queued behind the account concurrency limit (9-20min wait each), starving
the downstream collect/extract steps. Using measured worst-case per-test
runtimes, rebalance into 6 groups: toolCalling (~30m) and gemma4 (~29m)
run solo (each near the per-test cap), the other functional tests pack into
two ~50m shards, and the vlmPerf groups stay dedicated (so the benchmark
perf-only filter still isolates them). Fewer runs = less queue contention =
shorter monitor wait. All 30 functions stay covered; iOS unchanged.
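
One way a runtime-based rebalance could be computed is a greedy "longest test first" packing. This is only a sketch: the commit does not say how the groups were derived, and `packShards` plus any durations passed to it are hypothetical.

```typescript
interface Shard {
  tests: string[];
  minutes: number;
}

// Greedy longest-processing-time packing: take tests from slowest to fastest
// and always place the next one on the currently lightest shard.
function packShards(durations: Record<string, number>, shardCount: number): Shard[] {
  const shards: Shard[] = [];
  for (let i = 0; i < shardCount; i++) {
    shards.push({ tests: [], minutes: 0 });
  }
  const names = Object.keys(durations).sort((a, b) => durations[b] - durations[a]);
  for (const name of names) {
    let target = shards[0];
    for (const shard of shards) {
      if (shard.minutes < target.minutes) target = shard;
    }
    target.tests.push(name);
    target.minutes += durations[name];
  }
  return shards;
}
```

A test that is already near the per-test cap naturally ends up alone or nearly alone, which matches the idea of running the slowest tests solo and packing the rest.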

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 perf(ci): download only essential Device Farm artifacts

Per team agreement, remove the download + upload of the full Device Farm
log tree (screenshots, XML, install logs, videos) — nobody uses it and it
adds significant download time to the already-tight Android job. Only
Customer_Artifacts.zip (bare_console.log, test-results.json, logcat_full,
perf data) and Logcat files (C++ logs) are kept. The extracted console-logs
and perf-report artifacts are unchanged. Raw Device Farm artifacts are
still accessible via the AWS console links in the monitor output.
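
A sketch of the kind of name filter this implies, assuming artifact names arrive as plain strings and may be spelled with a space or an underscore. `isEssentialArtifact` is a hypothetical helper; the workflow's actual filter is not shown here.

```typescript
// Keep only Customer Artifacts and Logcat; everything else is skipped.
// Tolerates "Customer Artifacts", "Customer_Artifacts" and a ".zip" suffix.
function isEssentialArtifact(name: string): boolean {
  const normalized = name
    .toLowerCase()
    .replace(/[_\s]+/g, " ")
    .replace(/\.zip$/, "")
    .trim();
  return normalized === "customer artifacts" || normalized.indexOf("logcat") === 0;
}
```

Normalizing before comparing avoids the mismatch between the display name and the file name of the same artifact.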

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 fix(ci): match Device Farm 'Customer Artifacts' name with space in download filter

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 fix(ci): revert to sequential download loop with name filter

The parallelized download_run_logs with export -f was crashing on both
iOS and Android (exit code 1 before downloading any artifacts). Revert to
main's proven sequential loop structure and add the name filter there
instead. The filter still skips TCP dump (624MB), screenshots, XML, videos
— only Customer Artifacts + Logcat are downloaded. Job-level artifacts
restored too (iOS needs the job-level Customer_Artifacts.zip).

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 feat(ci): parse individual test() cases from TAP output

Parse brittle's TAP output (ok N / not ok N lines) from logcat_full.txt
(Android) and bare_console.log (iOS) to surface every individual test()
case with its status (passed/failed/skipped) and timing per device.
Produces test-case-details.json in the console-logs artifact with both
runner-function-level and per-test() detail. The GitHub Step Summary gets a
runner table + a collapsible per-test-case table so reviewers can see at a
glance whether a newly added test() actually ran.
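
A minimal sketch of the TAP line parsing described above. It assumes standard TAP result lines (`ok N - name`, `not ok N - name`, with an optional `# SKIP` directive) possibly preceded by a log prefix; `parseTap` and `TapResult` are hypothetical names, and brittle's exact output may carry more detail than this handles.

```typescript
interface TapResult {
  num: number;
  name: string;
  status: "passed" | "failed" | "skipped";
}

// Pull "ok N - name" / "not ok N - name" results out of a mixed log.
function parseTap(log: string): TapResult[] {
  const results: TapResult[] = [];
  // \b lets a log prefix precede the TAP line without matching words like "look".
  const linePattern = /\b(not ok|ok) (\d+)(?: - (.*))?$/;
  for (const rawLine of log.split("\n")) {
    const match = linePattern.exec(rawLine.trim());
    if (!match) continue;
    const description = match[3] || "";
    const skipped = /#\s*skip/i.test(description);
    results.push({
      num: parseInt(match[2], 10),
      // Drop any trailing "# ..." directive from the displayed name.
      name: description.replace(/\s*#.*$/, "").trim(),
      status: skipped ? "skipped" : match[1] === "ok" ? "passed" : "failed",
    });
  }
  return results;
}
```

Lines that are not TAP results are ignored, so the same function can run over `logcat_full.txt` or `bare_console.log` content without pre-filtering.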

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 fix(ci): clean device labels in test-case-details (strip log-type suffix)

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 fix(ci): normalize test names + escape markdown in test-case summary

Two fixes for the test-case-details step summary:

1. Normalize dynamic values in test names so the same logical test merges
   across devices — e.g. 'CacheTokens (53) > 0' and 'CacheTokens (55) > 0'
   both become 'CacheTokens (N) > 0'. This was inflating Android's count
   (1240) vs iOS (633) because each device produced slightly different
   token counts in assertion names.

2. Escape markdown-special characters (|, <, >, backtick) in test names
   before writing to the step summary table, so test descriptions containing
   these characters don't break the table layout.
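
Both fixes are small string transforms. The sketch below assumes the dynamic values are integers in parentheses, as in the example above; `normalizeTestName` and `escapeMarkdownCell` are hypothetical names.

```typescript
// 'CacheTokens (53) > 0' and 'CacheTokens (55) > 0' both become 'CacheTokens (N) > 0'.
function normalizeTestName(name: string): string {
  return name.replace(/\(\d+\)/g, "(N)");
}

// Backslash-escape characters that would break a markdown table cell.
function escapeMarkdownCell(text: string): string {
  return text.replace(/[|<>`]/g, (ch) => "\\" + ch);
}
```

Normalization runs before merging across devices; escaping runs last, right before the name is written into the table.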

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 fix(ci): strip trailing quotes from TAP test names (ReactNativeJS echo)

Android logcat echoes TAP lines twice: once from bare, once from
ReactNativeJS wrapped in single quotes ('ok 1 - name'). The trailing
quote made every test appear as a duplicate (e.g. 'All models available'
vs 'All models available' with trailing quote), inflating Android OCR
from 790 to 1461 test cases. Strip trailing quotes before deduplication.

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 fix(ci): dedup TAP by normalized name + truncate assertion details

Two fixes for correct test case counts across platforms:

1. Deduplicate TAP results by normalized name (not num+name) so perf
   iterations that reuse the same test name at different TAP numbers
   don't inflate the count. NMT was 364/546 → now 182=182.

2. Truncate test names at assertion detail markers ('. Found ', ': "',
   ', got ') so variable model output embedded in assertion messages
   doesn't create per-device duplicates. LLM elephant tests with
   'Found keywords: elephant' in different phrasing now merge.

Verified across all addons with real data:
  OCR: Android=669, iOS=669 (exact match)
  NMT: Android=182, iOS=182 (exact match)
  LLM: Android=622, iOS=608 (16 A-only = bitnet/Android-only tests,
       2 I-only = Metal/iOS-only tests — all genuinely platform-specific)
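
The name canonicalization and dedup described in this commit and the previous one can be sketched as follows. `canonicalName` and `dedupByName` are hypothetical names; the marker list is the one quoted above, and the `(N)` rule is assumed from the earlier normalization fix.

```typescript
// Assertion-detail markers after which the name carries variable model output.
const DETAIL_MARKERS = [". Found ", ': "', ", got "];

function canonicalName(raw: string): string {
  // Strip the trailing quote left by the quoted ReactNativeJS echo.
  let name = raw.trim().replace(/'+$/, "");
  for (const marker of DETAIL_MARKERS) {
    const at = name.indexOf(marker);
    if (at !== -1) name = name.slice(0, at);
  }
  // Collapse dynamic integers such as token counts.
  return name.replace(/\(\d+\)/g, "(N)").trim();
}

// Keep the first occurrence of each canonical name, ignoring TAP numbers.
function dedupByName(names: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const raw of names) {
    const key = canonicalName(raw);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(key);
    }
  }
  return unique;
}
```

Keying on the canonical name alone is what lets repeated perf iterations and per-device phrasing differences collapse into one test case.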

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 fix(ci): address review — Android-scoped aurora skip + revert pre-warmup

Two fixes per Dima's review:

1. Aurora skip is now Android-scoped using the explicit QVAC_PERF_ONLY
   flag (already plumbed to the device via the testspec config) instead
   of proxying off PERF_RUNS. iOS + desktop always run aurora. The
   benchmark (QVAC_PERF_ONLY=true) runs all 3 images on all platforms,
   even with runs=1.

2. Revert the Android pre-warmup extension back to iOS-only. The change
   was silently altering what Android perf numbers measure (cold first-
   run vs warm steady-state) and doesn't fix the crash it targeted
   (the large buffer allocation still happens on the first real-image
   pass). Restores historical comparability of Android perf data.
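
The gating rule in item 1 reduces to a single predicate. A sketch, with `shouldSkipAurora` and the `Platform` type as hypothetical names; how `QVAC_PERF_ONLY` is read on the device is not shown here.

```typescript
type Platform = "android" | "ios" | "desktop";

// Skip aurora only on Android, and only when this is not a perf-only benchmark run.
function shouldSkipAurora(platform: Platform, perfOnly: boolean): boolean {
  return platform === "android" && !perfOnly;
}
```

The run count no longer appears in the condition, so a benchmark with runs=1 still covers all three images.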

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 perf(ci): split cacheStateMachine to solo + rebalance to 3 func shards

cacheStateMachine takes 30m on Pixel (hit the per-test Mocha timeout in
funcShardB). Move it to a solo group (like toolCalling and gemma) and
rebalance the remaining functional tests into 3 shards (~25-29m each on
Pixel worst-case). Total Android groups: 8 (3 solo + 3 func + 2 vlmPerf).

Co-authored-by: Cursor <cursoragent@cursor.com>

* QVAC-19368 infra: bump LLM mobile job timeout to 150min (from 120min)

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(review): address Ian's feedback on test assertions

- parseToolCalls: surface malformed JSON as t.fail() instead of silent catch
- api-behavior: use instanceof Error to reject undefined/string rejections

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: olyasir <sirkinolya@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: NamelsKing <18405840+NamelsKing@users.noreply.github.com>
Co-authored-by: Bruno Campana <7632562+BrunoCampana@users.noreply.github.com>
Co-authored-by: Lauri Piisang <lauri.piisang@gmail.com>
Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>