Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
build/
testdbs/*
tmp_dbs/*
tmp/
.gradle
config/application.properties
gradle/
Expand Down
153 changes: 153 additions & 0 deletions USH-6370-case_search_bug/Notes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
## issue
When using case search with grouped cases formplayer sometimes includes child cases of the wrong type. All cases should have the same case type.

This does not happen very often but we have script to reproduce it: `test-case-search_py.py`

## logging

### production

On the formplayer side we added logging that is specific to the BHA where we have seen the issue. It depends on the first column to be empty when the cases are not of the correct type. That change is in master in the file: `/src/main/java/org/commcare/formplayer/application/MenuController.java#L216`

### staging

The branch `riese/skip_cache_2` has more logging and has been deployed to staging. All the added logs start with the prefix "USH-6370"

## Things to investigate

- [ ] Does the logging in CaseSearchHelper rule out the issue being in the hq case search?
- [ ] Make a table of logging locations and if we see the issue there.
- [ ] How do the wrongly returned cases related to the parent? Are they actual children? Just of the wrong type?

## Checkpoint table

Tracks where the wrong-case-type issue is visible as the response flows
through formplayer. "ok" = only the expected case type seen. "mismatch" =
multiple case types observed.

| # | Checkpoint | File:Line | What it checks | Result | Count |
|---|---------------------------------------------|----------------------------------------------|----------------------------------------------------------------|----------|-------|
| 1 | `validateCaseTypesInResponse` | CaseSearchHelper.java:175 | Raw HQ XML response bytes | ok | 184 cases |
| 2 | `validateCaseTypesInStorage` | CaseSearchHelper.java:219 | SQLite storage after parsing | ok | — |
| 3 | `initReferences` (commcare-core) | EntityScreen.initReferences | Case IDs picked up by the nodeset evaluation | oversized | 238 IDs (54 not in HQ response) |
| 4 | `entityListScreen.getEntities` (sb1) | EntityListResponse.java ~L91 | Full entity list before pagination | mismatch | 5 types: unit, commcare-case-claim, provider, clinic, capacity |
| 5 | `paginateEntities` (sb2) | EntityListResponse.java ~L102 | Paginated page of entities | mismatch | 2 types: unit, capacity |
| 6 | `processEntitiesForCaseList` (sb3) | EntityListResponse.java ~L112 | Final `EntityBean[]` after bean conversion | mismatch | 2 types: unit, capacity (106 IDs, only 13 in HQ) |
| 7 | `controller` (`logCaseTypeColumnIfPresent`) | MenuController.java:225 | Final `EntityListResponse` returned from controller | mismatch | 2 types: unit, capacity |

### Key findings from sentry_logs_2

1. **HQ → SQLite is clean.** Checkpoints 1–2 rule out HQ and parsing.
2. **`initReferences` already has 54 extra IDs** not present in the HQ
response. The nodeset is evaluating against something larger than the
search-results instance (or against the search-results instance
after it's been polluted).
3. **Of the 106 cases shown to the user, only 13 came from the HQ
search response.** 93 are from elsewhere. So the entity list is
overwhelmingly populated with cases outside the search results.
4. **Pagination/bean conversion don't change the set** — they just
narrow/transform it. Mismatch at checkpoints 5/6/7 is downstream
of the root cause.
5. Checkpoint 6's prior "ok" reading was **invalid** — it only detected
`GraphData` → HTML transforms via `processData()`, never case-type
changes. Fixed in commit `cc98d17bc`.

### Hypothesis

The `EntityScreen`'s evaluation context references an `ExternalDataInstance`
whose contents include more than the current request's case search
results — either a stale/cached instance from a prior request, or the
data instance is being merged with the regular casedb. Worth checking:

- The `ExternalDataInstance` identity/source bound to the screen at init
time vs. what was just written to SQLite for this request.
- `entityScreenCache` in `MenuSession` — does a reused cached screen
carry a stale data instance reference?
- Whether `FormplayerRemoteInstanceFetcher` returns the fresh instance
even when `skipCache=true` is in effect.

### Findings from sentry_4 (EvalContext inventory, after logging fix)

On a confirmed failure iteration, the `results` `DataInstance` bound to
the `EntityScreen` holds **1189 cases of 4 types** (`capacity`, `unit`,
`provider`, `clinic`) — even though HQ returned only 184 capacity cases
and `validateCaseTypesInStorage` said "ok" earlier in the same request.

| Instance | class | ref | rootName | children | caseTypes (sampled) |
|-----------------------|------------------|------------------------------------|----------|----------|------------------------------------------|
| `casedb` | CaseDataInstance | jr://instance/casedb | casedb | 380 | [commcare-case-claim] (5) |
| **`results`** | CaseDataInstance | jr://instance/remote/results | casedb | **1189** | **[capacity, unit, provider, clinic]** (50) |
| `search-input:results`| ExternalDataInstance | jr://instance/search-input/results | input | 0 | — |
| `commcaresession` | ExternalDataInstance | jr://instance/session | session | 3 | — |

Source URI on `results` is correct (points at the case search endpoint
for this app). Both `casedb` and `results` are backed by
`CaseInstanceTreeElement` (SQLite). So the `results` instance is wired
to a SQLite table that contains 1189 cases from multiple case types,
not just the 184 capacity cases this request wrote and validated.

### New hypothesis (post sentry_4)

Between `validateCaseTypesInStorage` (run right after
`parseIntoCaseSearchStorage` writes the HQ response) and the moment the
`CaseInstanceTreeElement` is iterated by the entity screen, the SQLite
table backing the `results` instance has accumulated extra rows.

Likely mechanisms:
1. **Concurrent request pollution** — multiple requests share the same
`formplayer_case_<hash>` table and race: A's `removeAll()` +
`parseIntoSandbox` overlaps with B's, or A reads after B has written
different data into the same table.
2. **Cache-key collision** — two logically-different searches produce
the same `cacheKey` → same `caseSearchTableName`. Different search
queries (different case_types) would pile up in one table.
3. **`removeAll()` not clearing** — the per-table `removeAll()` leaves
rows, so each request adds to a growing set.

1189 cases / 184 per request ≈ 6 requests of accumulated data, which is
consistent with either mechanism.

### Added logging (commit TBD)

`CaseSearchHelper.getExternalRoot` now logs just before returning the
`CaseInstanceTreeElement` to the caller:

```
USH-6370 getExternalRoot returning tree: table=<hash> cacheKeyHash=<hex>
storageCount=<N> caseTypes=[...]
```

- If `storageCount` != the count validated a few lines earlier in the
same call, the write/read gap is where the pollution enters.
- If two requests log the **same** `table` but **different** `cacheKeyHash`,
cache-key collision (mechanism 2) is confirmed.
- If two concurrent requests log the same table + same key but with
different case counts/types in quick succession, mechanism 1.

## Possible root cause

Walking through the code:

1. CaseChildElement.cache() calls parent.getElement(recordId, context) to get case data (libs/commcare/.../CaseChildElement.java:100)
2. StorageInstanceTreeElement.getElement() checks a RecordObjectCache keyed by getStorageCacheName() + recordId (StorageInstanceTreeElement.java:285-314)
3. CaseInstanceTreeElement.getStorageCacheName() always returns "casedb" (hard-coded to MODEL_NAME) — regardless of which storage the instance wraps (CaseInstanceTreeElement.java:145-147)

The collision: The user's casedb instance AND the results (case-search) instance are BOTH CaseInstanceTreeElement, BOTH return "casedb" from getStorageCacheName(). They share cache keys in
the RecordObjectCache.

What this causes:
- When the real casedb instance bulkReads records (say, recordId=42), the cache stores ("casedb", 42) → someCase — pulled from the user's case table
- When the results instance later asks getElement(42), it hits the cache and gets the casedb's case, NOT its own search-table case
- The two storages are different SQLite tables with independent recordId spaces, but the cache doesn't know that

Why getExternalRoot snapshot saw all-capacity:
- My storage.iterate() hits the SqlStorage directly, bypassing RecordObjectCache
- So it reports the actual table content: 1189 capacity cases

Why the inventory / entity list saw 5 types:
- CaseChildElement.cache() goes through getElement(recordId, context) → hits RecordObjectCache → returns polluted (casedb) data
- The user's casedb has 5 case types, so the polluted results reflect that

The fix (commcare-core):
CaseInstanceTreeElement.getStorageCacheName() must return a key that's unique per storage, not the generic "casedb". Simplest: include the table name or storage identity in the key. E.g.,
return something like "casedb:" + storage.tableName or base it on the instance's getReference().
78 changes: 78 additions & 0 deletions USH-6370-case_search_bug/root_cause_analysis.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# USH-6370 Root Cause: RecordObjectCache key collision

## Summary

When case search results are displayed via an entity list, the `results` data instance (case search) and the `casedb` data instance (user's full case database) are both backed by `CaseInstanceTreeElement`. Both return `"casedb"` from `getStorageCacheName()`, which is used as the key prefix in `RecordObjectCache`. When both instances share the same `QueryContext` during nodeset evaluation, their record lookups collide — one instance reads the other's data from the cache.

## Affected code

- `CaseInstanceTreeElement.getStorageCacheName()` — always returns `"casedb"` (hard-coded `MODEL_NAME`)
- `StorageInstanceTreeElement.getElement(int recordId, QueryContext context)` — uses `getStorageCacheName() + recordId` as cache key in `RecordObjectCache`
- `RecordObjectCache` — per-`QueryContext` cache, shared when two instances are evaluated in the same context scope

## Evidence

### 1. Storage is clean, tree reads are not

Across all failure runs, `getExternalRoot` iterates the SQLite storage directly (`storage.iterate()`) and reports all cases as `capacity`. But the inventory (which reads via `CaseChildElement.cache()` → `getElement()` → `RecordObjectCache`) reports mixed types from the same instance in the same request:

| Checkpoint | Path | Result |
|---|---|---|
| `getExternalRoot` (SqlStorage.iterate) | Direct SQL read | `storageCount=1189 caseTypes=[capacity]` |
| Inventory (CaseChildElement.cache) | Via RecordObjectCache | `{capacity=1047, clinic=2, commcare-case-claim=138, provider=1, unit=1}` |

Same backing storage, same request, same count (1189) — different case types. The divergence is in the cache layer.

### 2. Both instances share the cache key `"casedb"`

Inventory confirmed on every run:

```
casedb ... storageCacheName=casedb
results ... storageCacheName=casedb
```

### 3. Cross-instance cache read on failure, separate caches on success

`RecordObjectCache` bulk-load and hit events captured via `ThreadLocal` diagnostic log:

**Failure run** — two tree elements share `cache=1061465914`:

```
BULK_LOAD key=casedb te=1529968747 storage=1095096645 cache=1061465914 count=1189
HIT key=casedb te=1808671756 storage=1945215430 cache=1061465914 recordId=313
```

`te=1529968747` (one storage) populated the cache. `te=1808671756` (different storage) read from it. **Cross-instance read confirmed.**

**Success run** — each tree element gets its own cache:

```
BULK_LOAD key=casedb te=1107500281 cache=1691588858 count=1189
BULK_LOAD key=casedb te=732472401 cache=2112698688 count=1189
```

Different `cache=` hashes. No cross-instance events.

### 4. Intermittency explained

The collision only occurs when both instances share the same `QueryContext` (and thus the same `RecordObjectCache`). Whether this happens depends on how the evaluation context forks query contexts during nodeset evaluation — which varies by execution path and request state. This is why identical requests from the test script produce different outcomes.

## Fix

`CaseInstanceTreeElement.getStorageCacheName()` must return a key unique per storage instance, not the generic `"casedb"`. The underlying SqlStorage table name is the natural choice — each case search query gets its own table, and the user's casedb has its own.

Example:
```java
// Before (collision):
public String getStorageCacheName() {
return CaseInstanceTreeElement.MODEL_NAME; // always "casedb"
}

// After (unique per storage):
public String getStorageCacheName() {
return CaseInstanceTreeElement.MODEL_NAME + ":" + storage.getStorageIdentifier();
}
```

Where `getStorageIdentifier()` returns the table name for SQL-backed storage.
Loading
Loading