Skip to content

Price histogram for LOWEST_PRICE entities ignores inner-record granularity #1159

@novoj

Description

@novoj

Summary

The priceHistogram extra-result currently treats every entity using PriceInnerRecordHandling.LOWEST_PRICE as a single data point — the entity's price-for-sale (i.e., the lowest winning price across its inner records). This collapses all per-inner-record price variation into one bucket contribution and produces a histogram that does not reflect the actual price distribution of the catalog.

This issue tracks fixing this so that for LOWEST_PRICE entities the histogram receives one data point per inner-record-id (selected by price-list priority, after individualPricePredicate is applied). NONE (all matching prices per entity) and SUM (one cumulated price per entity) keep their current semantics.

Why this is a bug, not a new mode

A LOWEST_PRICE entity (e.g. a master product with multiple variants) carries a family of prices — one per inner-record-id. Each inner-record price is a real, sellable price that the storefront can render. The histogram is supposed to answer the question "what prices are reachable in this candidate pool?" — a UI that uses it to draw a price-slider needs to see those per-variant prices, not just one representative value per master.

Today the histogram answers a strictly narrower question: "what are the prices-for-sale of the matching entities?". That is an artifact of code reuse (the same LowestPriceTerminationFormula.filteredPriceRecords is used for filtering, sorting, and histogram), not an intentional design choice. There is no requirement or use-case where the current behavior is the correct one — the bucket distribution it produces is systematically biased toward the cheapest variant of each entity.

Therefore: this is a defect in the histogram, not a new opt-in behavior. We will not add a new HistogramBehavior enum value or a request-level switch. The behavior simply changes — and is published as a breaking semantic change in the release notes (bucket counts and thresholds against LOWEST_PRICE catalogs will shift after upgrade; clients with hard-coded expectations must re-baseline).

Business goals

  • Storefront UIs that render a price slider from priceHistogram show representative price distributions for catalogs of master/variant products, not "one price per master" distributions that mask the spread.
  • Faceted price filtering driven by histogram output produces buckets that correspond to actual purchasable prices rather than an arbitrary projection.
  • Consistency with SUM and NONE: each strategy contributes the prices that genuinely characterize the entity in that handling mode.

Current behavior — verified evidence

  • PriceHistogramTranslator collects FilteredPriceRecordAccessor instances with LookUp.SHALLOW, so it stops at LowestPriceTerminationFormula near the top of the price subtree.
  • LowestPriceTerminationFormula.computeInternal() already builds a per-inner-record map keyed by innerRecordId, populated in price-list priority order and gated by individualPricePredicate. It then collapses that map to a single lowestPrice per entity and stores only that one in filteredPriceRecords. The per-inner-record values — exactly the data the histogram needs — are computed and discarded today; the fix exposes them rather than recomputing them.
  • PriceHistogramComputer.getPriceRecords() consumes the entity-level array via FilteredPriceRecordsCollector, or reuses the FilteredPricesSorter.getPriceRecordsLookupResult() when a sorter is present (also entity-level only).

Cache flattening payloads involved:

  • FlattenedFormulaWithFilteredPrices — emitted by PlainPriceTerminationFormulaWithPriceFilter (NONE, SUM). Not touched by this issue.
  • FlattenedFormulaWithFilteredPricesAndFilteredOutRecords — emitted by LowestPriceTerminationFormula. Stays as-is for non-histogram queries; a new sibling class is introduced for histogram queries (see mechanics below).

Required behavior

For each scope/scenario:

PriceInnerRecordHandling Histogram data points per entity (after change)
NONE All matching prices (unchanged)
SUM One cumulated price per entity (unchanged)
LOWEST_PRICE One per inner-record-id — the per-inner-record winner selected by price-list priority, after individualPricePredicate
UNKNOWN n/a (data not fetched)

Rough implementation mechanics

This is a sketch — a detailed plan will be authored from this assignment.

1. Planner-driven flag on LowestPriceTerminationFormula

Add a constructor parameter boolean collectPerInnerRecordPrices. Set to true by the planner only when PriceHistogramTranslator.createProducer(...) runs — i.e., only when a priceHistogram requirement is present. The flag participates in the formula's getHash() so cached and live variants don't collide.

This is the load-bearing piece for performance: it ensures the 99% of LOWEST_PRICE queries that don't request a histogram pay zero extra cost.

2. Side-output funnel populated only when flag is true

Inside computeInternal(), while we already iterate entityInnerRecordPrice.values() per entity to find the lowest, also append every value (per-inner-record winner) to a second CompositeObjectArray<PriceRecordContract>gated on the flag. After the loop, materialize into a new ResolvedFilteredPriceRecords perInnerRecordPriceRecords field parallel to filteredPriceRecords. The entityInnerRecordPrice values already respect individualPricePredicate (line 355), so the side-output inherits that filter for free. sellingPricePredicate is not applied — it's the price-between user filter, already relaxed by UserFilterRelaxer.relax in PriceHistogramProducer.fabricate(...).

3. New accessor method

Add FilteredPriceRecords getFilteredPriceRecordsForHistogram(QueryExecutionContext ctx) to FilteredPriceRecordAccessor with a default delegating to getFilteredPriceRecords(...). Override only on:

  • LowestPriceTerminationFormula — returns the new side-output when flag set.
  • The new flattened cache class (see step 5).

PlainPriceTerminationFormulaWithPriceFilter (NONE, SUM) and SumPriceTerminationFormula use the default → no behavior change for them.

4. PriceHistogramComputer rewired

For the simple branch (no filteringFormulaWithFilteredOutRecords): skip FilteredPriceRecordsCollector entirely for LowestPriceTerminationFormula accessors and read accessor.getFilteredPriceRecordsForHistogram(ctx).getPriceRecords() directly. The funnel only contains entities that already passed the predicate (lines 379-386 only push when anyPriceMatchesTheFilter), so no second walk is needed.

For the relaxed-tree branch: use a new histogram-flavored collector (collectFilteredPriceRecordsFromPriceRecordAccessorsForHistogram) on the remainder bitmap only.

The priceRecordsLookupResult shortcut (when a FilteredPricesSorter is present) must be skipped for LOWEST_PRICE, because the sorter's cached lookup holds entity-level prices only.

5. New cache flattening class — not an extension

Introduce a new sibling class to FlattenedFormulaWithFilteredPricesAndFilteredOutRecords (working name: FlattenedFormulaWithFilteredPricesForHistogram). It carries an additional PriceRecord[] perInnerRecordPriceRecords array and overrides the new accessor method to return it.

LowestPriceTerminationFormula.toSerializableFormula(...) chooses which flattened payload to emit based on the flag:

  • Flag false → existing FlattenedFormulaWithFilteredPricesAndFilteredOutRecords (untouched, no serialization-version bump, no eviction).
  • Flag true → new FlattenedFormulaWithFilteredPricesForHistogram.

The new class owns its own size estimator that charges perInnerRecordPriceRecords.length * PRICE_RECORD_SIZE correctly so the formula cache byte budget is not silently exceeded.

Net effect on cache:

  • Non-histogram LOWEST_PRICE queries: zero change. Existing cache entries remain valid across upgrade.
  • Histogram LOWEST_PRICE queries: separate cache slot (different hash because of the flag), larger payload that correctly reflects what's stored.

6. Tests

  • Per-handling-mode functional test: an entity with N inner-record prices under LOWEST_PRICE produces N histogram data points; SUM produces one cumulated value; NONE produces all matching prices.
  • Mixed-catalog test: histogram bucket counts/thresholds across mixed handling modes match expectations.
  • Cache test: histogram query is correctly served from cache on second invocation; non-histogram query against the same filter still hits the unchanged cache slot.
  • Audit existing histogram tests against LOWEST_PRICE fixtures — many will need updated expectations.

7. Release notes / breaking change entry

Mark as breaking change. Document that priceHistogram over LOWEST_PRICE catalogs now reflects per-inner-record prices, and bucket distributions will shift. Storefronts with hardcoded bucket expectations or end-to-end snapshot tests must rebaseline.

Performance considerations

LowestPriceTerminationFormula.computeInternal() is on the hot path of every priceForSale-bearing query, not just histogram queries. The implementation plan must be vigilant about:

  • Memory allocations. The side-output adds up to N entries per entity (N = number of inner records) where today there is 1. Allocation patterns must amortize cleanly — chunked buffers over growing/copying arrays, reuse of intermediate maps already in scope, reference sharing instead of deep copies.
  • Algorithmic complexity. The fix must not add a second pass over the entity set or duplicate the per-inner-record iteration already happening. The data needed for the histogram is computed today and discarded — the goal is to expose it, not recompute it.
  • No cost when not asked for. Queries that don't request a priceHistogram must pay zero extra cost (no extra allocations, no extra branches in the inner loop, no growth of the cached formula payload). This drives the planner-flag and separate-flattened-class decisions above.
  • Cache impact. The formula cache is sized in bytes; any payload growth must be reflected in the size estimator. Existing cache entries for non-histogram queries must remain valid across upgrade — no serialization-version bump on shared types.

A focused performance review of the initial sketch produced the design choices documented in the mechanics section (planner flag, direct read of side-output, new sibling flattened class). The implementation plan must preserve those invariants and validate them with allocation profiling and microbenchmarks on a representative LOWEST_PRICE catalog before merge.

Alternatives explicitly considered and rejected

  • Flat int[] + per-entity offsets instead of PriceRecordContract[]. Rejected: the histogram cruncher consumes both price and entity PK from PriceRecordContract; refactoring its API saves negligible memory given reference copies are 8 bytes.
  • Always populate the side-output, skip the flag. Rejected: regresses every non-histogram LOWEST_PRICE query.
  • Extend the existing flattened class with an optional field. Rejected in favor of a new sibling class — keeps the non-histogram cache path byte-for-byte identical.
  • Cache-miss-on-histogram recompute instead of carrying the per-inner-record array in the cache payload. Rejected: defeats the formula cache for histogram queries entirely.

Acceptance criteria

  • For LOWEST_PRICE entities, the histogram reflects per-inner-record winning prices (one per inner-record-id, selected by price-list priority, after individualPricePredicate).
  • NONE and SUM behavior is unchanged (regression-tested).
  • Non-histogram LOWEST_PRICE queries show no measurable latency or allocation regression in benchmarks.
  • Formula cache entries for non-histogram LOWEST_PRICE queries are unchanged (no serialization-version bump on FlattenedFormulaWithFilteredPricesAndFilteredOutRecords).
  • Formula cache size estimator for the new histogram-flavored flattened class correctly charges the per-inner-record array.
  • Release notes include a breaking-change entry documenting the shift in histogram buckets for LOWEST_PRICE catalogs.

Metadata

Metadata

Assignees

Labels

breaking changeBackward incompatible data model changebugSomething isn't workingjavaPull requests that update java codeperformancePerformance problem.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions