Datafn query execution: IN_MEMORY classification for groupBy/having and search, and the path to aggregate pushdown #90

@thyaravind

Context

This note captures the current behavior in the datafn/server query planner and execution pipeline, based on tracing the implementation in the codebase.

The original questions were:

  • Why does the server use IN_MEMORY for groupBy and having?
  • Why does the server use IN_MEMORY for search? Is the searchfn/search provider path being skipped when available?
  • What would it take to support true aggregate pushdown instead of in-memory aggregation?

Short Answer

IN_MEMORY in the current planner means the query cannot be expressed entirely through the adapter pushdown path that exists today. It does not necessarily mean a provider is ignored or that the feature is unsupported.

For groupBy and having:

  • The server has no aggregate pushdown API at the adapter boundary.
  • Aggregate execution is implemented in-memory by loading records, grouping them in JavaScript, computing aggregates, then applying having as a post-aggregation filter.

For search:

  • The planner still classifies search queries as IN_MEMORY.
  • But at execution time, the server does use the configured searchProvider when present.
  • The search provider returns candidate IDs first, and then the server runs the normal DFQL execution pipeline over those candidates.
  • So search retrieval can be provider-backed, while final DFQL shaping still happens in the store-backed/in-memory execution path.

Where The Classification Happens

The planner explicitly classifies the following as IN_MEMORY:

  • search
  • groupBy
  • having
  • relation-heavy filter patterns such as $any, $all, $none
  • dot-path filters
  • HTREE select tokens

This is the key current logic:

if (query.search) return "IN_MEMORY";
if (query.groupBy || query.having) return "IN_MEMORY";

The important implication is that the planner is not asking whether the operation is semantically valid. It is asking whether the query can be executed by the existing pushdown machinery.
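A minimal sketch of that classification step, using a hypothetical simplified `DfqlQuery` shape (the field names here are illustrative, not the real planner types):

```typescript
// Hypothetical, simplified shape of a DFQL query for illustration only.
interface DfqlQuery {
  search?: string;
  groupBy?: string[];
  having?: Record<string, unknown>;
}

type Strategy = "FULL_PUSHDOWN" | "PARTIAL_PUSHDOWN" | "IN_MEMORY";

// Mirrors the planner logic quoted above: the question is
// expressibility through pushdown, not semantic validity.
function classify(query: DfqlQuery): Strategy {
  if (query.search) return "IN_MEMORY";
  if (query.groupBy || query.having) return "IN_MEMORY";
  // Relation-heavy filters ($any/$all/$none), dot-path filters, and
  // HTREE select tokens would also land in IN_MEMORY in the real planner.
  return "FULL_PUSHDOWN";
}
```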

Why groupBy / having Are In-Memory Today

1. The pushdown path only knows how to do row retrieval, not grouped aggregation

The current pushdown implementation is built around:

  • converting DFQL filters to adapter where clauses
  • converting DFQL sort to adapter orderBy
  • calling db.findMany(...)
  • optionally calling db.count(...)
  • projecting fields afterward

That means pushdown today is designed around record retrieval, not grouped result production.
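The retrieval-only nature of that path can be sketched as follows; the `Db` interface here is a stand-in for the adapter surface described above, not the actual contract:

```typescript
// Hypothetical adapter surface; method names follow the bullets above.
interface Db {
  findMany(args: { where?: object; orderBy?: object }): Promise<Record<string, unknown>[]>;
  count(args: { where?: object }): Promise<number>;
}

// Sketch of the retrieval-only pushdown path: fetch rows, then project.
// There is no step anywhere in this flow that produces grouped rows.
async function pushdownFindMany(
  db: Db,
  where: object,
  orderBy: object,
  select: string[],
): Promise<Record<string, unknown>[]> {
  const rows = await db.findMany({ where, orderBy });
  // Field projection happens in-process, after retrieval.
  return rows.map((row) =>
    Object.fromEntries(select.map((field) => [field, row[field]])),
  );
}
```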

There is no adapter method in the current contract that expresses:

  • GROUP BY
  • aggregate functions like count, sum, avg, min, max
  • HAVING
  • aggregate-aware sorting
  • grouped pagination semantics

Because that capability is missing at the adapter boundary, the server has nowhere to push the aggregate query down to.

2. The actual aggregate executor is explicitly in-memory

Aggregate query execution currently does the following:

  1. starts from loaded records
  2. applies normal filters in-memory
  3. groups records into a Map
  4. computes each aggregation over the grouped records
  5. applies having as a filter over the grouped rows
  6. sorts grouped rows
  7. applies pagination / cursor logic

This is an in-process aggregate engine, not a translation layer to the DB.
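The steps above can be condensed into a sketch like this (a single group key and a single `count` aggregate, for brevity; the real executor is more general):

```typescript
type Row = Record<string, unknown>;

// Minimal sketch of the in-process aggregate engine described above:
// group rows into a Map, compute aggregates per bucket, then apply
// `having` as a filter over the post-aggregation rows.
function groupAndAggregate(
  rows: Row[],
  groupBy: string,
  havingMinCount: number,
): Array<{ key: unknown; count: number }> {
  const groups = new Map<unknown, Row[]>();
  for (const row of rows) {
    const key = row[groupBy];
    const bucket = groups.get(key) ?? [];
    bucket.push(row);
    groups.set(key, bucket);
  }
  const grouped = [...groups.entries()].map(([key, bucket]) => ({
    key,
    count: bucket.length, // count = number of records, not non-null fields
  }));
  // `having` runs over rows that only exist after grouping/aggregation.
  return grouped.filter((g) => g.count >= havingMinCount);
}
```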

3. having depends on post-aggregation rows

The current having behavior is implemented over rows that only exist after grouping and aggregation. In the current architecture, those rows are not represented in the adapter API. So having cannot be pushed independently; it is coupled to the lack of aggregate pushdown overall.

Why search Is Also Classified As IN_MEMORY

1. Search is treated as a two-phase operation

The current search pipeline works like this:

  1. obtain candidate IDs using either:
    • the configured searchProvider, or
    • a DB-native full-text fallback when the adapter reports native support
  2. merge the candidate IDs into a filter like id in [...]
  3. re-run the main DFQL execution path for filtering, shaping, sorting, pagination, and projection

So the planner marks search as IN_MEMORY because the query is not fully expressible via the normal adapter pushdown path alone.
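The two-phase flow can be sketched like this; `SearchProvider` and `runDfql` are illustrative stand-ins for the real interfaces:

```typescript
// Hypothetical provider interface: text in, candidate IDs out.
interface SearchProvider {
  search(text: string): Promise<string[]>;
}

// Sketch of the two-phase flow: candidate IDs first, then the normal
// DFQL pipeline runs with an `id in [...]` filter merged in.
async function executeSearch(
  provider: SearchProvider,
  text: string,
  baseFilter: Record<string, unknown>,
  runDfql: (filter: Record<string, unknown>) => Promise<unknown[]>,
): Promise<unknown[]> {
  const candidateIds = await provider.search(text);
  if (candidateIds.length === 0) return [];
  const merged = { ...baseFilter, id: { in: candidateIds } };
  return runDfql(merged);
}
```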

2. The provider is still used when available

The server-side execution branch does not skip the provider. It checks:

  • if searchProvider exists, call it
  • else if DB native full-text support exists, use that fallback
  • else return unsupported

That means the provider is very much part of execution when configured; only the planner's label is coarse.

3. Why the classification still says IN_MEMORY

The label is really about the overall execution strategy, not just candidate retrieval.

Even when the provider is used, the server still performs a follow-up DFQL execution step over the matched candidates. That is why search remains in the IN_MEMORY bucket in the current planner design.

Important Clarification About Validation

There is some stale wording in validation comments/messages that refers to requiring a searchfn plugin. The actual server configuration path is based on searchProvider, with DB-native fallback recognized as well.

So conceptually:

  • old wording suggests a plugin-only model
  • current implementation supports a pluggable searchProvider and DB-native full-text fallback

The execution wiring reflects the newer model even if some comments still use older language.

What The Current Architecture Is Optimized For

The present architecture cleanly supports three broad strategies:

  • full pushdown for simple base-resource queries
  • partial pushdown for queries where the primary resource can be fetched efficiently but relation expansion must happen afterward
  • in-memory/store-backed execution for everything that requires more expressive semantics than the adapter contract currently exposes

This is a reasonable design boundary. The issue is not that aggregation or search are invalid. The issue is that the adapter contract and pushdown layer are narrower than the full DFQL language.

What Would Need To Change For True Aggregate Pushdown

1. Extend the adapter contract

The first necessary change is at the database adapter boundary. A new aggregate-oriented method would be needed, something conceptually like:

  • aggregate(params)

That method would need to express:

  • model
  • where
  • groupBy
  • aggregations
  • having
  • orderBy
  • limit
  • offset
  • namespace
  • possibly cursor semantics later

Without this, the server has no stable way to ask an adapter for grouped results.
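One possible shape for that contract, written as TypeScript types, is shown below. All names here are proposals for discussion, not an existing @superfunctions/db API:

```typescript
// Proposed, hypothetical aggregate contract for the adapter boundary.
type AggregateOp = "count" | "sum" | "avg" | "min" | "max";

interface AggregateParams {
  model: string;
  where?: Record<string, unknown>;
  groupBy: string[];
  // alias -> operation; `field` is optional for record-counting `count`
  aggregations: Record<string, { op: AggregateOp; field?: string }>;
  having?: Record<string, unknown>;
  orderBy?: Array<{ target: string; direction: "asc" | "desc" }>;
  limit?: number;
  offset?: number;
  namespace?: string;
  // cursor semantics intentionally deferred, per the text above
}

type AggregateRow = Record<string, unknown>;

interface AggregateCapableAdapter {
  aggregate(params: AggregateParams): Promise<AggregateRow[]>;
}
```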

2. Add adapter capability flags

As with native full-text support today, aggregate support should be feature-detected. Examples:

  • capabilities.operations.aggregate
  • potentially finer-grained flags such as:
    • supports having
    • supports aggregate sort by alias
    • supports base-field groupBy only
    • supports dot-path or relation-aware grouping

This matters because not every adapter will support the full surface area at once.
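A capability shape extending the existing `operations.fulltext` pattern might look like this; none of the aggregate-related names exist yet and all are illustrative:

```typescript
// Illustrative capability shape. `operations.fulltext` mirrors the
// flag described earlier in this note; the rest is proposed.
interface AdapterCapabilities {
  operations: {
    fulltext?: boolean;  // exists today in concept
    aggregate?: boolean; // proposed coarse flag
  };
  aggregateFeatures?: {
    having?: boolean;
    sortByAlias?: boolean;
    groupedPagination?: boolean;
    baseFieldGroupByOnly?: boolean;
  };
}

function supportsAggregate(caps: AdapterCapabilities): boolean {
  return caps.operations.aggregate === true;
}
```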

3. Add a new server execution path

The server would need a new execution branch for aggregate-capable adapters.

Instead of always sending groupBy / having to the in-memory executor, the server would do something like:

  • if query has aggregate semantics and adapter supports aggregate pushdown, execute aggregate pushdown
  • otherwise keep the current in-memory fallback

This preserves compatibility while allowing adapters to opt in incrementally.
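The branch itself is small; a sketch with hypothetical names:

```typescript
// Sketch of the opt-in branch: push down when the adapter can,
// otherwise keep the existing in-memory executor as the fallback.
async function executeAggregateQuery<Q, R>(
  query: Q,
  adapterSupportsAggregate: boolean,
  pushdown: (q: Q) => Promise<R>,
  inMemoryFallback: (q: Q) => Promise<R>,
): Promise<R> {
  if (adapterSupportsAggregate) {
    return pushdown(query);
  }
  return inMemoryFallback(query);
}
```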

4. Build DFQL-to-adapter aggregate translation

The server would need a translation layer analogous to the existing filter/sort pushdown converter. That translation would need to handle:

  • pre-group filters -> adapter where
  • groupBy
  • aggregate definitions and aliases
  • having on grouped fields and aggregate aliases
  • sort terms that may reference group keys or aggregate aliases
  • grouped pagination semantics
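The translation itself would be a mapping analogous to the sketch below; the `DfqlAggregateQuery` shape and the output field names are hypothetical:

```typescript
// Hypothetical DFQL-side aggregate query shape, for illustration.
interface DfqlAggregateQuery {
  resource: string;
  filter?: Record<string, unknown>; // pre-group filter
  groupBy: string[];
  aggregate: Record<string, { op: string; field?: string }>;
  having?: Record<string, unknown>;
}

// Sketch of translating DFQL aggregate semantics into an
// adapter-level aggregate call, mirroring the bullets above.
function translateAggregate(q: DfqlAggregateQuery) {
  return {
    model: q.resource,
    where: q.filter,        // pre-group filters -> adapter where
    groupBy: q.groupBy,
    aggregations: q.aggregate,
    having: q.having,       // valid only on group keys / aggregate aliases
  };
}
```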

5. Preserve Datafn semantics exactly

One subtle but important requirement: the pushdown path must preserve the existing Datafn aggregate semantics.

The current shared aggregate utility intentionally defines count as counting records, not non-null field values. SQL adapters often distinguish:

  • COUNT(*)
  • COUNT(field)

If the new pushdown path is implemented carelessly, it could silently diverge from current in-memory results. So aggregate pushdown must match the current behavior exactly, even when the underlying SQL engine has different defaults.
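The divergence is easy to see in miniature. In the sketch below, `countRecords` reflects the record-counting semantics described above (i.e. `COUNT(*)`), while `countNonNull` reflects SQL's `COUNT(field)` behavior:

```typescript
type Row = Record<string, unknown>;

// Datafn's shared `count` counts records; SQL COUNT(field) skips nulls.
// A pushdown must emit COUNT(*) semantics to match the in-memory path.
function countRecords(rows: Row[]): number {
  return rows.length; // COUNT(*) semantics
}

function countNonNull(rows: Row[], field: string): number {
  return rows.filter((r) => r[field] != null).length; // COUNT(field) semantics
}
```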

Recommended Scope For A First Aggregate Pushdown Version

A practical first version would likely support only the simplest safe subset:

  • base-resource groupBy only
  • no relation traversal in groupBy
  • no relation expansion in select for aggregate queries
  • having only on group keys and aggregate aliases
  • sort by group keys and aggregate aliases only
  • limit/offset pagination before tackling full cursor semantics

This would unlock meaningful performance improvements without forcing the entire DFQL relation model into the first adapter API revision.

Why Keeping The In-Memory Fallback Still Matters

Even after aggregate pushdown exists, the in-memory path should remain for:

  • adapters with no aggregate support
  • memory or test adapters
  • complex aggregate queries the adapter cannot express yet
  • relation-aware grouping cases beyond the initial scope

That keeps the language portable while allowing more capable adapters to accelerate common cases.

Practical Architectural Reading

The current system should be read as:

  • FULL_PUSHDOWN: pure adapter-supported retrieval query
  • PARTIAL_PUSHDOWN: adapter-supported base retrieval plus relation expansion afterward
  • IN_MEMORY: a broader execution bucket for queries whose semantics exceed the current adapter pushdown contract

That last category includes both:

  • true in-memory computation like grouped aggregation
  • hybrid flows like provider-backed search plus store-backed DFQL shaping

So the label is slightly broader than the literal phrase suggests.

Why @superfunctions/db Is The Real Boundary

This point deserves to be explicit because it changes where the fix should live.

The root cause is not just that datafn/server lacks aggregate pushdown code. The deeper cause is that the upstream adapter contract in @superfunctions/db does not currently expose any grouped query primitive.

Today, the Adapter interface supports:

  • findOne
  • findMany
  • count
  • CRUD and batch operations
  • transactions
  • schema/lifecycle methods

It does not support anything like:

  • aggregate(...)
  • groupBy(...)
  • grouped result rows
  • aggregate-aware having
  • grouped ordering and grouped pagination

That means datafn/server cannot cleanly implement true aggregate pushdown on its own without either:

  • extending @superfunctions/db, or
  • bypassing the adapter abstraction and re-implementing DB-specific grouped query logic inside Datafn

The second option would be the wrong design move. It would duplicate responsibilities, create adapter-specific logic in the wrong layer, and make it much harder to preserve consistent semantics across adapters.

Why This Matters For Anyone Fixing The Issue

A future implementation should not start by adding aggregate SQL directly inside datafn/server pushdown code as a one-off workaround.

That would likely cause all of the following:

  • duplication of query capabilities that belong in @superfunctions/db
  • inconsistent behavior between adapters
  • difficulty supporting non-SQL or partial-capability adapters
  • risk of semantic drift from the in-memory aggregate implementation already shared by Datafn

The correct architectural reading is:

  • @superfunctions/db defines what can be pushed down
  • datafn/server translates DFQL into that contract
  • if the contract does not express grouped queries, Datafn must fall back to in-memory aggregation

Why Search Looks Different

Search already has a narrow foothold in the @superfunctions/db capability model because adapter capabilities include operations.fulltext.

That allows Datafn to distinguish:

  • provider-backed search
  • DB-native full-text fallback
  • unsupported adapters

There is no equivalent aggregate capability today. That is a strong signal that aggregate pushdown is not merely missing in Datafn, but missing at the adapter layer itself.

Semantics Risk If This Is Implemented In The Wrong Layer

The shared aggregate semantics in Datafn are not automatically the same as native SQL semantics.

A good example is count:

  • the current shared aggregate utility defines count as counting records
  • many SQL implementations distinguish COUNT(*) from COUNT(field) and handle nulls differently

If grouped queries are implemented ad hoc only inside Datafn pushdown logic, different adapters could easily end up with subtly different results.

Putting aggregate support into @superfunctions/db first creates a single place to define and normalize those semantics.

Package-By-Package Implementation Checklist

1. packages/db

This is the first place that should change.

Add an aggregate query contract to the adapter system, for example:

  • aggregate(params: AggregateParams): Promise<AggregateRow[]>

Define the supporting types for:

  • group keys
  • aggregate operations and aliases
  • having
  • grouped orderBy
  • limit/offset
  • namespace
  • optional future cursor semantics

Extend adapter capabilities with at least:

  • operations.aggregate

Potentially add finer-grained capability flags if needed, such as:

  • supports having
  • supports aggregate sort by alias
  • supports grouped pagination
  • supports only base-field groupBy

2. packages/db adapter implementations

After the contract exists, implement it in the concrete adapters where feasible:

  • Drizzle adapter
  • Kysely adapter
  • Prisma adapter
  • memory adapter as a reference/fallback implementation if it makes sense

The implementation must preserve Datafn’s existing aggregate semantics rather than leaking raw ORM/SQL defaults.

3. datafn/server

Only after the adapter contract exists should Datafn’s pushdown path be extended.

Changes would include:

  • adding DFQL-to-adapter aggregate translation
  • branching to aggregate pushdown when the adapter reports support
  • keeping the current in-memory aggregate executor as fallback
  • updating planner classification so groupBy/having are not unconditional IN_MEMORY when aggregate pushdown is available

4. datafn/core

Keep datafn/core as the source of shared aggregate semantics.

If new aggregate behavior is introduced, validate whether the shared utility remains the semantic reference for:

  • count
  • null handling
  • alias behavior
  • aggregate result row shape

That way pushdown and in-memory paths stay aligned.

5. Docs and tests

Update docs and tests in both layers:

  • @superfunctions/db docs for the new adapter API
  • Datafn docs for aggregate pushdown behavior and fallback behavior
  • adapter contract tests for grouped queries
  • Datafn integration tests that assert pushdown and in-memory paths return the same results
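The last bullet's equivalence check could take roughly this shape; `GroupedRow` and the helpers are illustrative, and the normalization step exists so ordering differences between backends do not cause false failures:

```typescript
// Sketch of the equivalence assertion the integration tests should
// make: pushdown and in-memory paths return identical grouped rows.
type GroupedRow = { key: string; count: number };

// Sort deterministically so backend ordering differences are ignored.
function normalize(rows: GroupedRow[]): GroupedRow[] {
  return [...rows].sort((a, b) => a.key.localeCompare(b.key));
}

function assertSameResults(pushdown: GroupedRow[], inMemory: GroupedRow[]): void {
  const a = JSON.stringify(normalize(pushdown));
  const b = JSON.stringify(normalize(inMemory));
  if (a !== b) throw new Error(`pushdown/in-memory divergence: ${a} vs ${b}`);
}
```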

Guardrail For Future Work

A good rule for anyone implementing this:

If the feature requires expressing a new class of database query, first ask whether @superfunctions/db can represent it.

If the answer is no, the work should begin in packages/db, not in datafn/server.

Conclusion

The current behavior is internally consistent:

  • groupBy and having are IN_MEMORY because grouped aggregation is implemented entirely in the server and the adapter contract has no aggregate query primitive.
  • search is also classified as IN_MEMORY, but that does not mean the provider is ignored. The provider is used to retrieve candidate IDs; the final DFQL query is then executed over those candidates in the normal server pipeline.
  • The missing piece is not only in Datafn. The real boundary is @superfunctions/db, which currently lacks an aggregate query contract and aggregate capability flags.
  • True aggregate pushdown should therefore start in packages/db, then be consumed by datafn/server, rather than being reimplemented as duplicate DB logic in Datafn.

Suggested Follow-Up

If we want to take this from analysis to implementation, the clean next step is a small design/spec pass that defines:

  • the aggregate adapter method signature in @superfunctions/db
  • capability flags in @superfunctions/db
  • exact semantics for count, having, sorting, and pagination
  • the minimum supported aggregate subset for v1
  • the Datafn-side execution and fallback plan once the adapter contract exists

That would let the implementation proceed without ambiguity, without duplication across layers, and without accidental semantic drift from the current in-memory executor.
