Datafn query execution: IN_MEMORY classification for groupBy/having and search, and the path to aggregate pushdown #90

@thyaravind

Context

This note captures the current behavior in the datafn/server query planner and execution pipeline, based on tracing the implementation in the codebase.

The original questions were:

  • Why does the server use IN_MEMORY for groupBy and having?
  • Why does the server use IN_MEMORY for search? Is the searchfn/search provider path being skipped when available?
  • What would it take to support true aggregate pushdown instead of in-memory aggregation?

Short Answer

IN_MEMORY in the current planner means the query cannot be expressed entirely through the adapter pushdown path that exists today. It does not necessarily mean a provider is ignored or that the feature is unsupported.

For groupBy and having:

  • The server has no aggregate pushdown API at the adapter boundary.
  • Aggregate execution is implemented in-memory by loading records, grouping them in JavaScript, computing aggregates, then applying having as a post-aggregation filter.

For search:

  • The planner still classifies search queries as IN_MEMORY.
  • But at execution time, the server does use the configured searchProvider when present.
  • The search provider returns candidate IDs first, and then the server runs the normal DFQL execution pipeline over those candidates.
  • So search retrieval can be provider-backed, while final DFQL shaping still happens in the store-backed/in-memory execution path.

Where The Classification Happens

The planner explicitly classifies the following as IN_MEMORY:

  • search
  • groupBy
  • having
  • relation-heavy filter patterns such as $any, $all, $none
  • dot-path filters
  • HTREE select tokens

This is the key current logic:

if (query.search) return "IN_MEMORY";
if (query.groupBy || query.having) return "IN_MEMORY";

The important implication is that the planner is not asking whether the operation is semantically valid. It is asking whether the query can be executed by the existing pushdown machinery.
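A minimal sketch of that classification step, using a hypothetical simplified `DfqlQuery` shape (the field names here are illustrative, not the real planner types):

```typescript
// Hypothetical, simplified shape of a DFQL query for illustration only.
interface DfqlQuery {
  search?: string;
  groupBy?: string[];
  having?: Record<string, unknown>;
}

type Strategy = "FULL_PUSHDOWN" | "PARTIAL_PUSHDOWN" | "IN_MEMORY";

// Mirrors the planner logic quoted above: the question is
// expressibility through pushdown, not semantic validity.
function classify(query: DfqlQuery): Strategy {
  if (query.search) return "IN_MEMORY";
  if (query.groupBy || query.having) return "IN_MEMORY";
  // Relation-heavy filters ($any/$all/$none), dot-path filters, and
  // HTREE select tokens would also land in IN_MEMORY in the real planner.
  return "FULL_PUSHDOWN";
}
```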

Why groupBy / having Are In-Memory Today

1. The pushdown path only knows how to do row retrieval, not grouped aggregation

The current pushdown implementation is built around:

  • converting DFQL filters to adapter where clauses
  • converting DFQL sort to adapter orderBy
  • calling db.findMany(...)
  • optionally calling db.count(...)
  • projecting fields afterward

That means pushdown today is designed around record retrieval, not grouped result production.
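The retrieval-only nature of that path can be sketched as follows; the `Db` interface here is a stand-in for the adapter surface described above, not the actual contract:

```typescript
// Hypothetical adapter surface; method names follow the bullets above.
interface Db {
  findMany(args: { where?: object; orderBy?: object }): Promise<Record<string, unknown>[]>;
  count(args: { where?: object }): Promise<number>;
}

// Sketch of the retrieval-only pushdown path: fetch rows, then project.
// There is no step anywhere in this flow that produces grouped rows.
async function pushdownFindMany(
  db: Db,
  where: object,
  orderBy: object,
  select: string[],
): Promise<Record<string, unknown>[]> {
  const rows = await db.findMany({ where, orderBy });
  // Field projection happens in-process, after retrieval.
  return rows.map((row) =>
    Object.fromEntries(select.map((field) => [field, row[field]])),
  );
}
```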

There is no adapter method in the current contract that expresses:

  • GROUP BY
  • aggregate functions like count, sum, avg, min, max
  • HAVING
  • aggregate-aware sorting
  • grouped pagination semantics

Because that capability is missing at the adapter boundary, the server has nowhere to push the aggregate query down to.

2. The actual aggregate executor is explicitly in-memory

Aggregate query execution currently does the following:

  1. starts from loaded records
  2. applies normal filters in-memory
  3. groups records into a Map
  4. computes each aggregation over the grouped records
  5. applies having as a filter over the grouped rows
  6. sorts grouped rows
  7. applies pagination / cursor logic

This is an in-process aggregate engine, not a translation layer to the DB.
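The steps above can be condensed into a sketch like this (a single group key and a single `count` aggregate, for brevity; the real executor is more general):

```typescript
type Row = Record<string, unknown>;

// Minimal sketch of the in-process aggregate engine described above:
// group rows into a Map, compute aggregates per bucket, then apply
// `having` as a filter over the post-aggregation rows.
function groupAndAggregate(
  rows: Row[],
  groupBy: string,
  havingMinCount: number,
): Array<{ key: unknown; count: number }> {
  const groups = new Map<unknown, Row[]>();
  for (const row of rows) {
    const key = row[groupBy];
    const bucket = groups.get(key) ?? [];
    bucket.push(row);
    groups.set(key, bucket);
  }
  const grouped = [...groups.entries()].map(([key, bucket]) => ({
    key,
    count: bucket.length, // count = number of records, not non-null fields
  }));
  // `having` runs over rows that only exist after grouping/aggregation.
  return grouped.filter((g) => g.count >= havingMinCount);
}
```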

3. having depends on post-aggregation rows

The current having behavior is implemented over rows that only exist after grouping and aggregation. In the current architecture, those rows are not represented in the adapter API. So having cannot be pushed independently; it is coupled to the lack of aggregate pushdown overall.

Why search Is Also Classified As IN_MEMORY

1. Search is treated as a two-phase operation

The current search pipeline works like this:

  1. obtain candidate IDs using either:
    • the configured searchProvider, or
    • a DB-native full-text fallback when the adapter reports native support
  2. merge the candidate IDs into a filter like id in [...]
  3. re-run the main DFQL execution path for filtering, shaping, sorting, pagination, and projection

So the planner marks search as IN_MEMORY because the query is not fully expressible via the normal adapter pushdown path alone.
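The two-phase flow can be sketched like this; `SearchProvider` and `runDfql` are illustrative stand-ins for the real interfaces:

```typescript
// Hypothetical provider interface: text in, candidate IDs out.
interface SearchProvider {
  search(text: string): Promise<string[]>;
}

// Sketch of the two-phase flow: candidate IDs first, then the normal
// DFQL pipeline runs with an `id in [...]` filter merged in.
async function executeSearch(
  provider: SearchProvider,
  text: string,
  baseFilter: Record<string, unknown>,
  runDfql: (filter: Record<string, unknown>) => Promise<unknown[]>,
): Promise<unknown[]> {
  const candidateIds = await provider.search(text);
  if (candidateIds.length === 0) return [];
  const merged = { ...baseFilter, id: { in: candidateIds } };
  return runDfql(merged);
}
```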

2. The provider is still used when available

The server-side execution branch does not skip the provider. It checks:

  • if searchProvider exists, call it
  • else if DB native full-text support exists, use that fallback
  • else return unsupported

That means the provider is very much part of execution when configured; only the planner's label is coarse.

3. Why the classification still says IN_MEMORY

The label is really about the overall execution strategy, not just candidate retrieval.

Even when the provider is used, the server still performs a follow-up DFQL execution step over the matched candidates. That is why search remains in the IN_MEMORY bucket in the current planner design.

Important Clarification About Validation

There is some stale wording in validation comments/messages that refers to requiring a searchfn plugin. The actual server configuration path is based on searchProvider, with DB-native fallback recognized as well.

So conceptually:

  • old wording suggests a plugin-only model
  • current implementation supports a pluggable searchProvider and DB-native full-text fallback

The execution wiring reflects the newer model even if some comments still use older language.

What The Current Architecture Is Optimized For

The present architecture cleanly supports three broad strategies:

  • full pushdown for simple base-resource queries
  • partial pushdown for queries where the primary resource can be fetched efficiently but relation expansion must happen afterward
  • in-memory/store-backed execution for everything that requires more expressive semantics than the adapter contract currently exposes

This is a reasonable design boundary. The issue is not that aggregation or search are invalid. The issue is that the adapter contract and pushdown layer are narrower than the full DFQL language.

What Would Need To Change For True Aggregate Pushdown

1. Extend the adapter contract

The first necessary change is at the database adapter boundary. A new aggregate-oriented method would be needed, something conceptually like:

  • aggregate(params)

That method would need to express:

  • model
  • where
  • groupBy
  • aggregations
  • having
  • orderBy
  • limit
  • offset
  • namespace
  • possibly cursor semantics later

Without this, the server has no stable way to ask an adapter for grouped results.
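One possible shape for that contract, written as TypeScript types, is shown below. All names here are proposals for discussion, not an existing @superfunctions/db API:

```typescript
// Proposed, hypothetical aggregate contract for the adapter boundary.
type AggregateOp = "count" | "sum" | "avg" | "min" | "max";

interface AggregateParams {
  model: string;
  where?: Record<string, unknown>;
  groupBy: string[];
  // alias -> operation; `field` is optional for record-counting `count`
  aggregations: Record<string, { op: AggregateOp; field?: string }>;
  having?: Record<string, unknown>;
  orderBy?: Array<{ target: string; direction: "asc" | "desc" }>;
  limit?: number;
  offset?: number;
  namespace?: string;
  // cursor semantics intentionally deferred, per the text above
}

type AggregateRow = Record<string, unknown>;

interface AggregateCapableAdapter {
  aggregate(params: AggregateParams): Promise<AggregateRow[]>;
}
```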

2. Add adapter capability flags

As with native full-text support today, aggregate support should be feature-detected. Examples:

  • capabilities.operations.aggregate
  • potentially finer-grained flags such as:
    • supports having
    • supports aggregate sort by alias
    • supports base-field groupBy only
    • supports dot-path or relation-aware grouping

This matters because not every adapter will support the full surface area at once.
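A capability shape extending the existing `operations.fulltext` pattern might look like this; none of the aggregate-related names exist yet and all are illustrative:

```typescript
// Illustrative capability shape. `operations.fulltext` mirrors the
// flag described earlier in this note; the rest is proposed.
interface AdapterCapabilities {
  operations: {
    fulltext?: boolean;  // exists today in concept
    aggregate?: boolean; // proposed coarse flag
  };
  aggregateFeatures?: {
    having?: boolean;
    sortByAlias?: boolean;
    groupedPagination?: boolean;
    baseFieldGroupByOnly?: boolean;
  };
}

function supportsAggregate(caps: AdapterCapabilities): boolean {
  return caps.operations.aggregate === true;
}
```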

3. Add a new server execution path

The server would need a new execution branch for aggregate-capable adapters.

Instead of always sending groupBy / having to the in-memory executor, the server would do something like:

  • if query has aggregate semantics and adapter supports aggregate pushdown, execute aggregate pushdown
  • otherwise keep the current in-memory fallback

This preserves compatibility while allowing adapters to opt in incrementally.
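The branch itself is small; a sketch with hypothetical names:

```typescript
// Sketch of the opt-in branch: push down when the adapter can,
// otherwise keep the existing in-memory executor as the fallback.
async function executeAggregateQuery<Q, R>(
  query: Q,
  adapterSupportsAggregate: boolean,
  pushdown: (q: Q) => Promise<R>,
  inMemoryFallback: (q: Q) => Promise<R>,
): Promise<R> {
  if (adapterSupportsAggregate) {
    return pushdown(query);
  }
  return inMemoryFallback(query);
}
```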

4. Build DFQL-to-adapter aggregate translation

The server would need a translation layer analogous to the existing filter/sort pushdown converter. That translation would need to handle:

  • pre-group filters -> adapter where
  • groupBy
  • aggregate definitions and aliases
  • having on grouped fields and aggregate aliases
  • sort terms that may reference group keys or aggregate aliases
  • grouped pagination semantics
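The translation itself would be a mapping analogous to the sketch below; the `DfqlAggregateQuery` shape and the output field names are hypothetical:

```typescript
// Hypothetical DFQL-side aggregate query shape, for illustration.
interface DfqlAggregateQuery {
  resource: string;
  filter?: Record<string, unknown>; // pre-group filter
  groupBy: string[];
  aggregate: Record<string, { op: string; field?: string }>;
  having?: Record<string, unknown>;
}

// Sketch of translating DFQL aggregate semantics into an
// adapter-level aggregate call, mirroring the bullets above.
function translateAggregate(q: DfqlAggregateQuery) {
  return {
    model: q.resource,
    where: q.filter,        // pre-group filters -> adapter where
    groupBy: q.groupBy,
    aggregations: q.aggregate,
    having: q.having,       // valid only on group keys / aggregate aliases
  };
}
```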

5. Preserve Datafn semantics exactly

One subtle but important requirement: the pushdown path must preserve the existing Datafn aggregate semantics.

The current shared aggregate utility intentionally defines count as counting records, not non-null field values. SQL adapters often distinguish:

  • COUNT(*)
  • COUNT(field)

If the new pushdown path is implemented carelessly, it could silently diverge from current in-memory results. So aggregate pushdown must match the current behavior exactly, even when the underlying SQL engine has different defaults.
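The divergence is easy to see in miniature. In the sketch below, `countRecords` reflects the record-counting semantics described above (i.e. `COUNT(*)`), while `countNonNull` reflects SQL's `COUNT(field)` behavior:

```typescript
type Row = Record<string, unknown>;

// Datafn's shared `count` counts records; SQL COUNT(field) skips nulls.
// A pushdown must emit COUNT(*) semantics to match the in-memory path.
function countRecords(rows: Row[]): number {
  return rows.length; // COUNT(*) semantics
}

function countNonNull(rows: Row[], field: string): number {
  return rows.filter((r) => r[field] != null).length; // COUNT(field) semantics
}
```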

Recommended Scope For A First Aggregate Pushdown Version

A practical first version would likely support only the simplest safe subset:

  • base-resource groupBy only
  • no relation traversal in groupBy
  • no relation expansion in select for aggregate queries
  • having only on group keys and aggregate aliases
  • sort by group keys and aggregate aliases only
  • limit/offset pagination before tackling full cursor semantics

This would unlock meaningful performance improvements without forcing the entire DFQL relation model into the first adapter API revision.

Why Keeping The In-Memory Fallback Still Matters

Even after aggregate pushdown exists, the in-memory path should remain for:

  • adapters with no aggregate support
  • memory or test adapters
  • complex aggregate queries the adapter cannot express yet
  • relation-aware grouping cases beyond the initial scope

That keeps the language portable while allowing more capable adapters to accelerate common cases.

Practical Architectural Reading

The current system should be read as:

  • FULL_PUSHDOWN: pure adapter-supported retrieval query
  • PARTIAL_PUSHDOWN: adapter-supported base retrieval plus relation expansion afterward
  • IN_MEMORY: a broader execution bucket for queries whose semantics exceed the current adapter pushdown contract

That last category includes both:

  • true in-memory computation like grouped aggregation
  • hybrid flows like provider-backed search plus store-backed DFQL shaping

So the label is slightly broader than the literal phrase suggests.

Why @superfunctions/db Is The Real Boundary

This point deserves to be explicit because it changes where the fix should live.

The root cause is not just that datafn/server lacks aggregate pushdown code. The deeper cause is that the upstream adapter contract in @superfunctions/db does not currently expose any grouped query primitive.

Today, the Adapter interface supports:

  • findOne
  • findMany
  • count
  • CRUD and batch operations
  • transactions
  • schema/lifecycle methods

It does not support anything like:

  • aggregate(...)
  • groupBy(...)
  • grouped result rows
  • aggregate-aware having
  • grouped ordering and grouped pagination

That means datafn/server cannot cleanly implement true aggregate pushdown on its own without either:

  • extending @superfunctions/db, or
  • bypassing the adapter abstraction and re-implementing DB-specific grouped query logic inside Datafn

The second option would be the wrong design move. It would duplicate responsibilities, create adapter-specific logic in the wrong layer, and make it much harder to preserve consistent semantics across adapters.

Why This Matters For Anyone Fixing The Issue

A future implementation should not start by adding aggregate SQL directly inside datafn/server pushdown code as a one-off workaround.

That would likely cause all of the following:

  • duplication of query capabilities that belong in @superfunctions/db
  • inconsistent behavior between adapters
  • difficulty supporting non-SQL or partial-capability adapters
  • risk of semantic drift from the in-memory aggregate implementation already shared by Datafn

The correct architectural reading is:

  • @superfunctions/db defines what can be pushed down
  • datafn/server translates DFQL into that contract
  • if the contract does not express grouped queries, Datafn must fall back to in-memory aggregation

Why Search Looks Different

Search already has a narrow foothold in the @superfunctions/db capability model because adapter capabilities include operations.fulltext.

That allows Datafn to distinguish:

  • provider-backed search
  • DB-native full-text fallback
  • unsupported adapters

There is no equivalent aggregate capability today. That is a strong signal that aggregate pushdown is not merely missing in Datafn, but missing at the adapter layer itself.

Semantics Risk If This Is Implemented In The Wrong Layer

The shared aggregate semantics in Datafn are not automatically the same as native SQL semantics.

A good example is count:

  • the current shared aggregate utility defines count as counting records
  • many SQL implementations distinguish COUNT(*) from COUNT(field) and handle nulls differently

If grouped queries are implemented ad hoc only inside Datafn pushdown logic, different adapters could easily end up with subtly different results.

Putting aggregate support into @superfunctions/db first creates a single place to define and normalize those semantics.

Package-By-Package Implementation Checklist

1. packages/db

This is the first place that should change.

Add an aggregate query contract to the adapter system, for example:

  • aggregate(params: AggregateParams): Promise<AggregateRow[]>

Define the supporting types for:

  • group keys
  • aggregate operations and aliases
  • having
  • grouped orderBy
  • limit/offset
  • namespace
  • optional future cursor semantics

Extend adapter capabilities with at least:

  • operations.aggregate

Potentially add finer-grained capability flags if needed, such as:

  • supports having
  • supports aggregate sort by alias
  • supports grouped pagination
  • supports only base-field groupBy

2. packages/db adapter implementations

After the contract exists, implement it in the concrete adapters where feasible:

  • Drizzle adapter
  • Kysely adapter
  • Prisma adapter
  • memory adapter as a reference/fallback implementation if it makes sense

The implementation must preserve Datafn’s existing aggregate semantics rather than leaking raw ORM/SQL defaults.

3. datafn/server

Only after the adapter contract exists should Datafn’s pushdown path be extended.

Changes would include:

  • adding DFQL-to-adapter aggregate translation
  • branching to aggregate pushdown when the adapter reports support
  • keeping the current in-memory aggregate executor as fallback
  • updating planner classification so groupBy/having are not unconditional IN_MEMORY when aggregate pushdown is available

4. datafn/core

Keep datafn/core as the source of shared aggregate semantics.

If new aggregate behavior is introduced, validate whether the shared utility remains the semantic reference for:

  • count
  • null handling
  • alias behavior
  • aggregate result row shape

That way pushdown and in-memory paths stay aligned.

5. Docs and tests

Update docs and tests in both layers:

  • @superfunctions/db docs for the new adapter API
  • Datafn docs for aggregate pushdown behavior and fallback behavior
  • adapter contract tests for grouped queries
  • Datafn integration tests that assert pushdown and in-memory paths return the same results
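The last bullet's equivalence check could take roughly this shape; `GroupedRow` and the helpers are illustrative, and the normalization step exists so ordering differences between backends do not cause false failures:

```typescript
// Sketch of the equivalence assertion the integration tests should
// make: pushdown and in-memory paths return identical grouped rows.
type GroupedRow = { key: string; count: number };

// Sort deterministically so backend ordering differences are ignored.
function normalize(rows: GroupedRow[]): GroupedRow[] {
  return [...rows].sort((a, b) => a.key.localeCompare(b.key));
}

function assertSameResults(pushdown: GroupedRow[], inMemory: GroupedRow[]): void {
  const a = JSON.stringify(normalize(pushdown));
  const b = JSON.stringify(normalize(inMemory));
  if (a !== b) throw new Error(`pushdown/in-memory divergence: ${a} vs ${b}`);
}
```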

Guardrail For Future Work

A good rule for anyone implementing this:

If the feature requires expressing a new class of database query, first ask whether @superfunctions/db can represent it.

If the answer is no, the work should begin in packages/db, not in datafn/server.

Conclusion

The current behavior is internally consistent:

  • groupBy and having are IN_MEMORY because grouped aggregation is implemented entirely in the server and the adapter contract has no aggregate query primitive.
  • search is also classified as IN_MEMORY, but that does not mean the provider is ignored. The provider is used to retrieve candidate IDs; the final DFQL query is then executed over those candidates in the normal server pipeline.
  • The missing piece is not only in Datafn. The real boundary is @superfunctions/db, which currently lacks an aggregate query contract and aggregate capability flags.
  • True aggregate pushdown should therefore start in packages/db, then be consumed by datafn/server, rather than being reimplemented as duplicate DB logic in Datafn.

Suggested Follow-Up

If we want to take this from analysis to implementation, the clean next step is a small design/spec pass that defines:

  • the aggregate adapter method signature in @superfunctions/db
  • capability flags in @superfunctions/db
  • exact semantics for count, having, sorting, and pagination
  • the minimum supported aggregate subset for v1
  • the Datafn-side execution and fallback plan once the adapter contract exists

That would let the implementation proceed without ambiguity, without duplication across layers, and without accidental semantic drift from the current in-memory executor.
