feat(query-enhancements): PPL lint backend — feature flag + explain/calcite proxy routes #12255

Open
Hanyu-W wants to merge 4 commits into opensearch-project:main from Hanyu-W:ppl-lint-backend
@Hanyu-W Hanyu-W commented Jun 18, 2026

Description

Backend plumbing for the PPL linter, adding the feature flag and two read-only proxy routes.

  • Feature flag: queryEnhancements.pplLint capability, default false, resolved at runtime via a DynamicConfigService capability switcher (same pattern as explore).
  • POST /api/enhancements/ppl/explain — proxies a PPL query to /_plugins/_ppl/_explain and returns the Calcite execution plan. Validates a non-empty query; supports optional dataSourceId.
  • GET /api/enhancements/ppl/calcite_settings — reads /_cluster/settings (scoped via filter_path) and returns { calciteEnabled, allJoinTypesAllowed }. Fails open on errors so a settings-read failure never blocks the editor; logs 401/403 at warn.

Issues Resolved

Backend plumbing for opensearch-project/sql#5405

Screenshot

N/A — no UI changes.

Testing the changes

  1. Run unit tests for the new routes and capability switcher:
    node scripts/jest.js \
      src/plugins/query_enhancements/server/plugin.test.ts \
      src/plugins/query_enhancements/server/routes/ppl_calcite_settings.test.ts \
      src/plugins/query_enhancements/server/routes/ppl_explain.test.ts
  2. Run the full plugin server suite to confirm no regressions:
    node scripts/jest.js src/plugins/query_enhancements/server
  3. Point the dev server at a live opensearch-sql cluster and hit GET /api/enhancements/ppl/calcite_settings — verify the scoped filter_path response matches the full unfiltered /_cluster/settings output for calciteEnabled and allJoinTypesAllowed.

Check List

  • All tests pass
    • yarn test:jest
    • yarn test:jest_integration
  • New functionality includes testing
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

…xy routes

Add the disabled-by-default queryEnhancements.pplLint capability
(DynamicConfigService switcher, mirrors agent_traces) and two read-only
OpenSearch proxy routes (_ppl/_explain, _cluster/settings) that the PPL
linter will consume. Feature is OFF and inert; no client wiring yet.

Signed-off-by: Hanyu Wei <weihanyu@amazon.com>

return res.ok({
body: {
calciteEnabled: resolveValue('plugins.calcite.enabled') !== 'false',

Member

p0: backward-compat issue with calciteEnabled default

resolveValue(...) !== 'false' returns true when the key is absent (undefined !== 'false'). On an older opensearch-sql cluster that predates plugins.calcite.enabled, the key is genuinely absent and the settings read succeeds (so we are on this success path, not the catch block), yet we report calciteEnabled: true for a cluster that has no Calcite engine at all.

The request already sends include_defaults=true, so any cluster that knows the setting surfaces it in the defaults bucket even when unset. Absence therefore reliably indicates that the cluster does not have Calcite, and the safe interpretation is the opposite of the current default. Since calciteEnabled gates the explain-based lint rules, the current behavior makes the editor fire _explain calls on every keystroke against clusters that cannot support them.

Suggested fix is to invert the success-path default:

calciteEnabled: resolveValue('plugins.calcite.enabled') === 'true',

New clusters still resolve correctly (the default is surfaced); old clusters correctly resolve to false. The catch-block fail-open (true) is a separate decision and can stay as-is. Worth confirming the intended old-cluster behavior with the SQL team either way.

Member

+1

@Hanyu-W Hanyu-W Jun 22, 2026

Author

Good catch. Confirmed that plugins.calcite.enabled is registered (with default true) only when the SQL plugin loads. I inverted the check to === 'true', left the catch-block fail-open (true) intact and added a comment: a 200 with the key absent is definitive "no Calcite," whereas an error can't distinguish "no plugin" from "transient failure." Side effect: calciteEnabled now matches the === 'true' idiom already used by allJoinTypesAllowed.
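
The difference between the two comparisons can be illustrated with a minimal sketch. It assumes a hypothetical `makeResolveValue` helper that reads flat setting keys in transient, persistent, defaults order and normalizes values with `String()`; the real `resolveValue` in ppl_calcite_settings.ts is not shown in this thread and may differ.

```typescript
// Hypothetical shape of a /_cluster/settings response with flat keys.
// The real route's response handling is not shown in this thread.
type SettingsBucket = Record<string, unknown>;
interface ClusterSettings {
  transient?: SettingsBucket;
  persistent?: SettingsBucket;
  defaults?: SettingsBucket;
}

// Assumed lookup order: transient, then persistent, then defaults.
// Values are normalized with String() so typed booleans compare as strings.
function makeResolveValue(settings: ClusterSettings) {
  return (key: string): string | undefined => {
    const raw =
      settings.transient?.[key] ?? settings.persistent?.[key] ?? settings.defaults?.[key];
    return raw === undefined || raw === null ? undefined : String(raw);
  };
}

const KEY = 'plugins.calcite.enabled';

// Original check: an absent key (undefined) is not 'false', so this yields true.
const legacyCheck = (s: ClusterSettings): boolean => makeResolveValue(s)(KEY) !== 'false';

// Inverted check: only an explicit 'true' counts as enabled.
const strictCheck = (s: ClusterSettings): boolean => makeResolveValue(s)(KEY) === 'true';

// Older cluster: the settings read succeeds but the key is absent everywhere.
const oldCluster: ClusterSettings = { persistent: {}, defaults: {} };
// Calcite-capable cluster: the default is surfaced via include_defaults=true.
const newCluster: ClusterSettings = { defaults: { [KEY]: 'true' } };
// Operator opted out with a typed boolean in the persistent bucket.
const optedOut: ClusterSettings = { persistent: { [KEY]: false }, defaults: { [KEY]: 'true' } };
```

Under these assumptions the two checks disagree only for `oldCluster`, which is the case the review flagged: `legacyCheck` reports it as enabled, `strictCheck` reports it as disabled.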

method: 'GET',
path: EXPECTED_PATH,
});
// Absent setting => calcite treated as enabled (not 'false'), join types not allowed.

Member

p0: test locks in the absent-setting default

This test currently encodes the behavior flagged on ppl_calcite_settings.ts (// Absent setting => calcite treated as enabled, asserting calciteEnabled: true). If the success-path default is inverted, this should flip to expect false, and it would be worth renaming or adding a case explicitly framed as an older cluster without plugins.calcite.enabled resolving to calcite disabled, so the backward-compat intent is captured in the test name.

Author

Done — flipped 'uses the core client …' to expect calciteEnabled: false and added an explicitly-named case ('reports calciteEnabled:false for a cluster missing plugins.calcite.enabled') so the backward-compat intent is self-documenting. The catch-path tests ('swallows transport errors', 'logs auth failures at warn') keep asserting the fail-open true, which documents the success-vs-error asymmetry side by side.

* otherwise, or `null` when a dataSourceId is requested but the data source
* plugin is unavailable (the caller should respond 400 in that case).
*/
export async function resolveOpenSearchClient(

Member

p1: add a direct multi-data-source test for resolveOpenSearchClient

This helper is the seam for the multi-data-source requirement in the parent RFC (sql#5405 section 2.9 plumbs dataSourceId end-to-end), but it is only tested indirectly through the two routes, and no test exercises two different dataSourceId values resolving to distinct clients. getClient is a single mock, so we currently prove that the id is forwarded, not that the right client comes back per id.

Suggested coverage in a dedicated unit test for this helper: ds-1 resolves to getClient('ds-1') and ds-2 resolves to getClient('ds-2') returning distinct clients, no id resolves to asCurrentUser, and an absent context.dataSource resolves to null. Cheap insurance given the rest of the lint feature builds on this.

Author

Added a describe('resolveOpenSearchClient') to index.test.ts covering: two distinct dataSourceIds resolve to distinct clients (not just id-forwarding), no id resolves to asCurrentUser, and a dataSourceId with context.dataSource absent resolves to null.
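
The coverage discussed in this thread can be sketched roughly as follows. Everything here is hypothetical: `resolveClientSketch`, `SketchContext`, and the fake `getClient` are illustrative stand-ins, the sketch is synchronous for brevity while the real helper is async, and the actual signature of `resolveOpenSearchClient` is not shown in full above.

```typescript
// Hypothetical minimal types; the real request context is richer and the
// real helper is async. This sketch is synchronous for brevity.
interface Client {
  name: string;
}
interface SketchContext {
  core: { opensearch: { client: { asCurrentUser: Client } } };
  dataSource?: { opensearch: { getClient: (id: string) => Client } };
}

// Returns the data source client when an id is given, the current-user client
// otherwise, or null when an id is given but the data source plugin is absent.
function resolveClientSketch(context: SketchContext, dataSourceId?: string): Client | null {
  if (!dataSourceId) return context.core.opensearch.client.asCurrentUser;
  if (!context.dataSource) return null;
  return context.dataSource.opensearch.getClient(dataSourceId);
}

// A fake getClient that returns a distinct, stable client per id, so a test can
// prove the right client comes back rather than only that the id was forwarded.
const clients = new Map<string, Client>();
const getClient = (id: string): Client => {
  if (!clients.has(id)) clients.set(id, { name: `client-${id}` });
  return clients.get(id)!;
};

const currentUser: Client = { name: 'current-user' };
const withDataSource: SketchContext = {
  core: { opensearch: { client: { asCurrentUser: currentUser } } },
  dataSource: { opensearch: { getClient } },
};
const withoutDataSource: SketchContext = {
  core: { opensearch: { client: { asCurrentUser: currentUser } } },
};
```

The per-id fake is the design point: a single shared mock return value can only show that the id was passed along, whereas distinct clients per id let the assertions check identity.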

{
path: API.PPL_EXPLAIN,
validate: {
body: schema.object({ query: schema.string({ minLength: 1 }) }),

Member

p1: confirm request guarding for a per-keystroke route

This proxies arbitrary PPL to _explain with only minLength: 1 on query. It is read-only and auth is enforced by the downstream client, so the risk is low, but per the RFC this route is hit on every keystroke (debounced). Worth confirming the debounce and abort logic lives client-side, and whether a minimal server-side guard (max query length) is wanted. Non-blocking, flagging for the follow-up that wires the editor.

Author

Confirmed — the keystroke throttling lives client-side and lands with the editor-wiring follow-up PR, not this backend slice. Specifically: a 500 ms trailing-edge debounce per model, plus an ExplainCache that dedups in-flight requests and LRU-caches results per (dataSourceId, query), so repeated lint passes over the same text issue at most one _explain call. There's no explicit AbortController on the explain request — the debounce + dedup make a superseded response a harmless cache write, so cancellation wasn't needed (easy to add later if we want it). On the server side, I added the maxLength: 65536 guard on the query schema here in 4ac79ae as the minimal request bound you flagged, independent of the global server.maxPayload.
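
For readers who want a picture of the client-side behavior described above, here is a rough sketch. The class name `ExplainCacheSketch`, its `get` method, the `maxEntries` default, and `debounceTrailing` are all hypothetical; the real ExplainCache lands in the follow-up PR and is not shown in this thread. The sketch assumes that caching the promise itself is an acceptable way to get both in-flight dedup and result caching.

```typescript
// Hypothetical sketch of the ExplainCache idea; not the follow-up PR's code.
type ExplainFn = (dataSourceId: string | undefined, query: string) => Promise<unknown>;

class ExplainCacheSketch {
  // A Map iterates in insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, Promise<unknown>>();
  private readonly fetchPlan: ExplainFn;
  private readonly maxEntries: number;

  constructor(fetchPlan: ExplainFn, maxEntries = 50) {
    this.fetchPlan = fetchPlan;
    this.maxEntries = maxEntries;
  }

  get(dataSourceId: string | undefined, query: string): Promise<unknown> {
    const key = JSON.stringify([dataSourceId ?? null, query]);
    const cached = this.entries.get(key);
    if (cached) {
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, cached);
      return cached;
    }
    // Storing the promise dedups in-flight requests: a second caller for the
    // same key receives the same pending promise.
    const pending = this.fetchPlan(dataSourceId, query);
    this.entries.set(key, pending);
    // Drop failed requests so a later lint pass can retry.
    pending.catch(() => {
      if (this.entries.get(key) === pending) this.entries.delete(key);
    });
    // Evict the least recently used entry once the cap is exceeded.
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    return pending;
  }
}

// Trailing-edge debounce: only the last call within waitMs runs.
function debounceTrailing<A extends unknown[]>(fn: (...args: A) => void, waitMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}
```

With this shape a superseded response is just a cache write, which matches the reasoning given for not wiring an AbortController; a lint pass would call something like `debounceTrailing(runLint, 500)` per editor model.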

github-actions Bot commented Jun 22, 2026

Contributor

✅ All unit and integration tests passing

🔗 Workflow run · commit 031498a11bb4f1aaf9f2453da37371604ce4a572

}),
// PPL linter feature flag, read at runtime via DynamicConfigService and
// surfaced as the queryEnhancements.pplLint capability. Disabled by default.
pplLint: schema.object({

Member

can this be ppl.lint.enabled, or lintEnabled: ["ppl"]? i'm thinking if we add more languages in the future, can the config path be more structured?

Author

Good call — restructured to queryEnhancements.ppl.lint.enabled in 031498a (config schema now nests ppl: { lint: { enabled } }) so that future languages/features extend the same shape.


return res.ok({
body: {
calciteEnabled: resolveValue('plugins.calcite.enabled') !== 'false',

Member

+1

Hanyu Wei added 2 commits June 22, 2026 12:36
Respond to ps48's review on opensearch-project#12255:

- calciteEnabled: invert `!== 'false'` to `=== 'true'` so a successful
  cluster-settings read with `plugins.calcite.enabled` absent reports the
  cluster as having no Calcite engine (disabled), instead of defaulting it
  on. include_defaults=true surfaces the key on any Calcite-capable
  cluster, so its absence is definitive. Document the deliberate asymmetry
  with the catch block, which keeps failing open (an error can't tell
  "no plugin" from a transient failure).
- Flip the matching test assertion and add an explicitly-named case for a
  cluster missing the key (no/old SQL plugin) to pin the backward-compat
  contract.
- Add a direct unit test block for resolveOpenSearchClient: distinct
  dataSourceIds resolve to distinct clients, no id resolves to
  asCurrentUser, and a dataSourceId with the data source plugin
  unavailable resolves to null.
- explain route: add `maxLength: 65536` to the query schema. The body was
  already bounded by server.maxPayload (1 MiB); this makes the cap
  explicit and independent of global config.

Signed-off-by: Hanyu Wei <weihanyu@amazon.com>

Address joshuali925's review on opensearch-project#12255: restructure the config path from
queryEnhancements.pplLint.enabled to queryEnhancements.ppl.lint.enabled so
future languages/features (ppl.autocomplete, sql.lint) extend the same
shape. The wire capability stays a flat boolean queryEnhancements.pplLint,
so capability consumers are unaffected; only the DynamicConfigService read
in the switcher changes (config.ppl?.lint?.enabled === true).

Signed-off-by: Hanyu Wei <weihanyu@amazon.com>
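
The mapping from the nested config path to the flat capability can be sketched as a pure function. The type names and `toPplLintCapability` are hypothetical; the real switcher reads the config through DynamicConfigService and the real schema is defined with the plugin's config schema, neither of which is reproduced here.

```typescript
// Hypothetical types standing in for the plugin's config and capabilities.
interface QueryEnhancementsConfigSketch {
  ppl?: { lint?: { enabled?: boolean } };
}

interface CapabilitiesPatch {
  queryEnhancements: { pplLint: boolean };
}

// Maps the nested config path to the flat capability. Anything other than an
// explicit `true` (missing keys, undefined config) leaves the feature off.
function toPplLintCapability(
  config: QueryEnhancementsConfigSketch | undefined
): CapabilitiesPatch {
  return { queryEnhancements: { pplLint: config?.ppl?.lint?.enabled === true } };
}
```

Keeping the capability flat while nesting the config means only this one read changes when the config path is restructured, which is the trade-off the commit message describes.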
@Hanyu-W Hanyu-W requested review from joshuali925 and ps48 June 23, 2026 17:27

@ps48 ps48 left a comment

Member

Thanks for the fixes

@github-actions

Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 10285e6.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| src/plugins/query_enhancements/server/routes/ppl_calcite_settings.ts | 68 | medium | Deliberate fail-open on authentication failures (401/403): when the cluster-settings request is rejected for auth reasons, the route returns HTTP 200 with calciteEnabled:true rather than propagating the auth error. This means a permission failure silently enables the Calcite code path instead of blocking it, potentially allowing lint features to activate in environments where the operator explicitly denied access. |
| src/plugins/query_enhancements/server/routes/ppl_explain.ts | 44 | low | The route accepts an arbitrary PPL query string (up to 64 KB) from the request body and forwards it verbatim to the internal OpenSearch /_plugins/_ppl/_explain endpoint. While protected by the user's own credentials via resolveOpenSearchClient, this proxy pattern could be used to probe internal OpenSearch cluster topology or exercise OpenSearch parser edge cases. No sanitization beyond length bounds is applied. |

The table above displays the top 10 most important findings.

Total: 2 | Critical: 0 | High: 0 | Medium: 1 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions

Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Fail-open default may mislead clients

On any transport error (including network/transient failures and 401/403), the route returns { calciteEnabled: true, allJoinTypesAllowed: false }. This conflates "no Calcite plugin / cluster down / unauthorized" with "Calcite enabled," which can cause clients (lint rules) to apply Calcite-specific behavior on a cluster that does not support it. Consider returning a distinct error/unknown indicator (or at least differentiating 401/403 from other failures) so the client can decide rather than silently assuming enabled.

} catch (err) {
  const status = (err as { statusCode?: number; meta?: { statusCode?: number } })?.statusCode;
  const metaStatus = (err as { meta?: { statusCode?: number } })?.meta?.statusCode;
  const message = err instanceof Error ? err.message : String(err);
  // Fail open: a missing/failed cluster-settings read must not block the
  // editor. Calcite is assumed enabled (the engine default) so lint rules
  // still run. Surface auth/permission failures at warn so an operator can
  // see them; everything else stays at debug.
  if (status === 401 || status === 403 || metaStatus === 401 || metaStatus === 403) {
    logger.warn(`PPL calcite settings unauthorized (${status ?? metaStatus}): ${message}`);
  } else {
    logger.debug(`PPL calcite settings error: ${message}`);
  }
  return res.ok({ body: { calciteEnabled: true, allJoinTypesAllowed: false } });
}
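
One possible shape for the "distinct indicator" idea raised above is sketched here. It is not the PR's behavior: `classifySettingsError`, `fallbackBody`, and the extra `source` field are hypothetical, and the sketch keeps the fail-open values while only labeling them as a fallback.

```typescript
// Hypothetical error shape, mirroring the two places the catch block above
// reads a status code from.
interface TransportErrorLike {
  statusCode?: number;
  meta?: { statusCode?: number };
}

type SettingsReadOutcome = 'unauthorized' | 'unknown';

// Classifies an error as an auth failure if either status location holds 401 or 403.
function classifySettingsError(err: unknown): SettingsReadOutcome {
  if (typeof err !== 'object' || err === null) return 'unknown';
  const e = err as TransportErrorLike;
  const statuses = [e.statusCode, e.meta?.statusCode];
  return statuses.some((s) => s === 401 || s === 403) ? 'unauthorized' : 'unknown';
}

// Hypothetical response body: the fail-open values stay, and `source` tells the
// client that they are a fallback rather than a real settings reading.
function fallbackBody(err: unknown) {
  return {
    calciteEnabled: true,
    allJoinTypesAllowed: false,
    source: classifySettingsError(err),
  };
}
```

A client could then choose to skip explain-based rules when `source` is present, while existing consumers that only read the two booleans would be unaffected.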

@github-actions

Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Security
Avoid leaking backend error details to clients

Forwarding the raw error message directly to the client may leak internal/backend
details (stack traces, cluster paths, indices). Consider returning a generic message
to the client while logging the detailed message server-side, especially for 5xx
errors.

src/plugins/query_enhancements/server/routes/ppl_explain.ts [56-61]

 const message = e.message ?? 'Failed to explain PPL query';
+const statusCode = coerceStatusCode(e.status ?? e.statusCode ?? e.meta?.statusCode);
 logger.debug(`PPL explain error: ${message}`);
 return res.custom({
-  statusCode: coerceStatusCode(e.status ?? e.statusCode ?? e.meta?.statusCode),
-  body: message,
+  statusCode,
+  body: statusCode >= 500 ? 'Failed to explain PPL query' : message,
 });
Suggestion importance[1-10]: 5

Why: Reasonable security-hygiene suggestion to avoid leaking backend error details on 5xx responses, though the existing definePPLBundleRoute follows the same pattern, so impact is moderate.

Low
General
Make boolean string comparison case-insensitive

The String(raw) normalization already covers typed booleans: String(true) yields
'true', which matches, and String(false) yields 'false', which correctly fails the
=== 'true' check. The comparison against lowercase 'true' would only miss values
that arrive capitalized (for example 'True'). Consider accepting case-insensitive
matches to be robust against different transport serializations.

src/plugins/query_enhancements/server/routes/ppl_calcite_settings.ts [58-59]

-calciteEnabled: resolveValue('plugins.calcite.enabled') === 'true',
-allJoinTypesAllowed: resolveValue('plugins.calcite.all_join_types.allowed') === 'true',
+calciteEnabled: resolveValue('plugins.calcite.enabled')?.toLowerCase() === 'true',
+allJoinTypesAllowed: resolveValue('plugins.calcite.all_join_types.allowed')?.toLowerCase() === 'true',
Suggestion importance[1-10]: 2

Why: OpenSearch cluster settings return lowercase 'true'/'false' strings, and String(true) also produces lowercase 'true', so case-insensitive matching is unnecessary. Marginal defensive improvement.

Low

5 participants