
Feat/schema learning in process #377

Merged
tonyalaribe merged 7 commits into master from feat/schema-learning-in-process
May 10, 2026

Conversation

@tonyalaribe

Contributor

Closes #

How to test

Checklist

  • Make sure you have described your changes and added all relevant screenshots or data.
  • Make sure your changes are tested (stories and/or unit, integration, or end-to-end tests).
  • Make sure to add/update documentation regarding your changes (or request one from the team).
  • You are NOT deprecating/removing a feature.

tonyalaribe and others added 7 commits May 10, 2026 05:44
…arning

Two-tier catalog: instance-wide schema_template (structure-only, dedup'd
by hash) + per-project schema_catalog (values, counts, anomaly state) +
schema_summary (materialized AI/query-editor doc). Pipeline lives in
Pkg.SchemaLearning (Hot/Catalog/Worker/OpenApi); per-flush diffing in
Worker.flushDirty replaces the legacy DB triggers.

Drops apis.shapes / apis.fields / apis.formats / apis.facet_summaries
and their anomaly triggers (migrations 0089 + 0090).
…ields/formats

projectCacheById joined apis.shapes on every cache rebuild and filtered
on sh.hash IS NOT NULL, so once 0090 dropped the table the cache for
every project came back empty and the request path 500'd.

Removes the join (and the now-unused shapeHashes field on ProjectCache)
and strips the legacy shapes/fields/formats migration + delete steps
from migrateAndDeleteMergedEndpoints — the schema-learning catalog
re-derives structure per canonical key, so no row migration is needed.
projects.redacted_fields.field_category was dropped via CASCADE in 0090
(the apis.field_category enum it depended on was removed). The hot-path
projectCache rebuild query still referenced it, throwing
"column rf.field_category does not exist" on every ingestion batch and
killing prod again. Replaces it with an empty middle segment so the
'<>'-separated string format stays compatible with any downstream split.
@tonyalaribe tonyalaribe merged commit 8ec90d2 into master May 10, 2026
6 of 7 checks passed
@tonyalaribe tonyalaribe deleted the feat/schema-learning-in-process branch May 10, 2026 12:24
@claude

claude Bot commented May 10, 2026


Code Review

What this PR does

Fixes a broken SQL query in projectCacheById caused by migration 0090_drop_legacy_schema_tables.sql, which drops apis.field_category via DROP TYPE … CASCADE. That cascade also drops the field_category column from projects.redacted_fields, making the old rf.field_category reference invalid. The fix hardcodes an empty second segment so the three-part <>-delimited format is preserved.


Issues

1. Outdated struct/comment — Projects.hs:334-335

The doc comment still says:

-- [endpointHash]<>[field_category eg requestBody]<>[field_key_path]
-- Those redact fields that don't have endpoint or field_category attached, would be applied to every endpoint and field category.

field_category no longer exists. The comment should be updated to say that the middle segment is now always empty; otherwise future readers will assume it still carries a field category.
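One possible wording for the updated comment, attached to a small pure helper that renders the key. This is only a sketch: `renderRedactKey` is a hypothetical name rather than something in Projects.hs, and `String` is used in place of whatever text type the module actually uses so that the snippet depends on nothing beyond base.

```haskell
import Data.Maybe (fromMaybe)

-- | Redacted-field key format: [endpointHash]<><>[field_key_path]
-- The middle segment used to hold field_category (e.g. requestBody); that
-- column was dropped by migration 0090, so the segment is always empty now.
-- A key without an endpoint hash applies to every endpoint.
--
-- Hypothetical helper, for illustration only.
renderRedactKey :: Maybe String -> String -> String
renderRedactKey endpointHash keyPath =
  fromMaybe "" endpointHash ++ "<><>" ++ keyPath
```

For example, `renderRedactKey Nothing "$.password"` yields a key with two leading separators, matching what the SQL produces for a global redaction.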

2. Overly verbose SQL comment

The added SQL comment spans three lines explaining the migration context. Given the project's preference for conciseness, a single line is sufficient:

-- field_category col dropped by 0090 CASCADE; middle segment is always empty now.

3. concat call can be simplified

concat(rf.endpoint_hash,'<>','<>', rf.path)

Passing '<>' twice as two separate literal arguments is slightly indirect. The intent is clearer as:

concat(rf.endpoint_hash, '<><>', rf.path)

Both are equivalent (PostgreSQL concat skips NULL arguments, so a NULL hash is still safe), but the single-literal form is more readable.
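The equivalence can be checked with a tiny pure model of the two SQL constructs. This is a sketch under stated assumptions, not the project's code: `pgConcat` and `pgPipe` are hypothetical names that model PostgreSQL's documented NULL handling (concat ignores NULL arguments, `||` returns NULL if either operand is NULL), with `Nothing` standing in for NULL.

```haskell
import Data.Maybe (catMaybes)

-- | Model of PostgreSQL concat(): NULL arguments (Nothing) are skipped.
pgConcat :: [Maybe String] -> String
pgConcat = concat . catMaybes

-- | Model of the || operator: a NULL operand makes the whole result NULL.
pgPipe :: Maybe String -> Maybe String -> Maybe String
pgPipe a b = (++) <$> a <*> b

-- | The form used in the PR: two separate '<>' literals.
fourArgKey :: Maybe String -> String -> String
fourArgKey hash path = pgConcat [hash, Just "<>", Just "<>", Just path]

-- | The suggested form: one '<><>' literal.
threeArgKey :: Maybe String -> String -> String
threeArgKey hash path = pgConcat [hash, Just "<><>", Just path]
```

In this model `fourArgKey` and `threeArgKey` agree for every hash, including `Nothing`, whereas `pgPipe Nothing (Just "<><>$.password")` collapses to `Nothing`, which is why switching to `||` would lose global redactions.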

4. No test/CI coverage for the affected query

The PR checklist has all items unchecked, and there's no test exercising projectCacheById against a schema that has field_category dropped. Given that this query failed only at runtime, on every ingestion batch, until the fix landed, a regression test here would prevent recurrence.


Non-issues

  • The NULL-safety of concat vs. || is correctly preserved — rf.endpoint_hash can be NULL (global redactions), and concat treats NULL as empty string, which is the desired behaviour.
  • Preserving the three-segment format is correct: downstream consumers (redactJSON + any split-on-<> callers) won't need changes.
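To illustrate why split-on-<> consumers keep working, here is a minimal sketch of such a consumer. It assumes the consumer splits the key into three positional segments; the real redactJSON may do this differently, and `splitOnSep` and `parseRedactKey` are hypothetical names written against base only.

```haskell
import Data.List (intercalate, isPrefixOf)

-- | The separator used in the redacted-field key.
sep :: String
sep = "<>"

-- | Split a string on every occurrence of the separator.
splitOnSep :: String -> [String]
splitOnSep = go ""
  where
    go acc [] = [reverse acc]
    go acc s@(c : cs)
      | sep `isPrefixOf` s = reverse acc : go "" (drop (length sep) s)
      | otherwise = go (c : acc) cs

-- | Parse a key into (endpointHash, middle segment, key path).
-- Anything after the second separator is treated as the path.
-- Keys with fewer than three segments are rejected.
parseRedactKey :: String -> Maybe (String, String, String)
parseRedactKey key = case splitOnSep key of
  (h : c : rest@(_ : _)) -> Just (h, c, intercalate sep rest)
  _ -> Nothing
```

A key written before 0090 (with requestBody in the middle) and one written after it (with an empty middle) both parse into three segments, so a positional consumer of this shape needs no change.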
