feat(data-warehouse): add multi-schema support for Postgres sources #58694

Open
RishiiGamer2201 wants to merge 4 commits into PostHog:master from RishiiGamer2201:feature/multi-schema-postgres-support

Conversation

@RishiiGamer2201

Summary

Adds support for discovering and syncing tables from multiple schemas in PostgreSQL databases.

Changes

  • Backend (posthog/temporal/data_imports/sources/):

    • Added include_all_schemas: bool field to PostgresSourceConfig
    • Added new SourceFieldSwitchConfig schema type for toggle UI
    • Updated get_schemas() to fetch all non-system schemas when flag is enabled
    • Updated validation to allow skipping schema when "include all schemas" is enabled
  • Frontend (products/data_warehouse/frontend/):

    • Added switch type handling in SourceForm.tsx to render the toggle

How it works

When a user enables the "Include all schemas" toggle in the PostgreSQL source configuration, the backend passes None as the schema parameter, so discovery fetches tables from all non-system schemas (this is the existing behavior when no schema filter is given). When the toggle is disabled, discovery is restricted to the single configured schema, as before.
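
As a rough illustration of the behavior described above, the sketch below models the schema filter in plain Python. `PostgresConfigSketch`, `effective_schema`, `discover_tables`, and the `SYSTEM_SCHEMAS` set are hypothetical names used only for illustration, and the set of schemas treated as "system" schemas is an assumption rather than PostHog's actual list.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: which schemas count as "system" schemas; the real list may differ.
SYSTEM_SCHEMAS = {"pg_catalog", "information_schema", "pg_toast"}


@dataclass
class PostgresConfigSketch:
    """Hypothetical stand-in for PostgresSourceConfig."""

    schema: Optional[str] = None
    include_all_schemas: bool = False


def effective_schema(config: PostgresConfigSketch) -> Optional[str]:
    # None means "no single-schema filter": discover across all non-system schemas.
    return None if config.include_all_schemas else config.schema


def discover_tables(
    all_tables: List[Tuple[str, str]], schema: Optional[str]
) -> List[Tuple[str, str]]:
    """Filter (schema, table) pairs the way discovery is described above."""
    if schema is None:
        return [(s, t) for s, t in all_tables if s not in SYSTEM_SCHEMAS]
    return [(s, t) for s, t in all_tables if s == schema]
```

With the toggle on, `effective_schema` returns `None` and `discover_tables` keeps every pair outside the assumed system schemas; with it off, only pairs in the configured schema are kept.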

Testing

To test:

  1. Set up PostHog development environment
  2. Navigate to Data Warehouse → Add Source → PostgreSQL
  3. Enable "Include all schemas" toggle
  4. Verify tables from multiple schemas are discovered

Closes #58643

- Add include_all_schemas field to PostgresSourceConfig
- Add SourceFieldSwitchConfig for toggle UI in frontend
- Update backend to use all schemas when flag is enabled
- Update validation to allow no schema when include_all_schemas is true

Closes PostHog#58643
Copilot AI review requested due to automatic review settings May 17, 2026 11:00
@assign-reviewers-posthog assign-reviewers-posthog Bot requested review from a team May 17, 2026 11:00

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new "switch" field type for source configs and uses it in the Postgres source to expose an "Include all schemas" toggle. When enabled, schema discovery scans all non-system schemas.

Changes:

  • New SourceFieldSwitchConfig schema/type and frontend renderer using LemonSwitch.
  • Postgres source gains an include_all_schemas boolean config that bypasses the single-schema filter during discovery.
  • Credential validation relaxed to allow missing schema when include_all_schemas is true.
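
The relaxed validation mentioned in the last bullet can be sketched roughly as follows. `validate_schema_requirement` is a hypothetical helper, not a function from the PR; it reuses the condition and error message quoted elsewhere in this thread and leaves out the access-method branch for brevity.

```python
from typing import Optional, Tuple


def validate_schema_requirement(
    schema: Optional[str],
    schema_name: Optional[str],
    include_all_schemas: bool,
) -> Tuple[bool, Optional[str]]:
    """Hypothetical helper: a schema is only required when the
    'Include all schemas' toggle is off."""
    # Treat whitespace-only or non-string values as "no schema given".
    cleaned = schema.strip() if isinstance(schema, str) else ""
    if not cleaned and not schema_name and not include_all_schemas:
        return (
            False,
            "Schema is required for warehouse imports unless 'Include all schemas' is enabled.",
        )
    return True, None
```

For example, `validate_schema_requirement(None, None, False)` fails with the message above, while the same call with `include_all_schemas=True` passes.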

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| posthog/schema.py | Adds SourceFieldSwitchConfig pydantic model. |
| posthog/temporal/data_imports/sources/common/base.py | Includes switch config in source field type union. |
| posthog/temporal/data_imports/sources/generated_configs.py | Adds include_all_schemas to PostgresSourceConfig. |
| posthog/temporal/data_imports/sources/postgres/source.py | Wires up new switch field and applies it to schema discovery and validation. |
| products/.../forms/SourceForm.tsx | Renders the new switch field type via LemonField + LemonSwitch. |

>
{({ value, onChange }) => (
<LemonSwitch
checked={value || lastValue?.[field.name] || false}
Comment on lines +426 to +427
if not schema and not schema_name and not config.include_all_schemas:
return False, "Schema is required for warehouse imports unless 'Include all schemas' is enabled."
Comment on lines +224 to +225
effective_schema = None if config.include_all_schemas else config.schema

Comment thread posthog/schema.py
)
label: str
name: str
type: Literal["switch"] = "switch"
Comment on lines +137 to +141
SourceFieldSwitchConfig(
name="include_all_schemas",
label="Include all schemas",
caption="Enable to discover and sync tables from all non-system schemas in the database",
),
@greptile-apps

greptile-apps Bot commented May 17, 2026

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
posthog/temporal/data_imports/sources/generated_configs.py:631
Using raw `bool` as the converter is incorrect here. Every other boolean config field in this file uses `config.str_to_bool`. The problem is that `bool("false")` returns `True` — any non-empty string is truthy in Python — so if the config value arrives as the string `"false"`, the flag will be incorrectly treated as enabled and all schemas will always be fetched.

```suggestion
    include_all_schemas: bool = config.value(converter=config.str_to_bool, default=False)
```
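
The truthiness problem is easy to reproduce in isolation. In the sketch below, `str_to_bool_sketch` is a hypothetical converter written for illustration; the real `config.str_to_bool` may accept a different set of spellings.

```python
def str_to_bool_sketch(value) -> bool:
    """Hypothetical converter; the real config.str_to_bool may differ."""
    if isinstance(value, bool):
        return value
    # Assumption: these spellings count as true; everything else is false.
    return str(value).strip().lower() in {"true", "1", "yes"}


print(bool("false"))                # True: any non-empty string is truthy
print(str_to_bool_sketch("false"))  # False
print(str_to_bool_sketch("True"))   # True
```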

### Issue 2 of 3
products/data_warehouse/frontend/shared/components/forms/SourceForm.tsx:286-289
The `||` fallback is incorrect for a boolean switch. When the user explicitly turns the toggle **off** (`value === false`), the expression `value || lastValue?.[field.name]` is falsy and falls through to `lastValue`, which may be `true` — causing the UI to show the toggle as on even though the user just switched it off. Use `??` (nullish coalescing) so that only `undefined` / `null` trigger the fallback, not a legitimate `false`.

```suggestion
                    <LemonSwitch
                        checked={value ?? lastValue?.[field.name] ?? false}
                        onChange={(checked) => onChange(checked)}
                    />
```
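
The same pitfall can be shown in Python, where `or` falls through on any falsy value much like `||`, and an explicit `is None` check behaves like `??`. This is only an analogy with hypothetical helper names, not code from the PR.

```python
from typing import Optional


def checked_with_or(value: Optional[bool], last_value: Optional[bool]) -> bool:
    # Mirrors `value || lastValue || false`: an explicit False is treated as missing.
    return value or last_value or False


def checked_with_none_check(value: Optional[bool], last_value: Optional[bool]) -> bool:
    # Mirrors `value ?? lastValue ?? false`: only None triggers the fallback.
    if value is not None:
        return value
    if last_value is not None:
        return last_value
    return False
```

Calling both helpers with `value=False` and `last_value=True` shows the difference: the `or` version reports the switch as on, while the `None`-check version respects the explicit off state.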

### Issue 3 of 3
posthog/schema.py:4624-4633
`posthog/schema.py` is autogenerated by `pnpm run schema:build` and must not be modified by hand. Changes made directly here will be overwritten the next time the schema is regenerated. The `SourceFieldSwitchConfig` class should instead be defined in the source schema (e.g. `frontend/src/queries/schema/`) and regenerated via the build command.

Reviews (1): Last reviewed commit: "feat(data-warehouse): add multi-schema s..." | Re-trigger Greptile

Comment thread posthog/temporal/data_imports/sources/generated_configs.py Outdated
Comment thread posthog/schema.py
Comment on lines +4624 to +4633
class SourceFieldSwitchConfig(BaseModel):
    model_config = ConfigDict(
        extra="forbid",
    )
    label: str
    name: str
    type: Literal["switch"] = "switch"
    caption: str | None = None


P1 posthog/schema.py is autogenerated by pnpm run schema:build and must not be modified by hand. Changes made directly here will be overwritten the next time the schema is regenerated. The SourceFieldSwitchConfig class should instead be defined in the source schema (e.g. frontend/src/queries/schema/) and regenerated via the build command.

Rule Used: Do not manually modify the posthog/schema.py fil... (source)

Learned From
PostHog/posthog#32620


- Use config.str_to_bool instead of raw bool converter
- Use nullish coalescing (??) instead of || for boolean fallback
- Add SourceFieldSwitchConfig to TypeScript schema
@greptile-apps

greptile-apps Bot commented May 17, 2026

Comments Outside Diff (1)

  1. posthog/temporal/data_imports/sources/postgres/source.py, line 516-518 (link)

    P1 source_for_pipeline ignores include_all_schemas when config.schema is non-empty

    config.schema or source_schema or "public" means that if a user previously set a schema (e.g. "finance") and then enables include_all_schemas, config.schema is still truthy and will override the per-table source_schema stored in metadata. So a table discovered in "analytics" would be synced as "finance"."<table>" — causing a table-not-found error or fetching wrong data. When include_all_schemas=True, the authoritative schema for each table is source_schema (from metadata) and config.schema should not take precedence over it.

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
posthog/temporal/data_imports/sources/postgres/source.py:516-518
**`source_for_pipeline` ignores `include_all_schemas` when `config.schema` is non-empty**

`config.schema or source_schema or "public"` means that if a user previously set a schema (e.g. `"finance"`) and then enables `include_all_schemas`, `config.schema` is still truthy and will override the per-table `source_schema` stored in metadata. So a table discovered in `"analytics"` would be synced as `"finance"."<table>"` — causing a table-not-found error or fetching wrong data. When `include_all_schemas=True`, the authoritative schema for each table is `source_schema` (from metadata) and `config.schema` should not take precedence over it.

### Issue 2 of 2
pr-body.md:1
**Development artifacts accidentally committed**

`pr-body.md` and `issue-comment.md` in the repository root are draft text files used to compose the GitHub PR description and issue comment — they are not source code and should not be committed. Please remove both files.

Reviews (2): Last reviewed commit: "fix: address bot review comments" | Re-trigger Greptile

Comment thread pr-body.md Outdated
- Use source_schema when include_all_schemas is enabled in source_for_pipeline
- Remove accidentally committed draft files (pr-body.md, issue-comment.md)
@greptile-apps

greptile-apps Bot commented May 17, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
posthog/temporal/data_imports/sources/postgres/source.py:519
**`schema=None` passed to `postgres_source()` when metadata is incomplete**

When `include_all_schemas=True` but `source_schema` is `None` (e.g., an existing `ExternalDataSchema` that was created before this feature was enabled, on a source where a user later toggles the flag on), `schema=source_schema` evaluates to `None`. `postgres_source()` is typed `schema: str` and passes the value directly into `_get_table(cursor, schema, ...)`, which builds SQL via `sql.Identifier(schema)` — passing `None` there will raise a runtime error.

Unlike the `include_all_schemas=False` branch, there is no `or "public"` fallback, so tables with missing `source_schema` metadata silently crash instead of degrading gracefully. Consider `(source_schema or "public") if config.include_all_schemas else (config.schema or source_schema or "public")` as a minimal guard.
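
The guard expression suggested above can be wrapped in a small function to make the precedence explicit. `resolve_sync_schema` is a hypothetical helper name used for illustration; only the expression itself comes from the review comment.

```python
from typing import Optional


def resolve_sync_schema(
    include_all_schemas: bool,
    config_schema: Optional[str],
    source_schema: Optional[str],
) -> str:
    """Hypothetical helper wrapping the guard expression suggested above."""
    if include_all_schemas:
        # Per-table metadata is authoritative; "public" covers legacy rows without it.
        return source_schema or "public"
    # Single-schema mode: the configured schema wins, then metadata, then "public".
    return config_schema or source_schema or "public"
```

With the flag on, a stale `config_schema` such as `"finance"` no longer overrides a table discovered in `"analytics"`, and a missing `source_schema` degrades to `"public"` instead of passing `None` downstream.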

Reviews (3): Last reviewed commit: "fix: address remaining bot review commen..." | Re-trigger Greptile

Comment thread posthog/temporal/data_imports/sources/postgres/source.py Outdated
When include_all_schemas=True but source_schema is None (legacy metadata),
fall back to 'public' to avoid passing None to postgres_source().
@RishiiGamer2201 RishiiGamer2201 requested a review from Copilot May 17, 2026 11:21
@greptile-apps

greptile-apps Bot commented May 17, 2026

Reviews (4): Last reviewed commit: "fix: add fallback to public when source_..." | Re-trigger Greptile


Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment on lines +282 to +287
label={field.label}
help={field.caption ? <LemonMarkdown className="text-xs">{field.caption}</LemonMarkdown> : undefined}
>
{({ value, onChange }) => (
<LemonSwitch
checked={value ?? lastValue?.[field.name] ?? false}
Comment on lines +516 to +520
# When include_all_schemas is enabled, use source_schema from metadata (per-table).
# Fall back to "public" if source_schema is missing (legacy metadata).
# Otherwise config.schema wins so warehouse-mode renames flow through without
# rewriting schema_metadata. Falls back to source_schema for direct mode.
schema=(source_schema or "public") if config.include_all_schemas else (config.schema or source_schema or "public"),
@@ -415,8 +423,8 @@ def validate_credentials_for_access_method(
) -> tuple[bool, str | None]:
if access_method != "direct":
schema = config.schema.strip() if isinstance(config.schema, str) else ""
port: int = config.value(converter=int)
connection_string: str | None = None
schema: str | None = None
include_all_schemas: bool = config.value(converter=config.str_to_bool, default=False)
Comment on lines +137 to +141
SourceFieldSwitchConfig(
name="include_all_schemas",
label="Include all schemas",
caption="Enable to discover and sync tables from all non-system schemas in the database",
),
Development

Successfully merging this pull request may close these issues.

Multi-Schema Postgres DB

2 participants