Adding python schema check#50
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds SchemaCheck: a Liquibase custom check implemented in Python that parses generated SQL to detect unqualified object references, integrates with Liquibase to evaluate changesets and fail on violations, and documents the check in the repository README. ChangesSchemaCheck Custom Check Implementation
sequenceDiagram
participant Liquibase
participant SchemaCheckScript
participant ValidationLogic
Liquibase->>SchemaCheckScript: invoke custom check
SchemaCheckScript->>SchemaCheckScript: fetch current changeset
loop per change
SchemaCheckScript->>SchemaCheckScript: skip if loaddatachange
SchemaCheckScript->>SchemaCheckScript: generate raw SQL
SchemaCheckScript->>ValidationLogic: check_schema_qualification(sql)
ValidationLogic-->>SchemaCheckScript: violations list
alt violation found
SchemaCheckScript->>Liquibase: update status / log warning
SchemaCheckScript->>SchemaCheckScript: exit(1)
else no violation
SchemaCheckScript->>SchemaCheckScript: continue
end
end
SchemaCheckScript->>Liquibase: return False
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
Python/Scripts/Any/schema_check.py (1)
59-62: ⚡ Quick win
IdentifierListnot handled — comma-join syntax produces silent false negatives.
FROM t1, t2gives anIdentifierListas the next token. It is not anIdentifier, so theisinstancebranch misses it; itsttypeisNone, so theelifwraps the whole list into a syntheticIdentifier([IdentifierList])whoseget_real_name()is unreliable. Both table references are silently skipped.♻️ Proposed fix — handle `IdentifierList`
+from sqlparse.sql import Identifier, IdentifierList, Parenthesis ... if isinstance(next_token, Identifier): + first = next_token.token_first(skip_ws=True, skip_cm=True) + if isinstance(first, Parenthesis): + continue identifiers.append(next_token) + elif isinstance(next_token, IdentifierList): + for ident in next_token.get_identifiers(): + if isinstance(ident, Identifier): + identifiers.append(ident) elif next_token.ttype in (Name, None):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Python/Scripts/Any/schema_check.py` around lines 59 - 62, The code currently treats tokens with ttype None as Identifiers and wraps IdentifierList objects into a synthetic Identifier, causing multi-table comma syntax to be missed; update the token handling in schema_check.py around the next_token logic to explicitly check for isinstance(next_token, IdentifierList) and, when true, iterate its .get_identifiers() (or children) to extend the identifiers list with each real Identifier (so their get_real_name() works), rather than appending Identifier([next_token]); keep the existing Identifier branch for single Identifiers and preserve the identifiers list usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Python/Scripts/Any/schema_check.py`:
- Around line 76-77: The current check skips CTEs because the stmt_str regex
only accepts statements starting with SELECT/INSERT/...; update that regex to
include WITH so CTE-led statements are inspected, and implement CTE alias
tracking: when processing stmt_str, parse leading CTE definitions (e.g., match
"WITH\s+([^(\s,]+)\s+AS" repeatedly) to collect alias names, then when your
unqualified-table detection logic (the routine that scans FROM/JOIN table
identifiers in the same scope as stmt_str) runs, exclude any table references
that match the collected CTE aliases to avoid false positives; refer to the
existing re.match check on stmt_str and the variable holding the statement text
(stmt_str) when adding the CTE parsing and exclusion logic.
- Around line 111-116: The code only reports violations[0] then exits; change it
to report all violations returned by check_schema_qualification by joining or
iterating over the violations list and including every unqualified object in the
log and the liquibase_status.message before exiting; update the block that
references violations, liquibase_logger, and liquibase_status (the same scope
where violations is set by check_schema_qualification) so the warning message
contains all items (e.g., comma-separated or multiline) and still sets
liquibase_status.fired = True and sys.exit(1).
- Around line 57-58: The unqualified-alias checker is flagging subquery-derived
aliases because sqlparse wraps "FROM (SELECT ...) AS t" as an Identifier whose
first meaningful child is a Parenthesis; modify the handling where next_token is
an Identifier to detect the identifier's first meaningful child (e.g., inspect
identifier.tokens or use identifier.get_real_name()/first_token) and skip
further qualification checks if that child is a Parenthesis (i.e., it's a
subquery-derived table), so the existing get_parent_name() / alias dot checks
don't run for subquery aliases like "t".
---
Nitpick comments:
In `@Python/Scripts/Any/schema_check.py`:
- Around line 59-62: The code currently treats tokens with ttype None as
Identifiers and wraps IdentifierList objects into a synthetic Identifier,
causing multi-table comma syntax to be missed; update the token handling in
schema_check.py around the next_token logic to explicitly check for
isinstance(next_token, IdentifierList) and, when true, iterate its
.get_identifiers() (or children) to extend the identifiers list with each real
Identifier (so their get_real_name() works), rather than appending
Identifier([next_token]); keep the existing Identifier branch for single
Identifiers and preserve the identifiers list usage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 567e64cd-e72c-4f74-a783-2750e473d2d7
📒 Files selected for processing (2)
Python/Scripts/Any/README.mdPython/Scripts/Any/schema_check.py
| if not re.match(r"^\s*(SELECT|INSERT|UPDATE|DELETE|EXEC|EXECUTE|CREATE|ALTER)\b", stmt_str, re.IGNORECASE): | ||
| continue |
There was a problem hiding this comment.
CTE queries (WITH ... AS (...)) are silently skipped — no schema check applied.
The regex requires statements to start with SELECT/INSERT/etc., so any WITH-prefixed CTE entirely bypasses the check. Unqualified table references inside the CTE body (e.g., WITH cte AS (SELECT * FROM UnqualifiedTable)) go unreported.
Adding WITH to the regex is trivial; the harder part is tracking the CTE alias names (cte, etc.) and excluding them from violation detection in the outer query's FROM cte reference to avoid introducing false positives.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Python/Scripts/Any/schema_check.py` around lines 76 - 77, The current check
skips CTEs because the stmt_str regex only accepts statements starting with
SELECT/INSERT/...; update that regex to include WITH so CTE-led statements are
inspected, and implement CTE alias tracking: when processing stmt_str, parse
leading CTE definitions (e.g., match "WITH\s+([^(\s,]+)\s+AS" repeatedly) to
collect alias names, then when your unqualified-table detection logic (the
routine that scans FROM/JOIN table identifiers in the same scope as stmt_str)
runs, exclude any table references that match the collected CTE aliases to avoid
false positives; refer to the existing re.match check on stmt_str and the
variable holding the statement text (stmt_str) when adding the CTE parsing and
exclusion logic.
| if violations: | ||
| msg = f"Missing schema for object '{violations[0]}' in SQL statement." | ||
| liquibase_logger.warning(msg) | ||
| liquibase_status.fired = True | ||
| liquibase_status.message = msg | ||
| sys.exit(1) |
There was a problem hiding this comment.
Only the first violation per changeset is reported; remaining violations are silently dropped.
Even though check_schema_qualification returns all unqualified names, the outer loop breaks on the first violating change using violations[0]. A developer with three unqualified tables must fix-and-re-run three times to discover all of them.
💡 Proposed improvement — report all violations before exiting
- if violations:
- msg = f"Missing schema for object '{violations[0]}' in SQL statement."
- liquibase_logger.warning(msg)
- liquibase_status.fired = True
- liquibase_status.message = msg
- sys.exit(1)
+ for v in violations:
+ liquibase_logger.warning(f"Missing schema for object '{v}' in SQL statement.")
+ if violations:
+ msg = f"Missing schema qualification: {', '.join(violations)}"
+ liquibase_status.fired = True
+ liquibase_status.message = msg
+ sys.exit(1)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Python/Scripts/Any/schema_check.py` around lines 111 - 116, The code only
reports violations[0] then exits; change it to report all violations returned by
check_schema_qualification by joining or iterating over the violations list and
including every unqualified object in the log and the liquibase_status.message
before exiting; update the block that references violations, liquibase_logger,
and liquibase_status (the same scope where violations is set by
check_schema_qualification) so the warning message contains all items (e.g.,
comma-separated or multiline) and still sets liquibase_status.fired = True and
sys.exit(1).
Updated schema_check.py as it was not capturing eg.
UPDATE DirContact SET DirCompanyId = @PodId…