diff --git a/_toolbox/PR_DESCRIPTION.md b/_toolbox/PR_DESCRIPTION.md index c274d458..6d3ee776 100644 --- a/_toolbox/PR_DESCRIPTION.md +++ b/_toolbox/PR_DESCRIPTION.md @@ -72,7 +72,7 @@ I tried contributing some of these fixes back to `microsoft/dbt-fabric` first, b - [#01 — Six `__exit__` methods return `True` (silent exception swallowing)](https://github.com/sdebruyn/dbt-fabric/blob/to-toolbox/_toolbox/upstream_issues/dbt-fabricspark/01-exit-methods-return-true-swallow-exceptions.md) - [microsoft/dbt-fabricspark#192 — Hardcoded `expires_on = 1845972874` (year 2028) in `int_tests` auth path bypasses all token refresh](https://github.com/microsoft/dbt-fabricspark/issues/192) -- [#04 — `_getLivySQL`: `re.DOTALL` passed as positional `count` caps comment-stripping at 16](https://github.com/sdebruyn/dbt-fabric/blob/to-toolbox/_toolbox/upstream_issues/dbt-fabricspark/04-getlivysql-regex-bug.md) +- [microsoft/dbt-fabricspark#194 — `_getLivySQL`: `re.DOTALL` passed as positional `count` caps comment-stripping at 16](https://github.com/microsoft/dbt-fabricspark/issues/194) (draft fix: [microsoft/dbt-fabricspark#195](https://github.com/microsoft/dbt-fabricspark/pull/195)) - [#06 — Livy session cleanup bypasses dbt's `close()` lifecycle and uses `atexit`](https://github.com/sdebruyn/dbt-fabric/blob/to-toolbox/_toolbox/upstream_issues/dbt-fabricspark/06-atexit-handlers-leak-livy-sessions.md) - [#07 — Dead Thrift exception handler from dbt-spark ancestry](https://github.com/sdebruyn/dbt-fabric/blob/to-toolbox/_toolbox/upstream_issues/dbt-fabricspark/07-thrift-dead-code.md) - [#08 — Proposal: inherit from `dbt-spark` instead of being a standalone `SQLAdapter`](https://github.com/sdebruyn/dbt-fabric/blob/to-toolbox/_toolbox/upstream_issues/dbt-fabricspark/08-inherit-from-dbt-spark.md) diff --git a/_toolbox/upstream_issues/INDEX.md b/_toolbox/upstream_issues/INDEX.md index 5f25da05..c1edc220 100644 --- a/_toolbox/upstream_issues/INDEX.md +++ b/_toolbox/upstream_issues/INDEX.md @@ -43,7 +43,7 @@ Each issue body carries a `> [ ] Validated by maintainer` checkbox at the top. S - [x] 01 — Six `__exit__` methods return `True` - [x] 02 — Hardcoded 2028 token expiry → filed as [microsoft/dbt-fabricspark#192](https://github.com/microsoft/dbt-fabricspark/issues/192) -- [x] 04 — `_getLivySQL` regex bug +- [x] 04 — `_getLivySQL` regex bug → filed as [microsoft/dbt-fabricspark#194](https://github.com/microsoft/dbt-fabricspark/issues/194), draft PR [microsoft/dbt-fabricspark#195](https://github.com/microsoft/dbt-fabricspark/pull/195) - [x] 06 — Livy cleanup bypasses dbt's `close()` lifecycle - [x] 07 — Thrift dead code from dbt-spark ancestry - [x] 08 — Proposal: inherit from `dbt-spark` @@ -111,7 +111,7 @@ These are debatable and may invite long discussion. File only if the second batc |---|---|---|---| | 01 | Six `__exit__` methods return `True` — silent exception swallowing | high | bug | | 02 | [Hardcoded 2028 token expiry bypasses refresh logic](https://github.com/microsoft/dbt-fabricspark/issues/192) | high | bug, security | -| 04 | `_getLivySQL` regex bug: `re.DOTALL` passed as positional `count` | medium | bug | +| 04 | [`_getLivySQL` regex bug: `re.DOTALL` passed as positional `count`](https://github.com/microsoft/dbt-fabricspark/issues/194) | medium | bug | | 06 | `atexit` handlers leak Livy sessions on hard kill / OOM | medium | bug | | 07 | Thrift dead code from dbt-spark ancestry | low | tech-debt | | 09 | `botocore`/`boto3` DEBUG logging at import time | low | tech-debt | diff --git a/_toolbox/upstream_issues/dbt-fabricspark/04-getlivysql-regex-bug.md b/_toolbox/upstream_issues/dbt-fabricspark/04-getlivysql-regex-bug.md deleted file mode 100644 index e1827da9..00000000 --- a/_toolbox/upstream_issues/dbt-fabricspark/04-getlivysql-regex-bug.md +++ /dev/null @@ -1,45 +0,0 @@ -# `_getLivySQL`: `re.DOTALL` passed as positional `count` arg — comment-stripping silently capped at 16 replacements per file - -**Repo:** `microsoft/dbt-fabricspark` -**Labels (suggested):** `bug`, `priority/medium` - -> [x] **Validated by maintainer** — code refs, line numbers, and claims confirmed against upstream HEAD - -> **Internal note (strip before filing):** Submittable as a PR — swap `re.DOTALL` from positional to `flags=re.DOTALL` in both copies; extracting a shared `_strip_sql_comments(sql)` helper is a small bonus to keep the bug from reoccurring. Consider opening with the issue *and* a draft PR linked from it. - -## Summary - -`_getLivySQL` passes `re.DOTALL` as the positional `count` argument to `re.sub` instead of as `flags=re.DOTALL`. `re.DOTALL` is an integer enum value (16). `re.sub`'s positional signature is `re.sub(pattern, repl, string, count=0, flags=0)`. Passing `re.DOTALL` positionally sets `count=16`, capping comment-stripping to at most 16 replacements per file. - -## Evidence (HEAD [`d315a56`](https://github.com/microsoft/dbt-fabricspark/tree/d315a56)) - -`_getLivySQL` is duplicated verbatim between [`src/dbt/adapters/fabricspark/singleton_livy.py#L488`](https://github.com/microsoft/dbt-fabricspark/blob/d315a56/src/dbt/adapters/fabricspark/singleton_livy.py#L488) and [`src/dbt/adapters/fabricspark/concurrent_livy.py#L555`](https://github.com/microsoft/dbt-fabricspark/blob/d315a56/src/dbt/adapters/fabricspark/concurrent_livy.py#L555). Both copies call `re.sub(pattern, '', sql, re.DOTALL)` (positional) when they meant `re.sub(pattern, '', sql, flags=re.DOTALL)`. - -## User impact - -- Model SQL files with more than 16 `/* */` comment blocks have some comments left in the SQL submitted to Spark. -- Most of the time Spark's parser shrugs them off, but commented-out SQL can introduce surprises — particularly when the leftover comment contains characters that interact with the Livy/Spark parser, or when the comment contains SQL keywords that the parser must lexically skip past. -- Without `re.DOTALL` set, comment blocks containing newlines are not matched at all by the multi-line comment pattern — so the bug is doubly broken: count is set to 16, and the DOTALL flag that the code intended to set is not set. - -## Reproduction - -```python -import re -pattern = r"/\*.*?\*/" -sql = ("/* a */\n" * 20) + "select 1" -# Buggy call: -result_buggy = re.sub(pattern, "", sql, re.DOTALL) -# count=16 is set; flags=0 (no DOTALL); .*? does not cross newlines anyway -# but the count cap kicks in regardless. -``` - -## Suggested fix - -Use keyword args: - -```python -sql = re.sub(pattern, "", sql, flags=re.DOTALL) -``` - -Fix both copies ([`singleton_livy.py#L488`](https://github.com/microsoft/dbt-fabricspark/blob/d315a56/src/dbt/adapters/fabricspark/singleton_livy.py#L488) and [`concurrent_livy.py#L555`](https://github.com/microsoft/dbt-fabricspark/blob/d315a56/src/dbt/adapters/fabricspark/concurrent_livy.py#L555)). Better: extract a single `_strip_sql_comments(sql)` helper and call it from both files so the bug can't reoccur in only one place. -