
Conversation

@BesikiML (Contributor) commented Jan 7, 2026

🐛 Problem

Reconciliation fails on Databricks serverless compute with:

    [NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute

🔍 Root Cause

The reconciliation process uses .cache() as a performance optimization, but serverless compute does not support DataFrame caching operations.

✅ Solution

Implemented serverless detection and a conditional caching strategy (see the sketch below):

Changes Made

  1. Added a serverless detection method
    • New _is_serverless() method checks for the clusterNodeType config.
    • Classic clusters: the config exists → returns False.
    • Serverless: the lookup throws CONFIG_NOT_AVAILABLE → returns True.
  2. Conditional caching logic
    • Classic clusters: use .cache() for performance (existing behavior).
    • Serverless: skip caching to avoid runtime errors.

Technical Details

Detection method:

    node_type = self._spark.conf.get("spark.databricks.clusterUsageTags.clusterNodeType")

✅ Classic: returns the node type (e.g., i3.2xlarge)
❌ Serverless: throws AnalysisException with CONFIG_NOT_AVAILABLE
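
As a rough illustration of the approach described above (a minimal sketch, not the merged code; the exception handling is simplified here, since later commits in this PR narrow it to specific exception types for the linter):

```python
from pyspark.errors import AnalysisException  # PySpark >= 3.4
from pyspark.sql import DataFrame, SparkSession


def _is_serverless(spark: SparkSession) -> bool:
    """True when the clusterNodeType config is unavailable (serverless)."""
    try:
        spark.conf.get("spark.databricks.clusterUsageTags.clusterNodeType")
        return False  # classic cluster: the config exists
    except AnalysisException as e:
        # Serverless raises CONFIG_NOT_AVAILABLE for this config key.
        return "CONFIG_NOT_AVAILABLE" in str(e)


def _cache_if_supported(spark: SparkSession, df: DataFrame) -> DataFrame:
    """Cache on classic clusters; skip on serverless, where PERSIST fails."""
    return df if _is_serverless(spark) else df.cache()
```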

Fixed issue: #1438

Tests

  • manually tested
  • added unit tests
  • added integration tests

Use Unity Catalog volumes instead of .cache() for serverless; auto-detects the compute type. Fixes [NOT_SUPPORTED_WITH_SERVERLESS].
@BesikiML requested a review from m-abulazm January 7, 2026 03:16
@BesikiML requested a review from a team as a code owner January 7, 2026 03:16
@BesikiML linked an issue Jan 7, 2026 that may be closed by this pull request
@BesikiML self-assigned this Jan 7, 2026
github-actions bot commented Jan 7, 2026

✅ 130/130 passed, 8 flaky, 5 skipped, 9m40s total

Flaky tests:

  • 🤪 test_installs_and_runs_local_bladebridge (20.228s)
  • 🤪 test_installs_and_runs_pypi_bladebridge (25.318s)
  • 🤪 test_transpiles_informatica_to_sparksql_non_interactive[True] (16.263s)
  • 🤪 test_transpiles_informatica_to_sparksql (16.575s)
  • 🤪 test_transpile_teradata_sql (18.15s)
  • 🤪 test_transpiles_informatica_to_sparksql_non_interactive[False] (3.719s)
  • 🤪 test_transpile_teradata_sql_non_interactive[False] (5.424s)
  • 🤪 test_transpile_teradata_sql_non_interactive[True] (13.173s)

Running from acceptance #3493

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 0% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.78%. Comparing base (d212ba7) to head (df8ce7b).

Files with missing lines                                Patch %   Lines
...abricks/labs/lakebridge/reconcile/recon_capture.py     0.00%   21 Missing ⚠️
...bricks/labs/lakebridge/reconcile/reconciliation.py     0.00%    3 Missing ⚠️
...rc/databricks/labs/lakebridge/reconcile/compare.py     0.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2218      +/-   ##
==========================================
- Coverage   63.95%   63.78%   -0.17%     
==========================================
  Files          99       99              
  Lines        8644     8666      +22     
  Branches      890      893       +3     
==========================================
  Hits         5528     5528              
- Misses       2944     2966      +22     
  Partials      172      172              

☔ View full report in Codecov by Sentry.

• Use specific exception types instead of a broad Exception catch to satisfy CI linter rules; add a cluster ID check for improved detection.
• Avoid CONFIG_NOT_AVAILABLE exceptions by fetching all configs at once; this distinguishes serverless (CONFIG_NOT_AVAILABLE) from classic clusters. Passes all linter checks.
@m-abulazm (Contributor) left a comment

Good job. We should always add type annotations to any new code, especially to a public method. Once this comment is addressed we can merge this down.

raise ReadAndWriteWithVolumeException(message) from e


def classify_spark_runtime(spark):
@m-abulazm (Contributor)

Please add a type annotation.

Suggested change:

- def classify_spark_runtime(spark):
+ SparkRuntimeType = Literal["DATABRICKS_SERVERLESS", "CLASSIC", .....]
+ def classify_spark_runtime(spark: SparkSession) -> SparkRuntimeType:

@BesikiML (Author)

done



def test_classify_spark_runtime(spark):
    assert classify_spark_runtime(spark) != "DATABRICKS_SERVERLESS"
@m-abulazm (Contributor)

Please add a comment noting that our sandbox environment uses Spark Connect, so if this changes in the future the comment explains why the test was originally written this way.

@BesikiML (Author)

done
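
The resulting test presumably looks something like this (a sketch; the exact comment wording is illustrative):

```python
def test_classify_spark_runtime(spark):
    # Our sandbox environment uses Spark Connect, so the session here is
    # never classified as serverless. If the sandbox setup changes in the
    # future, this explains why the assertion was originally written this way.
    assert classify_spark_runtime(spark) != "DATABRICKS_SERVERLESS"
```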

  # Write the joined df to volume path
- joined_volume_df = ReconIntermediatePersist(spark, path).write_and_read_unmatched_df_with_volumes(joined_df).cache()
+ joined_volume_df = ReconIntermediatePersist(spark, path).write_and_read_unmatched_df_with_volumes(joined_df)
+ cluster_type = classify_spark_runtime(spark)
@m-abulazm (Contributor)

This method is used inside a loop and we should not recalculate it every time. I would either cache this value or calculate it at the caller before any loops and pass it in.

@BesikiML (Author)

We can create ReconIntermediatePersist(spark, path) before the loop and pass it in, but since joined_df changes with the loop parameters, how can we cache it ahead of time?

@m-abulazm (Contributor)

I do think a ReconIntermediatePersist#cache_if_supported() method is a good idea.
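
A hypothetical sketch of the suggested method, assuming classify_spark_runtime from this PR is in scope and that ReconIntermediatePersist already receives the SparkSession (only the relevant parts are shown):

```python
from pyspark.sql import DataFrame, SparkSession


class ReconIntermediatePersist:
    def __init__(self, spark: SparkSession, path: str) -> None:
        self._spark = spark
        self._path = path
        # Classify once at construction so calls inside loops stay cheap.
        self._runtime = classify_spark_runtime(spark)

    def cache_if_supported(self, df: DataFrame) -> DataFrame:
        # No-op on serverless, where .cache() raises NOT_SUPPORTED_WITH_SERVERLESS.
        return df if self._runtime == "DATABRICKS_SERVERLESS" else df.cache()
```

This would also address the per-loop recalculation concern: the runtime is classified once, and each iteration only pays for a cheap string comparison.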

@m-abulazm (Contributor) commented Jan 21, 2026

Another question, though: why are we caching here after writing to storage? What is the improvement? Otherwise we can just remove it from here.

@BesikiML (Author)

This (cache) approach was already there; I didn't add it.

@BesikiML (Author)

done

SparkRuntimeType = Literal["DATABRICKS_SERVERLESS", "CLASSIC", "SPARK_CONNECT", "NO_JVM_UNKNOWN"]


def classify_spark_runtime(spark) -> SparkRuntimeType:
@m-abulazm (Contributor)

Suggested change:

- def classify_spark_runtime(spark) -> SparkRuntimeType:
+ def classify_spark_runtime(spark: SparkSession) -> SparkRuntimeType:

@BesikiML (Author)

done
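
Putting the thread together, the final classifier plausibly looks something like this. This is a sketch under stated assumptions, not the merged code: the clusterId config key and the exact branch mapping are guesses based on the commit messages above ("cluster ID check", "fetching all configs at once"):

```python
from typing import Literal

from pyspark.sql import SparkSession

SparkRuntimeType = Literal["DATABRICKS_SERVERLESS", "CLASSIC", "SPARK_CONNECT", "NO_JVM_UNKNOWN"]

_NODE_TYPE_KEY = "spark.databricks.clusterUsageTags.clusterNodeType"
_CLUSTER_ID_KEY = "spark.databricks.clusterUsageTags.clusterId"  # assumed key


def classify_spark_runtime(spark: SparkSession) -> SparkRuntimeType:
    try:
        # Classic clusters have a JVM-backed SparkContext; fetching all
        # configs in one call avoids per-key CONFIG_NOT_AVAILABLE errors.
        confs = dict(spark.sparkContext.getConf().getAll())
        if _NODE_TYPE_KEY in confs:
            return "CLASSIC"
        return "NO_JVM_UNKNOWN"  # fallback placement is a guess
    except Exception:  # the PR narrows this to specific exception types
        pass
    # No JVM gateway: Databricks serverless or a plain Spark Connect session;
    # the cluster ID check is assumed to tell them apart.
    try:
        spark.conf.get(_CLUSTER_ID_KEY)
        return "DATABRICKS_SERVERLESS"
    except Exception:
        return "SPARK_CONNECT"
```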


Development

Successfully merging this pull request may close these issues:

  • [Feature]: Remorph Reconcile fails to run on serverless compute

3 participants