Refactor Spark 2 Reader to avoid looking up the snapshot schema separately #1
base: schema-for-snapshot
Conversation
According to Edwin Choi, the only safe way to get the schema for a snapshot is to scan the metadata files and find the one whose current-snapshot-id matches the target snapshot id.
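To make that concrete, here is a minimal sketch of the lookup, assuming direct access to the table's TableOperations; the helper name and error handling are illustrative only, not code from this PR.

```java
// Hypothetical helper illustrating the metadata-file scan described above.
// It checks the current metadata and then walks the previous metadata files,
// returning the schema of the metadata version whose current snapshot matches
// the target snapshot id.
import org.apache.iceberg.Schema;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.TableOperations;

class SnapshotSchemaLookup {
  private SnapshotSchemaLookup() {
  }

  static Schema schemaAtSnapshot(TableOperations ops, long targetSnapshotId) {
    TableMetadata current = ops.current();
    if (current.currentSnapshot() != null &&
        current.currentSnapshot().snapshotId() == targetSnapshotId) {
      return current.schema();
    }

    // scan older metadata files until one has the target as its current snapshot
    for (TableMetadata.MetadataLogEntry entry : current.previousFiles()) {
      TableMetadata previous = TableMetadataParser.read(ops.io(), entry.file());
      if (previous.currentSnapshot() != null &&
          previous.currentSnapshot().snapshotId() == targetSnapshotId) {
        return previous.schema();
      }
    }

    throw new IllegalArgumentException("Cannot find a metadata file for snapshot: " + targetSnapshotId);
  }
}
```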
The changes are mostly in spark3. They are necessitated by the catalog support introduced in apache#1783. Because the spark3 IcebergSource now implements SupportsCatalogOptions, DataFrameReader#load no longer calls IcebergSource#getTable but calls SparkCatalog#loadTable directly. For the SparkTable returned by SparkCatalog#loadTable(Identifier) to be aware of the snapshot, the snapshot information needs to be carried by the Identifier. For this reason, we introduce a SnapshotAwareIdentifier interface extending Identifier. Since SupportsCatalogOptions does not allow a schema to be requested, SparkTable no longer needs a requestedSchema field, so some dead code is removed from it.
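As a rough sketch of what such an interface could look like (the accessor names are assumptions for illustration, not necessarily the exact signatures in this PR):

```java
// Sketch only: an Identifier that also carries the requested snapshot selection,
// so SparkCatalog#loadTable(Identifier) can return a SparkTable pinned to it.
import org.apache.spark.sql.connector.catalog.Identifier;

public interface SnapshotAwareIdentifier extends Identifier {
  // snapshot-id read option, or null if not set
  Long snapshotId();

  // as-of-timestamp read option in milliseconds, or null if not set
  Long asOfTimestamp();
}
```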
Rebased on master. Use constants from SparkReadOptions. Implement snapshotSchema() in SparkFilesScan as it extends SparkBatchScan.
Avoid introducing new methods to BaseTable; add helper methods to SnapshotUtil instead. Move the recovery of the schema from previous metadata files, for snapshots that do not have an associated schema id, to a new PR. Remove the snapshotSchema method from SparkBatchScan and its subclasses, as it is not needed. Adjust the schema in BaseTableScan when useSnapshot is called.
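For illustration only, a SnapshotUtil-style helper could look roughly like the following; the method name, the reliance on the snapshot's schema id, and the fallback to the current table schema are assumptions, not the exact code in this PR:

```java
// Sketch of a SnapshotUtil-style helper: resolve the schema id recorded on a
// snapshot back to a Schema, falling back to the current table schema when the
// snapshot has no schema id (recovering those schemas from previous metadata
// files is deferred to a follow-up PR, as noted above).
import org.apache.iceberg.Schema;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

class SnapshotSchemas {
  private SnapshotSchemas() {
  }

  static Schema schemaFor(Table table, long snapshotId) {
    Snapshot snapshot = table.snapshot(snapshotId);
    Integer schemaId = snapshot != null ? snapshot.schemaId() : null;
    if (schemaId != null) {
      return table.schemas().get(schemaId);
    }
    return table.schema();
  }
}
```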
Use the existing CatalogAndIdentifier and swap out the Identifier for a snapshot-aware TableIdentifier if snapshotId or asOfTimestamp is set.
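A minimal sketch of that swap, assuming a hypothetical SnapshotAwareTableIdentifier wrapper and plain string option keys (the PR itself uses the constants from SparkReadOptions):

```java
// Sketch only: after CatalogAndIdentifier resolves the catalog and identifier,
// re-wrap the identifier in a snapshot-aware variant when snapshot-id or
// as-of-timestamp is present in the read options, so the selection reaches
// SparkCatalog#loadTable. Names here are illustrative.
import java.util.Map;
import org.apache.spark.sql.connector.catalog.Identifier;

final class SnapshotAwareTableIdentifier implements Identifier {
  private final String[] namespace;
  private final String name;
  private final Long snapshotId;
  private final Long asOfTimestamp;

  SnapshotAwareTableIdentifier(String[] namespace, String name, Long snapshotId, Long asOfTimestamp) {
    this.namespace = namespace;
    this.name = name;
    this.snapshotId = snapshotId;
    this.asOfTimestamp = asOfTimestamp;
  }

  static Identifier wrapIfNeeded(Identifier resolved, Map<String, String> options) {
    Long snapshotId = options.containsKey("snapshot-id")
        ? Long.valueOf(options.get("snapshot-id")) : null;
    Long asOfTimestamp = options.containsKey("as-of-timestamp")
        ? Long.valueOf(options.get("as-of-timestamp")) : null;

    if (snapshotId == null && asOfTimestamp == null) {
      // no snapshot selection requested; keep the resolved identifier as-is
      return resolved;
    }

    return new SnapshotAwareTableIdentifier(resolved.namespace(), resolved.name(), snapshotId, asOfTimestamp);
  }

  @Override
  public String[] namespace() {
    return namespace;
  }

  @Override
  public String name() {
    return name;
  }

  // accessors matching the SnapshotAwareIdentifier idea sketched earlier
  Long snapshotId() {
    return snapshotId;
  }

  Long asOfTimestamp() {
    return asOfTimestamp;
  }
}
```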
@rdblue there are 4 unit test failures in the spark2 module with this change:
this.startSnapshotId = options.get("start-snapshot-id").map(Long::parseLong).orElse(null);
this.endSnapshotId = options.get("end-snapshot-id").map(Long::parseLong).orElse(null);
if (snapshotId != null || asOfTimestamp != null) {
  if (startSnapshotId != null || endSnapshotId != null) {
    throw new IllegalArgumentException(
        "Cannot specify start-snapshot-id and end-snapshot-id to do incremental scan when either snapshot-id or " +
        "as-of-timestamp is specified");
  }
} else {
  if (startSnapshotId == null && endSnapshotId != null) {
    throw new IllegalArgumentException("Cannot only specify option end-snapshot-id to do incremental scan");
These checks are tested in TestDataSourceOptions.testIncrementalScanOptions. We need to retain them (somewhere).
These are checked in the table scan itself rather than in multiple places. I think we just need to update the tests to match the error messages from the scan.
Let us move this cleanup (of checks in multiple places) to a separate PR after apache#1508 is merged. It needs to be done for spark3 as well, and is really orthogonal to the main issue.
@rdblue thanks for the PR showing me what you had in mind. I won't merge it but will incorporate it into my next update of apache#1508.
Force-pushed from 1390ce5 to 0ef74ff.
Force-pushed from 290c4a0 to afec25c.
…flake-managed Iceberg tables (apache#6428)

* Initial read-only Snowflake Catalog implementation by @sfc-gh-mparmar (#1): a read-only Snowflake Catalog implementation built on top of the Snowflake JDBC driver, providing support for basic listing of namespaces, listing of tables, and loading/reads of tables. Auth options are passthrough to the JDBC driver.
  Co-authored-by: Maninder Parmar <[email protected]>
  Co-authored-by: Maninder Parmar <[email protected]>
  Co-authored-by: Dennis Huo <[email protected]>
* Add JdbcSnowflakeClientTest using mocks (apache#2); provides full coverage of JdbcSnowflakeClient and the entities' ResultSetHandler logic. Also update target Spark runtime versions to be included.
* Add test { useJUnitPlatform() } tuple to iceberg-snowflake for consistency and future interoperability with inheriting from abstract unittest base classes.
* Extract versions into versions.props per PR review.
* Misc test-related refactors per review suggestions:
  - Convert unittests to all use assertj/Assertions for "fluent assertions"
  - Refactor test injection into overloaded initialize() method
  - Add test cases for close() propagation
  - Use CloseableGroup
* Fix unsupported behaviors of loadNamespaceMetadata and defaultWarehouseLocation.
* Move TableIdentifier checks out of newTableOps into the SnowflakeTableOperations class itself; add test case.
* Refactor out any Namespace-related business logic from the lower SnowflakeClient/JdbcSnowflakeClient layers and merge SnowflakeTable and SnowflakeSchema into a single SnowflakeIdentifier that also encompasses ROOT and DATABASE level identifiers. A SnowflakeIdentifier thus functions like a type-checked/constrained Iceberg TableIdentifier and eliminates any tight coupling between a SnowflakeClient and Catalog business logic. Parsing of Namespace numerical levels into a SnowflakeIdentifier is now fully encapsulated in NamespaceHelpers so that callsites don't duplicate namespace-handling/validation logic.
* Finish migrating JdbcSnowflakeClientTest off any usage of org.junit.Assert in favor of assertj's Assertions.
* Style refactorings from review comments; expanded and moved InMemoryFileIO into core with its own unittest.
* Fix behavior of getNamespaceMetadata to throw when the namespace doesn't exist. Refactor for naming conventions and consolidate identifier handling into NamespaceHelpers. Make FileIO instantiated fresh for each newTableOps call.
* Move private constructor to top; add assertion to test case.
* Define minimal ResultSetParser/QueryHarness classes to fully replace any use of commons-dbutils; refactor ResultSet handling fully into JdbcSnowflakeClient.java.
* Update snowflake/src/main/java/org/apache/iceberg/snowflake/SnowflakeTableOperations.java
  Co-authored-by: Eduard Tudenhöfner <[email protected]>
* Refactor per style suggestions: remove debug-level logging, arguments in exceptions, and private members if not accessed outside; move precondition checks; add test for NamespaceHelpers.
* Fix precondition messages; remove getConf().
* Clean up varargs.
* Make data members final; include rawJsonVal in toString for debuggability.
* Combine some small test cases into roundtrip test cases; misc cleanup.
* Add comment for why a factory class is exposed for testing purposes.

Co-authored-by: Dennis Huo <[email protected]>
Co-authored-by: Maninder Parmar <[email protected]>
Co-authored-by: Maninder Parmar <[email protected]>
Co-authored-by: Eduard Tudenhöfner <[email protected]>
This makes an update I requested to the Spark 2 reader. It doesn't update Spark 3 because that refactor is much larger. I'll open a PR for that once apache#1508 is in.