Proposal for dbfs file access api with mocked test suite #3236
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Changes
scan-tables-in-mounts.Why this PR?
Refactoring
scan-tables-in-mountsto enhance performance will need to leverage parallelized crawling ofdbfs:/mnt/locations. @nfx suggested using backend Hadoop libraries throughpy4jfor this through theSparkSession. Since this is a fairly big change, I wanted to make sure we are aligned on the backend file lister before I proceed to modify any existing files.Enable
dbfsfile listingThis PR proposes the
DbfsFilesclass which uses theSparkSession's java backend withpy4jto leverage the following Hadoop libraries for efficientdbfs:/(andmount) file access:org.apache.hadoop.fs.FileSystemorg.apache.hadoop.fs.PathAs of now, the only useful method is
list_dir, but if the approach is confirmed I plan to add crawling capabilities leveragingdatabricks.labs.blueprint.parallel.Test strategy
A testing strategy was painstakingly created to:
scan-tables-in-mounts. I created a simple trie-based mock file system which can be leveraged forscan-tables-in-mountsfunctionality after the refactor. This mock file system also has it's own unit tests.Functionality
databricks labs ucx .........Tests