Delta Lake is a well-known, industry-standard approach to versioning structured datasets, but Machine Learning / Deep Learning problems often require unstructured data for model training and testing. Examples include:
- Image / video files for computer vision problems
- Text files / documents for NLP problems
- Audio files for sound-related ML problems
Now, Delta Lake actually supports image files as well, but there are limitations on file size, specifically:
For large image files (average image size greater than 100 MB), Databricks recommends using the Delta table only to manage the metadata (list of file names) and loading the images from the object store using their paths when needed.
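As a rough illustration of that recommendation, the sketch below reads image paths out of a hypothetical Delta table (here called `image_catalog`, with a `path` column) and loads the heavy image bytes directly from storage only when needed; the table name, column name, and use of PIL are assumptions for the example, not an established schema or API.

```python
# Minimal sketch of the "metadata in Delta, bytes in object storage" pattern.
# The table name "image_catalog" and its "path" column are illustrative.
from pyspark.sql import SparkSession
from PIL import Image

spark = SparkSession.builder.getOrCreate()

# The Delta table only stores metadata: paths, labels, checksums, etc.
catalog = spark.read.table("image_catalog")

# Load the large image files lazily, straight from storage, when needed.
for row in catalog.select("path").limit(10).collect():
    with open(row.path, "rb") as f:   # swap for fsspec / dbutils for cloud URIs
        img = Image.open(f)
        img.load()                    # materialize pixels while the handle is open
    # ... hand `img` to preprocessing / the training pipeline
```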
Since Delta Lake already provides us with great capabilities for structured data versioning, why don't we also use it for unstructured data?
The high-level flow (a code sketch follows this list):
- We keep a catalog of the unstructured assets (in a given filepath), stored as a Delta Lake table
- Whenever there are new files, we update the catalog (which increments the Delta Table version number)
- Whenever there are changed files, we update the corresponding file's record in the catalog, for example by recomputing and storing its checksum
- Whenever there are removed files, we tombstone the corresponding file's record in the catalog
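Here is a minimal sketch of that flow, assuming a hypothetical `asset_catalog` Delta table with `path`, `checksum`, `size_bytes`, and `is_deleted` columns, and an asset directory mounted at a local path; all names and paths are illustrative, and the tombstoning clause assumes Delta Lake 2.3+.

```python
# A sketch of the catalog-update job: scan the asset directory, checksum every
# file, and MERGE the snapshot into the (hypothetical) "asset_catalog" table.
import hashlib
from pathlib import Path

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
ASSET_DIR = "/dbfs/mnt/raw/assets"   # assumed asset location
CATALOG = "asset_catalog"            # assumed Delta table name


def scan_assets(root: str):
    """Walk the asset directory and compute a checksum per file."""
    for p in Path(root).rglob("*"):
        if p.is_file():
            digest = hashlib.md5(p.read_bytes()).hexdigest()
            yield (str(p), digest, p.stat().st_size, False)


snapshot = spark.createDataFrame(
    list(scan_assets(ASSET_DIR)),
    schema="path STRING, checksum STRING, size_bytes LONG, is_deleted BOOLEAN",
)

catalog = DeltaTable.forName(spark, CATALOG)
(
    catalog.alias("t")
    .merge(snapshot.alias("s"), "t.path = s.path")
    # Changed files: refresh the checksum and clear any earlier tombstone.
    .whenMatchedUpdate(
        condition="t.checksum <> s.checksum OR t.is_deleted",
        set={"checksum": "s.checksum", "size_bytes": "s.size_bytes", "is_deleted": "false"},
    )
    # New files: insert a fresh record.
    .whenNotMatchedInsertAll()
    # Removed files: tombstone the record (requires Delta Lake 2.3+).
    .whenNotMatchedBySourceUpdate(set={"is_deleted": "true"})
    .execute()
)
```

Each run of this job that actually changes rows commits a new version of the Delta table, which is what gives the catalog its history.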
There are some pretty obvious limitations to this approach:
- Does not facilitate time travel for the files themselves: we maintain only a record of the metadata, not the actual data files, so we can't revert changed files or recover removed ones (though the catalog's own history remains queryable; see the sketch after this list)
- Needs a manual or frequently recurring job to scan the filepath and determine what has changed
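On the first limitation: while the file contents can't be recovered from the catalog alone, the catalog's history is still queryable through Delta time travel, so you can at least see which files existed, and what their checksums were, at an earlier version. A small sketch, reusing the hypothetical `asset_catalog` table and an illustrative version number:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the catalog as it looked at an earlier (illustrative) table version.
old_catalog = spark.read.option("versionAsOf", 3).table("asset_catalog")

# Files that were present (not tombstoned) at that point in time.
old_catalog.filter("NOT is_deleted").select("path", "checksum").show()
```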