-
Notifications
You must be signed in to change notification settings - Fork 0
Add option to download input files using a local MinIO server #49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why these changes are being introduced: * Downloading extract files improves the performance of the app by reducing requests sent to AWS S3 and avoiding repeated downloads of extract files used across multiple container runs. Having extract files available on local disk also minimizes the occurence of network issues or AWS credentials timing out during a transform. These changes introduces a locally hosted MinIO server to act as a "local S3 bucket" as part of the A/B diff workflow. How this addresses that need: * Add a Docker Compose YAML file to run local MinIO server * Add Makefile commands for starting and stopping local MinIO server * Add option '--download-files' to run-diff CLI command * Implement download_input_files core function * Update run_ab_transforms to suport use of local MinIO server Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-353
4daf2ec to
baa1cef
Compare
Pull Request Test Coverage Report for Build 11727589670Details
💛 - Coveralls |
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think this looks really good. Nice work @jonavellecuerdo! As we've discussed on huddles, I think this work is more in the "ergonomics" camp than raw functionality, which is often much harder in my opinion.
I was able to get it running successfully, performing a run via local files via my Minio instance. Whoohoo!
I have left a handful of comments, which tips into requesting some changes. Not all are mandatory (requested), but a couple jump out:
- simplify Minio docker compose file
- move references to
abdiff/helperstoabdiff/extras - consider moving
download_input_files()toabdiff/extras/minio
I also left a question about concurrent downloads, but truly happy to wait on that to keep this PR focused on functionality.
abdiff/cli.py
Outdated
| ) | ||
| def run_diff(job_directory: str, input_files: str, message: str) -> None: | ||
| @click.option( | ||
| "--download-files", is_flag=True, help="Pass to skip download of extract files" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the help message need updating? I would think it's something like,
"Pass to download input files from AWS S3 to a local Minio S3 server for Transmogrifier to use."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, amazing how simple this is to implement! I second Graham's requested changes
abdiff/core/download_input_files.py
Outdated
| aws_secret_access_key=CONFIG.minio_root_password, | ||
| ) | ||
|
|
||
| for input_file in input_files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @ghukill 's proposed changes!
* Move download functionality to extras module * Update Makefile command for stopping MinIO server to use .env * Clarify instructions for running local MinIO server in README * Fix flag description for '--download-files' CLI option * Simplify Docker Compose YAML file * Simplify logging in 'download_input_files' * Log clear messages indicating download counts * Log clear messages * Raise RuntimeError if any files fail to download
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All changes look great! Approved.
| f"Input file: {i + 1} / {len(input_files)}: '{input_file}' " | ||
| "available locally for transformation." | ||
| ) | ||
| except subprocess.CalledProcessError: | ||
| fail_count += 1 | ||
| logger.info( | ||
| f"Input file: {i + 1} / {len(input_files)}: '{input_file}' " | ||
| "failed to download." | ||
| ) | ||
| logger.info( | ||
| f"Available input files: {success_count}, missing input files: {fail_count}." | ||
| ) | ||
|
|
||
| if fail_count > 0: | ||
| raise RuntimeError( # noqa: TRY003 | ||
| f"{fail_count} input file(s) failed to download." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I like the consistent logging structure, the final tally, and the exception raised if any failures. This will be helpful for debugging large amounts of files to download, if anything goes wrong.
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
Purpose and background context
Add functionality to download extract files from AWS S3 to a local MinIO server that acts as a "local S3 bucket" . This update improves the performance of the app by reducing the number of data pulls / requests sent to AWS S3 (potentially reducing costs down the line as the app gets more use) and avoiding repeated downloads of extract files.
Below are some key takeaways:
run-diffis to not download extract files (i.e., Docker containers access input files from AWS S3 given the appropriate credentials as environment variables).--download-fileswhen invoking the CLI commandrun-diff.Only one of these new environment variables must be explicitly set in the.envfile (MINIO_S3_LOCAL_STORAGE); default values are set for the four other environment variables inabdiff.config.UPDATED: Three of these new environment variables must be explicitly set in the
.envfile.How can a reviewer manually see the effects of these changes?
No additional unit tests have been added as these changes are somewhat complex to model via unit tests. However, example commands (with screenshots of the output) should be sufficient for purposes of review.
Follow the instructions in README | Running a Local MinIO Server to launch a local MinIO server with a Docker container. Make sure to set Dev AWS credentials in your
.envfile.Run the following CLI commands in order:
pipenv run abdiff --verbose init-job \ -d output/jobs/test-minio \ -m "test job with MinIO support" \ -a 395e612 \ -b cf1024cReview the output (redacted to focus on
download_input_filesandrun_ab_transforms:Run the following command that includes an invalid input file.
This will result in the following logs and error:
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)