@jonavellecuerdo jonavellecuerdo commented Nov 5, 2024

Purpose and background context

Add functionality to download extract files from AWS S3 to a local MinIO server that acts as a "local S3 bucket". This update improves the performance of the app by reducing the number of requests sent to AWS S3 (potentially reducing costs as the app gets more use) and by avoiding repeated downloads of extract files.

Below are some key takeaways:

  1. The default behavior of CLI command run-diff is to not download extract files (i.e., Docker containers access input files from AWS S3 given the appropriate credentials as environment variables).
  2. To download input files, pass the flag --download-files when invoking the CLI command run-diff.
  3. Instructions for using the workflow with MinIO are provided in README | Running a Local MinIO Server.
  4. This update introduces five (5) new, optional environment variables. Only one of these new environment variables must be explicitly set in the .env file (MINIO_S3_LOCAL_STORAGE); default values are set for the four other environment variables in abdiff.config.
    UPDATED: Three of these new environment variables must be explicitly set in the .env file.
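
A minimal sketch of how required and optional settings like these might be separated, not the actual abdiff.config code. Only MINIO_S3_LOCAL_STORAGE is named in this description. MINIO_ROOT_USER and MINIO_ROOT_PASSWORD are MinIO's standard server credential variables, and treating them as the other two required variables is an assumption. MINIO_S3_URL, its default value, and the function name load_minio_settings are hypothetical.

```python
import os
from collections.abc import Mapping

# Assumption: only MINIO_S3_LOCAL_STORAGE is named in the PR description;
# the other two required names are guesses based on MinIO's own conventions.
REQUIRED = ("MINIO_S3_LOCAL_STORAGE", "MINIO_ROOT_USER", "MINIO_ROOT_PASSWORD")

# Hypothetical optional variable and default, for illustration only.
DEFAULTS = {"MINIO_S3_URL": "http://localhost:9000/"}


def load_minio_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return MinIO settings, failing fast when a required variable is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        # Optional variables fall back to their defaults when unset.
        settings[name] = env.get(name, default)
    return settings
```

Passing the environment in as a mapping keeps the check easy to exercise without touching the real .env file.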

How can a reviewer manually see the effects of these changes?

No additional unit tests have been added as these changes are somewhat complex to model via unit tests. However, example commands (with screenshots of the output) should be sufficient for purposes of review.

  1. Follow the instructions in README | Running a Local MinIO Server to launch a local MinIO server with a Docker container. Make sure to set Dev AWS credentials in your .env file.

  2. Run the following CLI commands in order:

    • Initiate a job.
      pipenv run abdiff --verbose init-job \
         -d output/jobs/test-minio \
        -m "test job with MinIO support" \
        -a 395e612 \
        -b cf1024c
    • Run a diff:
      pipenv run abdiff --verbose run-diff \
         -d output/jobs/test-minio \
         -i "s3://timdex-extract-dev-222053980223/libguides/libguides-2024-06-03-full-extracted-records-to-index.xml,s3://timdex-extract-dev-222053980223/dspace/dspace-2024-03-06-daily-extracted-records-to-index.xml" \
         -m "running diff on two input files with MinIO support" \
         --download-files
  3. Review the output (redacted to focus on download_input_files and run_ab_transforms):

2024-11-07 12:02:17,617 INFO abdiff.cli.main() line 44: Logger 'root' configured with level=DEBUG
2024-11-07 12:02:17,617 INFO abdiff.cli.main() line 45: Running process
2024-11-07 12:02:17,618 INFO abdiff.core.init_run.init_run() line 24: Run directory created: output/jobs/test-minio/runs/2024-11-07_17-02-17
2024-11-07 12:02:17,747 INFO abdiff.extras.minio.download_input_files.download_input_files() line 49: Input file: 1 / 2: 's3://timdex-extract-dev-222053980223/libguides/libguides-2024-06-03-full-extracted-records-to-index.xml' available locally for transformation.
2024-11-07 12:02:17,751 INFO abdiff.extras.minio.download_input_files.download_input_files() line 49: Input file: 2 / 2: 's3://timdex-extract-dev-222053980223/dspace/dspace-2024-03-06-daily-extracted-records-to-index.xml' available locally for transformation.
2024-11-07 12:02:17,751 INFO abdiff.extras.minio.download_input_files.download_input_files() line 59: Available input files: 2, missing input files: 0.
  4. Run the following command, which includes an invalid input file.

    pipenv run abdiff --verbose run-diff \
       -d output/jobs/test-minio \
       -i "s3://timdex-extract-dev-222053980223/libguides/libguides-2024-06-03-full-extracted-records-to-index.xml,s3://timdex-extract-dev-222053980223/dspace/dspace-2024-03-06-daily-extracted-records-to-index.xml,s3://timdex-extract-dev-222053980223/libguides/libguides-2024-11-07-daily-extracted-records-to-index.xml" \
       -m "running diff on two input files with MinIO support" \
       --download-files

    This will result in the following logs and error:

2024-11-07 12:03:27,559 INFO abdiff.cli.main() line 44: Logger 'root' configured with level=DEBUG
2024-11-07 12:03:27,559 INFO abdiff.cli.main() line 45: Running process
2024-11-07 12:03:27,561 INFO abdiff.core.init_run.init_run() line 24: Run directory created: output/jobs/test-minio/runs/2024-11-07_17-03-27
2024-11-07 12:03:27,683 INFO abdiff.extras.minio.download_input_files.download_input_files() line 49: Input file: 1 / 3: 's3://timdex-extract-dev-222053980223/libguides/libguides-2024-06-03-full-extracted-records-to-index.xml' available locally for transformation.
2024-11-07 12:03:27,688 INFO abdiff.extras.minio.download_input_files.download_input_files() line 49: Input file: 2 / 3: 's3://timdex-extract-dev-222053980223/dspace/dspace-2024-03-06-daily-extracted-records-to-index.xml' available locally for transformation.
2024-11-07 12:03:28,816 INFO abdiff.extras.minio.download_input_files.download_input_files() line 55: Input file: 3 / 3: 's3://timdex-extract-dev-222053980223/libguides/libguides-2024-11-07-daily-extracted-records-to-index.xml' failed to download.
2024-11-07 12:03:28,816 INFO abdiff.extras.minio.download_input_files.download_input_files() line 59: Available input files: 2, missing input files: 1.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/jcuerdo/.local/share/virtualenvs/transmogrifier-ab-diff-3LIAwCT6/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/.local/share/virtualenvs/transmogrifier-ab-diff-3LIAwCT6/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/.local/share/virtualenvs/transmogrifier-ab-diff-3LIAwCT6/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/.local/share/virtualenvs/transmogrifier-ab-diff-3LIAwCT6/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/.local/share/virtualenvs/transmogrifier-ab-diff-3LIAwCT6/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/Documents/repos/transmogrifier-ab-diff/abdiff/cli.py", line 175, in run_diff
    download_input_files(input_files_list)
  File "/Users/jcuerdo/Documents/repos/transmogrifier-ab-diff/abdiff/extras/minio/download_input_files.py", line 64, in download_input_files
    raise RuntimeError(  # noqa: TRY003
RuntimeError: 1 input file(s) failed to download.
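
The behavior shown in these logs (attempt every file, log a per-file status, log a final tally, then raise if anything failed) can be sketched roughly as follows. This is an illustration rather than the code in abdiff/extras/minio/download_input_files.py: the actual download step is not shown in this conversation, so it is passed in as a hypothetical download_one callable, and the function name download_all is also hypothetical.

```python
import logging
import subprocess
from collections.abc import Callable

logger = logging.getLogger(__name__)


def download_all(input_files: list[str], download_one: Callable[[str], None]) -> None:
    """Try every input file, log a tally, and raise if any download failed."""
    success_count = 0
    fail_count = 0
    for i, input_file in enumerate(input_files):
        try:
            # Hypothetical download step; assumed to raise CalledProcessError on failure.
            download_one(input_file)
            success_count += 1
            logger.info(
                f"Input file: {i + 1} / {len(input_files)}: '{input_file}' "
                "available locally for transformation."
            )
        except subprocess.CalledProcessError:
            fail_count += 1
            logger.info(
                f"Input file: {i + 1} / {len(input_files)}: '{input_file}' "
                "failed to download."
            )
    logger.info(
        f"Available input files: {success_count}, missing input files: {fail_count}."
    )
    # Raising only after the loop means every file gets an attempt and a log line.
    if fail_count > 0:
        raise RuntimeError(f"{fail_count} input file(s) failed to download.")
```

Deferring the exception until after the loop is what produces the "3 / 3 ... failed to download" line and the tally before the traceback above.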

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Nov 5, 2024
Why these changes are being introduced:
* Downloading extract files improves the performance of the app by
reducing requests sent to AWS S3 and avoiding repeated downloads of
extract files used across multiple container runs. Having extract files
available on local disk also minimizes the occurrence of network issues
or AWS credentials timing out during a transform. These changes introduce
a locally hosted MinIO server to act as a "local S3 bucket" as part of
the A/B diff workflow.

How this addresses that need:
* Add a Docker Compose YAML file to run local MinIO server
* Add Makefile commands for starting and stopping local MinIO server
* Add option '--download-files' to run-diff CLI command
* Implement download_input_files core function
* Update run_ab_transforms to support use of local MinIO server

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-353
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-353-download-extract-files branch from 4daf2ec to baa1cef Compare November 5, 2024 22:20
coveralls commented Nov 5, 2024

Pull Request Test Coverage Report for Build 11727589670

Details

  • 27 of 61 (44.26%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-3.5%) to 87.653%

Changes Missing Coverage                       Covered Lines   Changed/Added Lines   %
abdiff/cli.py                                  3               4                     75.0%
abdiff/core/run_ab_transforms.py               4               5                     80.0%
abdiff/config.py                               8               12                    66.67%
abdiff/extras/minio/download_input_files.py    11              39                    28.21%

Totals
Change from base Build 11691823866: -3.5%
Covered Lines: 717
Relevant Lines: 818
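
As a quick check on how the percentages in this report relate to the line counts, the figures can be reproduced with a small calculation. The helper name coverage_percent is hypothetical; the inputs are the counts reported above.

```python
def coverage_percent(covered: int, relevant: int, ndigits: int = 2) -> float:
    """Return covered lines as a percentage of relevant lines."""
    return round(100 * covered / relevant, ndigits)


# Overall coverage from the totals: 717 covered of 818 relevant lines.
overall = coverage_percent(717, 818, ndigits=3)

# Coverage of the changed lines only: 27 covered of 61 changed or added.
changed = coverage_percent(27, 61)

# Per-file example: abdiff/extras/minio/download_input_files.py.
download_module = coverage_percent(11, 39)
```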

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review November 6, 2024 14:02
@ghukill ghukill left a comment

Overall, I think this looks really good. Nice work @jonavellecuerdo! As we've discussed on huddles, I think this work is more in the "ergonomics" camp than raw functionality, which is often much harder in my opinion.

I was able to get it running successfully, performing a run via local files via my Minio instance. Whoohoo!

I have left a handful of comments, which tips into requesting some changes. Not all are mandatory (requested), but a couple jump out:

  • simplify Minio docker compose file
  • move references to abdiff/helpers to abdiff/extras
  • consider moving download_input_files() to abdiff/extras/minio

I also left a question about concurrent downloads, but truly happy to wait on that to keep this PR focused on functionality.
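
For reference, concurrent downloads could look roughly like the sketch below, using a thread pool from the standard library. This is not part of the PR. The names download_concurrently and download_one and the max_workers default are hypothetical, and the download step is assumed to raise subprocess.CalledProcessError on failure, as in the sequential version.

```python
import subprocess
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor


def download_concurrently(
    input_files: list[str],
    download_one: Callable[[str], None],
    max_workers: int = 4,
) -> tuple[list[str], list[str]]:
    """Download files in parallel threads; return (succeeded, failed) lists."""

    def attempt(input_file: str) -> bool:
        try:
            download_one(input_file)  # hypothetical download step
        except subprocess.CalledProcessError:
            return False
        return True

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as input_files.
        results = list(executor.map(attempt, input_files))

    succeeded = [f for f, ok in zip(input_files, results) if ok]
    failed = [f for f, ok in zip(input_files, results) if not ok]
    return succeeded, failed
```

Threads suit this case because each download spends most of its time waiting on the network or a subprocess rather than on Python computation.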

abdiff/cli.py Outdated
)
def run_diff(job_directory: str, input_files: str, message: str) -> None:
@click.option(
"--download-files", is_flag=True, help="Pass to skip download of extract files"

Does the help message need updating? I would think it's something like,

"Pass to download input files from AWS S3 to a local Minio S3 server for Transmogrifier to use."

Updated.

@ehanson8 ehanson8 left a comment

Looks great, amazing how simple this is to implement! I second Graham's requested changes

aws_secret_access_key=CONFIG.minio_root_password,
)

for input_file in input_files:

I agree with @ghukill 's proposed changes!

* Move download functionality to extras module
* Update Makefile command for stopping MinIO server to use .env
* Clarify instructions for running local MinIO server in README
* Fix flag description for '--download-files' CLI option
* Simplify Docker Compose YAML file
* Simplify logging in 'download_input_files'
  * Log clear messages indicating download counts
  * Log clear messages
* Raise RuntimeError if any files fail to download

@ghukill ghukill left a comment

All changes look great! Approved.

Comment on lines +50 to +66
                f"Input file: {i + 1} / {len(input_files)}: '{input_file}' "
                "available locally for transformation."
            )
        except subprocess.CalledProcessError:
            fail_count += 1
            logger.info(
                f"Input file: {i + 1} / {len(input_files)}: '{input_file}' "
                "failed to download."
            )
    logger.info(
        f"Available input files: {success_count}, missing input files: {fail_count}."
    )

    if fail_count > 0:
        raise RuntimeError(  # noqa: TRY003
            f"{fail_count} input file(s) failed to download."
        )

This looks great! I like the consistent logging structure, the final tally, and the exception raised if any failures. This will be helpful for debugging large amounts of files to download, if anything goes wrong.

@ehanson8 ehanson8 left a comment

Great work!

@jonavellecuerdo jonavellecuerdo merged commit 16ecf73 into main Nov 7, 2024
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-353-download-extract-files branch November 7, 2024 18:23