
Conversation

@ghukill (Contributor) commented Nov 1, 2024

Purpose and background context

This PR partially addresses TIMX-379, which is concerned with performing large / complex runs.

The goal: perform a run of ABDiff where the input files cover ALL sources in TIMDEX, going back to each source's most recent "full" run and then including all subsequent "daily" runs.

This full list of input files comes to around 1.6k files (very many!) and presents two challenges:

  1. creating this list of input files
  2. passing them to the CLI command run-diff as comma-separated values for -i / --input-files

This PR addresses these challenges in two ways:

  1. creates a new CLI command timdex-sources-csv that generates a CSV of input files based on some criteria
  2. updates the CLI command run-diff to support a CSV file for the -i / --input-files argument (sketched below)
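
For illustration, the CSV handling on the run-diff side amounts to something like this (a minimal sketch, not the actual implementation; the helper name parse_input_files is hypothetical):

import csv
from pathlib import Path

def parse_input_files(input_files: str) -> list[str]:
    """Resolve the -i / --input-files value into a list of S3 URIs.

    Accepts either a comma-separated string of URIs or the path to a
    local CSV file with one URI per row (hypothetical behavior sketch).
    """
    if input_files.endswith(".csv") and Path(input_files).exists():
        with open(input_files) as f:
            return [row[0].strip() for row in csv.reader(f) if row]
    return [uri.strip() for uri in input_files.split(",")]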

I would highlight a couple of characteristics of these changes:

  • they are almost entirely additive, meaning no original logic or behavior is changed
  • they are not well tested

I anticipate that the "not well tested" part may make some reviewers nervous, but as purely additive changes that only initiate runs, I would propose that's okay. These changes support larger runs for testing at scale, which in turn helps focus and direct work on the bugs and features that do warrant more rigorous testing.

How can a reviewer manually see the effects of these changes?

1- Create a job

pipenv run abdiff --verbose init-job -d output/jobs/csv -a 008e20c -b 395e612

2- Set AWS credentials for production in the .env file

3- Set the additional env var required for the CLI command timdex-sources-csv:

TIMDEX_BUCKET=timdex-extract-prod-300442551476

4- Run the new CLI command to generate a CSV file of input files for the libguides and gismit sources

pipenv run abdiff --verbose timdex-sources-csv \
-o output/jobs/csv/test.csv \
-s libguides,gismit
  • observe that the new CSV file created at output/jobs/csv/test.csv contains ~182 S3 URIs
  • files are ordered from the last "full" run through all following "daily" runs (illustrated below)
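
Purely hypothetical rows, to illustrate that ordering (the actual key naming scheme is not shown in this PR):

s3://timdex-extract-prod-300442551476/libguides/<full-run-date>-full-extracted-...
s3://timdex-extract-prod-300442551476/libguides/<daily-run-date-1>-daily-extracted-...
s3://timdex-extract-prod-300442551476/libguides/<daily-run-date-2>-daily-extracted-...
...(gismit files follow the same pattern)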

5- Run diff and pass CSV

pipenv run abdiff --verbose run-diff \
-d output/jobs/csv \
-i output/jobs/csv/test.csv
  • observe that the run uses the input files from the CSV
  • though there are 182 input files (and so 364 containers), the run completes fairly quickly given the small input file sizes: about 2-3 minutes total

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • TIMX-379: https://mitlibraries.atlassian.net/browse/TIMX-379

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

This application is a standalone app to compare A/B versions of the
output of Transmogrifier.  To this end, there are some rote and repetitive
tasks that may need to be performed to provide the inputs this application
needs.

How this addresses that need:
* Establishes an 'abdiff.helpers' module that contains code helpful
for working with this application, but not essential to its core functionality
* The first sub-module of helpers is code that helps you get lists of ordered
files from the TIMDEX S3 bucket (sketched below)
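
For illustration, the listing piece looks roughly like this (a minimal sketch assuming boto3; the function name is illustrative, not the module's actual API):

import boto3

def list_extracted_files(bucket: str, prefix: str = "") -> list[str]:
    """List S3 keys under a prefix, keeping only 'extracted' files."""
    s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return [key for key in keys if "extracted" in key]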

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-379

Why these changes are being introduced:

For very large runs, passing comma-separated input files for Transmogrifier
on the command line does not scale.  Two things are needed and helpful:

1. allow the CLI command run-diff to accept a list of input files via another method
2. a tool to help craft lists of input files from S3 based on some criteria

How this addresses that need:
* A new CLI command 'timdex-sources-csv' is added that generates a list of input
files based on the S3 bucket set by env var 'TIMDEX_BUCKET'.  This utilizes the
helper methods added in a former commit.  The default, for all sources, is to list
extract files going back to the last full run, then every subsequent daily run
(see the sketch after these bullets).  This can also be limited to a subset of sources.

* Updates CLI command run-diff to allow the -i / --input-files argument to
accept a local CSV file, which is parsed and used as the input files list.
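
To make the default selection concrete, the "since last full run" logic is essentially the following (a sketch only; _is_full_extract, _is_daily_extract, and _extract_date are the private filename helpers discussed in review, assumed given here):

def files_since_last_full_run(files: list[str]) -> list[str]:
    """Return the most recent 'full' extract files, followed by every
    'daily' extract file dated after them."""
    # _is_full_extract / _is_daily_extract / _extract_date are assumed helpers
    full_extracts = [f for f in files if _is_full_extract(f)]
    most_recent_full_date = max(_extract_date(f) for f in full_extracts)
    ordered = sorted(
        f for f in full_extracts if _extract_date(f) == most_recent_full_date
    )
    ordered += sorted(
        f
        for f in files
        if _is_daily_extract(f) and _extract_date(f) > most_recent_full_date
    )
    return ordered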

Side effects of this change:
* None really, except being able to more easily support runs with large numbers
of input files.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-379
@ghukill marked this pull request as ready for review November 1, 2024 14:20
@ehanson8 (Contributor) left a comment:

2 optional suggestions but otherwise approved. I would caution that the helpers vs utils line may get fuzzy at times. I get not wanting to create über modules and I see the distinction you're making between essential vs. convenience but I think it forces a decision on where to put future global functions that may not always be clear. But not asking for a change, just raising the issue 🙂

Contributor:

Totally optional: I would argue that inverting the current order of the non-private functions in this module would be more logical, since each is called by the one below it. In other repos, we've generally had the most expansive function, like get_ordered_extracted_files_all_sources, up at the top.

Contributor:

I agree that a reordering would make the flow easier to follow. 🤔 I think we've had some conversations about the order of functions/methods (and can't recall where we landed), but I would expect:

get_ordered_extracted_files_all_sources # kinda' an "entrypoint" / main function called by CLI
get_ordered_extracted_files_since_last_full_run
get_extracted_files_for_source
<everything else>

Contributor Author (@ghukill):

I take your points, and agree that often the "most expensive" methods tend to drift towards the top. However, I also think that organizing methods in the order a user or application would use them makes sense. Like you might see a database class with an order like:

class DBClient:
    def connect(self): ...
    def close(self): ...
    def query(self): ...

...where query() is arguably the most expensive, but conceptually also comes after managing connections.

Counter-point for the file in question: the functions read sequentially in order of granularity:

  1. list files
  2. list files, for a single source
  3. list files, for a single source, do so for all sources

And small, private functions like _is_full_extract() follow the first function that uses them.

I think of it almost like a file that contains math functions, where a logical ordering would be:

def add():
def subtract():
def multiply():
def divide():
def exponent():
...

I think this file is unique in that it likely won't change, and anything more complex will build on these functions and thus slot in at the bottom.

But, happy to reorder here, and will do so! Just wanted to share these thoughts for future PRs: there might be more nuance and context than just "biggest and most expensive at the top."

WEBAPP_PORT=# port for flask webapp
TRANSMOGRIFIER_MAX_WORKERS=# max number of Transmogrifier containers to run in parallel; default is 6
TRANSMOGRIFIER_TIMEOUT=# timeout for a single Transmogrifier container; default is 5 hours
TIMDEX_BUCKET=# when using CLI command 'timdex-sources-csv', this is required to know what TIMDEX bucket to use
Contributor:

Potential future enhancement: couldn't this env var be relevant for other CLI commands? Once we can get Transmogrifier sped up, I would assume diffs on all sources in a bucket would be more common than diffs on individual S3 files.

Contributor Author (@ghukill):

Not sure I totally follow...

For running diffs we currently pass full S3 URIs, which do contain the TIMDEX bucket. This is required now for the TIMDEX sources extra functions, which need to know what bucket to scan.

Are you suggesting that maybe in the future, we'll just pass the TIMDEX bucket, and ABDiff would just run it against everything? If so -- I love the audacity -- but I'd suggest we update the docstring when that's more feasible, and more specifically when the run-diff argument -i / --input-files is no longer required.

@ghukill (Contributor Author) commented Nov 1, 2024

> 2 optional suggestions but otherwise approved. I would caution that the helpers vs utils line may get fuzzy at times. I get not wanting to create über modules and I see the distinction you're making between essential vs. convenience but I think it forces a decision on where to put future global functions that may not always be clear. But not asking for a change, just raising the issue 🙂

Definitely take your point, and I was not terribly happy with helpers. I'm glad you mentioned this, as I just interacted with a Python library where I like a convention they use: extras. It's the psycopg2 Postgres DB driver library, and extras are bits and bobs that have nothing to do with core functionality, but are nice little add-ons that help you work with the library in different contexts. Like here, where it helps you create artifacts for running diffs... but not the diffs themselves.

I would think then that:

  • utils --> implies utility functions and code used by the core application
  • extras --> functions, wrappers, data, etc. that help in specific conditions or contexts

An example of an extra could be something like parallel programming support. It may wrap the entire application in some kind of fancy parallel programming paradigm, but it's absolutely not required and not part of core functionality.

Going to rename to extras and will re-request review, at which point please push back if it's a worse name!

@ghukill requested a review from ehanson8 November 1, 2024 20:26
@ghukill (Contributor Author) commented Nov 1, 2024

@ehanson8, @jonavellecuerdo - function reordering and helpers to extras rename here: 5701e9b

@jonavellecuerdo (Contributor) left a comment:

@ghukill Two small comments but looking good to me! I also ran the CLI function and it works as expected -- quick, too. :D

return [file for file in files if "extracted" in file]


def get_ordered_extracted_files_since_last_full_run(source: str) -> list[str]:
Contributor:

Hmm, what do you think about updating utils.parse_timdex_filename to accept an optional second argument: a list of strings denoting filename detail components (e.g. "date", "run-type")? If the second arg is provided, the function returns only the specified components. You could then update the "private" methods _is_full_extract, _is_daily_extract, and _extract_date to call the util. 🤔
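
Roughly something like this (a hypothetical sketch; it assumes the utility currently returns a dict of all parsed filename components, and _parse_all_components stands in for that existing parse):

def parse_timdex_filename(
    filename: str, components: list[str] | None = None
) -> dict[str, str]:
    """Parse a TIMDEX filename into its detail components; if
    'components' is provided, return only those keys."""
    parsed = _parse_all_components(filename)  # hypothetical full parse
    if components:
        return {key: parsed[key] for key in components}
    return parsed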

Contributor Author (@ghukill):

Interesting...

While conceptually I like it, I think I might lean into this code being "extra" in the sense that it shouldn't apply too much pressure for change on the "core" code. If we had reason to extract single values from the filenames in "core" code, that'd be one thing... but I think we primarily want at least 3-4 items each time we use it.

Thoughts?

Contributor:

Ah, I think it's fine as-is! I think considering the parse_timdex_filename utility as part of the "core" set of functions makes sense, too.

Comment on lines 53 to 56
# Collect all full extract files with the most recent date
most_recent_full_files = sorted(
f for f in full_extracts if _extract_date(f) == most_recent_full_date
)
Contributor:

Question: most_recent_full_files is plural because of the possibility of sequenced files (i.e., files ending with an index like "01"), right? Is there ever a scenario where we'd have: two "full" extract files for the same source on the same date (not sequenced)? 🤔

Contributor Author (@ghukill):

Exactly. It used to be most_recent_full_file!

@coveralls

Pull Request Test Coverage Report for Build 11635874202

Details

  • 27 of 67 (40.3%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-4.9%) to 91.207%

Changes missing coverage:

  File                              Covered Lines   Changed/Added Lines   %
  abdiff/config.py                  2               3                     66.67%
  abdiff/cli.py                     12              20                    60.0%
  abdiff/extras/timdex_sources.py   13              44                    29.55%

Totals:
  Change from base Build 11630326869: -4.9%
  Covered Lines: 695
  Relevant Lines: 762

💛 - Coveralls

@ehanson8 (Contributor) left a comment:

extras is fine by me

@ghukill merged commit fd3a0c8 into main Nov 1, 2024 (2 checks passed)
@ghukill deleted the TIMX-379-input-files-options-and-helpers branch November 5, 2024 14:21