TIMX 379 - CSV for input files support and helpers #45

Conversation
Why these changes are being introduced: This is a standalone application to compare A/B versions of the output of Transmogrifier. To this end, there are some rote and repetitive tasks that may need to be performed to provide the inputs this application needs.

How this addresses that need:
* Establishes an 'abdiff.helper' module that contains code helpful for working with this application, but not essential to its core functionality
* The first sub-module of helpers is code that helps you get lists of ordered files from the TIMDEX S3 bucket

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-379
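(As a rough illustration of the kind of helper this commit describes, here is a minimal boto3 sketch; the function and variable names are illustrative, not the actual abdiff.helper API:)

```python
# Illustrative sketch only: listing extract files from the TIMDEX bucket
# with boto3. Names are assumptions, not the actual abdiff.helper API.
import os

import boto3

def list_extracted_files(source: str) -> list[str]:
    """Return sorted S3 URIs of extract files under a source prefix."""
    bucket = os.environ["TIMDEX_BUCKET"]
    s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    files = []
    for page in paginator.paginate(Bucket=bucket, Prefix=source):
        for obj in page.get("Contents", []):
            if "extracted" in obj["Key"]:
                files.append(f"s3://{bucket}/{obj['Key']}")
    return sorted(files)
```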
Why these changes are being introduced: For very large runs, passing comma-separated input files to Transmogrifier on the command line does not scale. Two things are needed and helpful:
1. allow the CLI command run-diff to accept a list of input files via another method
2. a tool to help craft lists of input files from S3 based on some criteria

How this addresses that need:
* A new CLI command 'timdex-sources-csv' generates a list of input files based on the S3 bucket set by the env var 'TIMDEX_BUCKET'. This utilizes the helper methods added in a former commit. The default, for all sources, is to list extract files going back to the last full run, then every subsequent daily run. This can also be limited to a subset of sources.
* Update the CLI command run-diff to allow the -i / --input-files argument to accept a local CSV file, which is parsed and used as the input files list.

Side effects of this change:
* None really, except being able to more easily support runs with large numbers of input files.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-379
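(A minimal sketch of the CSV-or-comma-separated handling described above; the helper name and exact behavior are assumptions, not the PR's actual implementation:)

```python
# Minimal sketch of how -i / --input-files could accept either a local CSV
# filepath or comma-separated URIs; an assumption, not the PR's actual code.
import csv
from pathlib import Path

def resolve_input_files(value: str) -> list[str]:
    """Return input file URIs from a CSV file if given, else split on commas."""
    path = Path(value)
    if path.suffix == ".csv" and path.exists():
        with path.open(newline="") as f:
            return [row[0] for row in csv.reader(f) if row]
    return [uri.strip() for uri in value.split(",")]
```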
ehanson8
left a comment
2 optional suggestions, but otherwise approved. I would caution that the helpers vs. utils line may get fuzzy at times. I get not wanting to create über modules, and I see the distinction you're making between essential vs. convenience, but I think it forces a decision on where to put future global functions that may not always be clear. But not asking for a change, just raising the issue 🙂
Totally optional: I would argue that inverting the current order of the non-private functions in this module gives a more logical order, since each is called by the one below it. In other repos, we've generally had the most expansive function, like get_ordered_extracted_files_all_sources, up at the top.
I agree that a reordering would make the flow easier to follow. 🤔 I think we've had some conversations about the order of functions/methods (and can't recall where we landed), but I would expect:

```
get_ordered_extracted_files_all_sources  # kinda' an "entrypoint" / main function called by CLI
get_ordered_extracted_files_since_last_full_run
get_extracted_files_for_source
<everything else>
```
I take your points, and agree that often the most "expansive" methods tend to drift towards the top. However, I also think that organizing methods in the order a user or application may use them makes sense. You might see a database class with an order like:

```python
class DBClient:
    def connect(): ...
    def close(): ...
    def query(): ...
```

...where query() is arguably the most expensive, but conceptually also comes after managing connections.
A counterpoint for this file in question: the functions read sequentially in order of granularity:
- list files
- list files, for a single source
- list files, for a single source, done for all sources

And small, private functions like _is_full_extract() follow the first function that uses them.
I think of it almost like a file that contains functions for math, where I think a logical ordering would be:

```python
def add(): ...
def subtract(): ...
def multiply(): ...
def divide(): ...
def exponent(): ...
```

...I think this file is unique, in that it likely won't change, or anything more complex will build on these and thus slot to the bottom.
But, happy to reorder here, and will do so! Just wanted to share these thoughts for future PRs, since there might be more nuance and context than just "biggest and most expansive at the top."
```
WEBAPP_PORT=# port for flask webapp
TRANSMOGRIFIER_MAX_WORKERS=# max number of Transmogrifier containers to run in parallel; default is 6
TRANSMOGRIFIER_TIMEOUT=# timeout for a single Transmogrifier container; default is 5 hours
TIMDEX_BUCKET=# when using CLI command 'timdex-sources-csv', this is required to know what TIMDEX bucket to use
```
Potential future enhancement: couldn't this env var be relevant for other CLI commands? Once we can get Transmogrifier sped up, I would assume diffs on all sources in a bucket will be more common than diffs on individual S3 files.
Not sure I totally follow...
For running diffs we currently pass full S3 URIs, which do contain the TIMDEX bucket. This is required now for the TIMDEX sources extra functions, which need to know what bucket to scan.
Are you suggesting that maybe in the future, we'll just pass the TIMDEX bucket, and ABDiff would just run it against everything? If so -- I love the audacity -- but I'd suggest we update the docstring when that's more feasible, and more specifically when the run-diff argument -i / --input-files is no longer required.
Definitely take your point, and was not terribly happy with the original naming. Going to rename to extras.
@ehanson8, @jonavellecuerdo - function reordering and the extras rename are in.
jonavellecuerdo
left a comment
@ghukill Two small comments but looking good to me! I also ran the CLI function and it works as expected -- quick, too. :D
```python
    return [file for file in files if "extracted" in file]

# ...

def get_ordered_extracted_files_since_last_full_run(source: str) -> list[str]:
```
Hmm, what do you think about: updating utils.parse_timdex_filename to accept an optional second argument: a list of strings denoting filename detail components (e.g. "date", "run-type"); if the second arg is provided, the function returns only the specified components? You could then update the "private" methods _is_full_extract, _is_daily_extract, and _extract_date to call the util. 🤔
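(For concreteness, a sketch of what that optional argument could look like; the filename pattern and return shape here are assumptions for illustration, not the actual utility:)

```python
import re

# Assumed filename pattern for illustration only; the real TIMDEX naming
# convention may differ.
FILENAME_PATTERN = re.compile(
    r"(?P<source>[a-z]+)-(?P<date>\d{4}-\d{2}-\d{2})-(?P<run_type>full|daily)-"
)

def parse_timdex_filename(
    filename: str, components: list[str] | None = None
) -> dict[str, str]:
    """Parse filename details; if components are given, return only those."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        return {}
    parsed = match.groupdict()
    if components is not None:
        parsed = {key: parsed[key] for key in components}
    return parsed

# e.g. parse_timdex_filename("alma-2024-06-01-full-extract.xml", ["date"])
# -> {"date": "2024-06-01"}
```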
Interesting...

While conceptually I like it, I think I might lean into this code being "extra," in the sense that it shouldn't apply too much pressure to change the "core" code. If we had reason to extract single values from the filenames in "core" code, that'd be one thing... but I think we primarily want at least 3-4 items each time we use it.

Thoughts?
Ah, I think it's fine as-is! I think considering the parse_timdex_filename utility as part of the "core" set of functions makes sense, too.
```python
# Collect all full extract files with the most recent date
most_recent_full_files = sorted(
    f for f in full_extracts if _extract_date(f) == most_recent_full_date
)
```
Question: most_recent_full_files is plural because of the possibility of sequenced files (i.e., files ending with an index like "01"), right? Is there ever a scenario where we'd have: two "full" extract files for the same source on the same date (not sequenced)? 🤔
Exactly. It used to be most_recent_full_file!
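(For illustration, with invented filenames: sequenced files for the same source and date sort into sequence order, which is why a sorted list is kept rather than a single file:)

```python
# Invented filenames for illustration only, not actual TIMDEX S3 keys.
files = [
    "alma-2024-06-01-full-extracted-records-to-index_02.xml",
    "alma-2024-06-01-full-extracted-records-to-index_01.xml",
]
assert sorted(files)[0].endswith("_01.xml")  # "_01" sorts before "_02"
```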
Pull Request Test Coverage Report for Build 11635874202 (details elided)
💛 - Coveralls
ehanson8
left a comment
extras is fine by me
Purpose and background context
This PR partially addresses TIMX-379, which is concerned with performing large/complex runs.
The goal: perform a run of ABDiff where the input files are for ALL sources in TIMDEX, going back to their most recent "full" run, and then all subsequent "daily" runs.
This full list of input files is around 1.6k (very many!) and exposes two challenges:

- passing ~1.6k files to `run-diff` as comma-separated values for `-i / --input-files` does not scale
- crafting that list of input files from S3 by hand is impractical

This PR addresses this in two ways:

- a new CLI command `timdex-sources-csv` that will generate a CSV of input files based on some criteria
- an update to `run-diff` to support a CSV for the `-i / --input-files` argument for input files

I would highlight a couple of characteristics of these changes: they are purely additive, housed in an `extras` module outside the core application code, and they are not well tested.
I anticipate that the "not well tested" part makes some nervous, but as purely additive changes just for initiating runs, I would propose that's okay. These changes support larger runs for testing at scale, which in turn focus and direct work on the bugs and features that do warrant more rigorous testing.
How can a reviewer manually see the effects of these changes?
1. Create a job
2. Set AWS credentials for production in the `.env` file
3. Set the additional env var required for CLI command `timdex-sources-csv` (`TIMDEX_BUCKET`)
4. Run the new CLI command to generate a CSV file of input files for the `libguides` and `gismit` sources; `output/jobs/csv/test.csv` then contains ~182 S3 URIs
5. Run the diff and pass the CSV (see the sketch after this list)
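A hypothetical version of steps 3-5: the `TIMDEX_BUCKET` env var and the `-i / --input-files` argument come from this PR, but the CLI entrypoint name and the other flag names are assumptions, as the exact invocations are not reproduced here:

```
# 3. env var required by timdex-sources-csv (in .env or exported):
TIMDEX_BUCKET=<production TIMDEX bucket name>

# 4. generate a CSV of input files for two sources (flag names assumed):
abdiff timdex-sources-csv --sources libguides,gismit --output-file output/jobs/csv/test.csv

# 5. run the diff, passing the CSV to -i / --input-files:
abdiff run-diff -i output/jobs/csv/test.csv
```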
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
- https://mitlibraries.atlassian.net/browse/TIMX-379
Developer
Code Reviewer(s)