
Conversation

@ghukill (Contributor) commented Nov 1, 2024

Purpose and background context

This PR partially addresses TIMX-379, which is concerned with performing large / complex runs.

The goal: perform a run of ABDiff where the input files cover ALL sources in TIMDEX, going back to each source's most recent "full" run and then including all subsequent "daily" runs.

This full list of input files comes to around 1.6k files (very many!) and presents two challenges:

  1. creating this list of input files
  2. passing them to the CLI command run-diff as comma-separated values for -i / --input-files

This PR addresses these challenges in two ways:

  1. creates a new CLI command timdex-sources-csv that generates a CSV of input files based on some criteria
  2. updates the CLI command run-diff to support a CSV file for the -i / --input-files argument (sketched below)
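
For illustration, the CSV handling on the run-diff side amounts to something like this (a minimal sketch, not the actual implementation; the helper name parse_input_files is hypothetical):

import csv
from pathlib import Path

def parse_input_files(input_files: str) -> list[str]:
    """Resolve the -i / --input-files value into a list of S3 URIs.

    Accepts either a comma-separated string of URIs or the path to a
    local CSV file with one URI per row (hypothetical behavior sketch).
    """
    if input_files.endswith(".csv") and Path(input_files).exists():
        with open(input_files) as f:
            return [row[0].strip() for row in csv.reader(f) if row]
    return [uri.strip() for uri in input_files.split(",")]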

I would highlight a couple of characteristics of these changes:

  • they are almost entirely additive, meaning no original logic or behavior is changed
  • they are not well tested

I anticipate that the "not well tested" part may make some reviewers nervous, but as purely additive changes that only initiate runs, I would propose that's okay. These changes support larger runs for testing at scale, which in turn helps focus and direct work on the bugs and features that do warrant more rigorous testing.

How can a reviewer manually see the effects of these changes?

1- Create a job

pipenv run abdiff --verbose init-job -d output/jobs/csv -a 008e20c -b 395e612

2- Set AWS credentials for production in the .env file

3- Set the additional env var required for the CLI command timdex-sources-csv:

TIMDEX_BUCKET=timdex-extract-prod-300442551476

4- Run the new CLI command to generate a CSV file of input files for the libguides and gismit sources

pipenv run abdiff --verbose timdex-sources-csv \
-o output/jobs/csv/test.csv \
-s libguides,gismit
  • observe that the new CSV file created at output/jobs/csv/test.csv contains ~182 S3 URIs
  • files are ordered from the last "full" run through all following "daily" runs (illustrated below)
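
Purely hypothetical rows, to illustrate that ordering (the actual key naming scheme is not shown in this PR):

s3://timdex-extract-prod-300442551476/libguides/<full-run-date>-full-extracted-...
s3://timdex-extract-prod-300442551476/libguides/<daily-run-date-1>-daily-extracted-...
s3://timdex-extract-prod-300442551476/libguides/<daily-run-date-2>-daily-extracted-...
...(gismit files follow the same pattern)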

5- Run diff and pass CSV

pipenv run abdiff --verbose run-diff \
-d output/jobs/csv \
-i output/jobs/csv/test.csv
  • observe that the run uses the input files from the CSV
  • though there are 182 input files (and so 364 containers), the run completes fairly quickly given the small input file sizes: about 2-3 minutes total

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • TIMX-379: https://mitlibraries.atlassian.net/browse/TIMX-379

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

This application is a standalone app to compare A/B versions of the
output of Transmogrifier.  To this end, there are some rote and repetitive
tasks that may need to be performed to provide the inputs this application
needs.

How this addresses that need:
* Establishes an 'abdiff.helpers' module that contains code helpful
for working with this application, but not essential to its core functionality
* The first sub-module of helpers is code that helps you get lists of ordered
files from the TIMDEX S3 bucket (sketched below)
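
For illustration, the listing piece looks roughly like this (a minimal sketch assuming boto3; the function name is illustrative, not the module's actual API):

import boto3

def list_extracted_files(bucket: str, prefix: str = "") -> list[str]:
    """List S3 keys under a prefix, keeping only 'extracted' files."""
    s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return [key for key in keys if "extracted" in key]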

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-379

Why these changes are being introduced:

For very large runs, passing comma-separated input files for Transmogrifier
on the command line does not scale.  Two things are needed and helpful:

1. allow the CLI command run-diff to accept a list of input files via another method
2. a tool to help craft lists of input files from S3 based on some criteria

How this addresses that need:
* A new CLI command 'timdex-sources-csv' is added that generates a list of input
files based on the S3 bucket set by env var 'TIMDEX_BUCKET'.  This utilizes the
helper methods added in a former commit.  The default, for all sources, is to list
extract files going back to the last full run, then every subsequent daily run
(see the sketch after these bullets).  This can also be limited to a subset of sources.

* Updates CLI command run-diff to allow the -i / --input-files argument to
accept a local CSV file, which is parsed and used as the input files list.
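
To make the default selection concrete, the "since last full run" logic is essentially the following (a sketch only; _is_full_extract, _is_daily_extract, and _extract_date are the private filename helpers discussed in review, assumed given here):

def files_since_last_full_run(files: list[str]) -> list[str]:
    """Return the most recent 'full' extract files, followed by every
    'daily' extract file dated after them."""
    # _is_full_extract / _is_daily_extract / _extract_date are assumed helpers
    full_extracts = [f for f in files if _is_full_extract(f)]
    most_recent_full_date = max(_extract_date(f) for f in full_extracts)
    ordered = sorted(
        f for f in full_extracts if _extract_date(f) == most_recent_full_date
    )
    ordered += sorted(
        f
        for f in files
        if _is_daily_extract(f) and _extract_date(f) > most_recent_full_date
    )
    return ordered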

Side effects of this change:
* None really, except being able to more easily support runs with large numbers
of input files.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-379
@ghukill marked this pull request as ready for review November 1, 2024 14:20
@ehanson8 (Contributor) left a comment:

2 optional suggestions but otherwise approved. I would caution that the helpers vs utils line may get fuzzy at times. I get not wanting to create über modules and I see the distinction you're making between essential vs. convenience but I think it forces a decision on where to put future global functions that may not always be clear. But not asking for a change, just raising the issue 🙂

Contributor:

Totally optional: I would argue that inverting the current order of the non-private functions in this module would be more logical, since each is called by the one below it. In other repos, we've generally had the most expansive function, like get_ordered_extracted_files_all_sources, up at the top.

Contributor:

I agree that a reordering would make the flow easier to follow. 🤔 I think we've had some conversations about the order of functions/methods (and can't recall where we landed), but I would expect:

get_ordered_extracted_files_all_sources # kinda' an "entrypoint" / main function called by CLI
get_ordered_extracted_files_since_last_full_run
get_extracted_files_for_source
<everything else>

Contributor Author (@ghukill):

I take your points, and agree that often the "most expensive" methods tend to drift towards the top. However, I also think that organizing methods in the order a user or application would use them makes sense. Like you might see a database class with an order like:

class DBClient:
    def connect(self): ...
    def close(self): ...
    def query(self): ...

...where query() is arguably the most expensive, but conceptually also comes after managing connections.

Counter-point for the file in question: the functions read sequentially in order of granularity:

  1. list files
  2. list files, for a single source
  3. list files, for a single source, do so for all sources

And small, private functions like _is_full_extract() follow the first function that uses them.

I think of it almost like a file that contains math functions, where a logical ordering would be:

def add():
def subtract():
def multiply():
def divide():
def exponent():
...

I think this file is unique in that it likely won't change, and anything more complex will build on these functions and thus slot in at the bottom.

But, happy to reorder here, and will do so! Just wanted to share these thoughts for future PRs: there might be more nuance and context than just "biggest and most expensive at the top."

WEBAPP_PORT=# port for flask webapp
TRANSMOGRIFIER_MAX_WORKERS=# max number of Transmogrifier containers to run in parallel; default is 6
TRANSMOGRIFIER_TIMEOUT=# timeout for a single Transmogrifier container; default is 5 hours
TIMDEX_BUCKET=# when using CLI command 'timdex-sources-csv', this is required to know what TIMDEX bucket to use
Contributor:

Potential future enhancement: couldn't this env var be relevant for other CLI commands? Once we can get Transmogrifier sped up, I would assume diffs on all sources in a bucket would be more common than diffs on individual S3 files.

Contributor Author (@ghukill):

Not sure I totally follow...

For running diffs we currently pass full S3 URIs, which do contain the TIMDEX bucket. This is required now for the TIMDEX sources extra functions, which need to know what bucket to scan.

Are you suggesting that maybe in the future, we'll just pass the TIMDEX bucket, and ABDiff would just run it against everything? If so -- I love the audacity -- but I'd suggest we update the docstring when that's more feasible, and more specifically when the run-diff argument -i / --input-files is no longer required.

@ghukill (Contributor Author) commented Nov 1, 2024

> 2 optional suggestions but otherwise approved. I would caution that the helpers vs utils line may get fuzzy at times. I get not wanting to create über modules and I see the distinction you're making between essential vs. convenience but I think it forces a decision on where to put future global functions that may not always be clear. But not asking for a change, just raising the issue 🙂

Definitely take your point, and I was not terribly happy with helpers. I'm glad you mentioned this, as I just interacted with a Python library where I like a convention they use: extras. It's the psycopg2 Postgres DB driver library, and extras are bits and bobs that have nothing to do with core functionality, but are nice little add-ons that help you work with the library in different contexts. Like here, where it helps you create artifacts for running diffs... but not the diffs themselves.

I would think then that:

  • utils --> implies utility functions and code used by the core application
  • extras --> functions, wrappers, data, etc. that help in specific conditions or contexts

An example of an extra could be something like parallel programming support. It may wrap the entire application in some kind of fancy parallel programming paradigm, but it's absolutely not required and not part of core functionality.

Going to rename to extras and will re-request review, at which point please push back if it's a worse name!

@ghukill requested a review from ehanson8 November 1, 2024 20:26
@ghukill (Contributor Author) commented Nov 1, 2024

@ehanson8, @jonavellecuerdo - function reordering and helpers to extras rename here: 5701e9b

@jonavellecuerdo (Contributor) left a comment:

@ghukill Two small comments but looking good to me! I also ran the CLI function and it works as expected -- quick, too. :D

return [file for file in files if "extracted" in file]


def get_ordered_extracted_files_since_last_full_run(source: str) -> list[str]:
Contributor:

Hmm, what do you think about updating utils.parse_timdex_filename to accept an optional second argument: a list of strings denoting filename detail components (e.g. "date", "run-type")? If the second arg is provided, the function returns only the specified components. You could then update the "private" methods _is_full_extract, _is_daily_extract, and _extract_date to call the util. 🤔
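
Roughly something like this (a hypothetical sketch; it assumes the utility currently returns a dict of all parsed filename components, and _parse_all_components stands in for that existing parse):

def parse_timdex_filename(
    filename: str, components: list[str] | None = None
) -> dict[str, str]:
    """Parse a TIMDEX filename into its detail components; if
    'components' is provided, return only those keys."""
    parsed = _parse_all_components(filename)  # hypothetical full parse
    if components:
        return {key: parsed[key] for key in components}
    return parsed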

Contributor Author (@ghukill):

Interesting...

While conceptually I like it, I think I might lean into this code being "extra" in the sense that it shouldn't apply too much pressure for change on the "core" code. If we had reason to extract single values from the filenames in "core" code, that'd be one thing... but I think we primarily want at least 3-4 items each time we use it.

Thoughts?

Contributor:

Ah, I think it's fine as-is! I think considering the parse_timdex_filename utility as part of the "core" set of functions makes sense, too.

Comment on lines 53 to 56
# Collect all full extract files with the most recent date
most_recent_full_files = sorted(
f for f in full_extracts if _extract_date(f) == most_recent_full_date
)
Contributor:

Question: most_recent_full_files is plural because of the possibility of sequenced files (i.e., files ending with an index like "01"), right? Is there ever a scenario where we'd have: two "full" extract files for the same source on the same date (not sequenced)? 🤔

Contributor Author (@ghukill):

Exactly. It used to be most_recent_full_file!

@coveralls

Pull Request Test Coverage Report for Build 11635874202

Details

  • 27 of 67 (40.3%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-4.9%) to 91.207%

Changes missing coverage:

  File                              Covered Lines   Changed/Added Lines   %
  abdiff/config.py                  2               3                     66.67%
  abdiff/cli.py                     12              20                    60.0%
  abdiff/extras/timdex_sources.py   13              44                    29.55%

Totals:
  Change from base Build 11630326869: -4.9%
  Covered Lines: 695
  Relevant Lines: 762

💛 - Coveralls

@ehanson8 (Contributor) left a comment:

extras is fine by me

@ghukill merged commit fd3a0c8 into main Nov 1, 2024 (2 checks passed)
@ghukill deleted the TIMX-379-input-files-options-and-helpers branch November 5, 2024 14:21