TIMX 371 - dedupe records #43
Conversation
Force-pushed from d944b6b to 76ec3b8
Pull Request Test Coverage Report for Build 11629970056 (Details)
💛 - Coveralls
Why these changes are being introduced:

Ideally, we could provide files to ABDiff in the way they accumulate in S3, where full runs are followed by dailies, sometimes there are deletes, etc. But in doing so, we should end up with only the most recent, non-deleted version of each record in the final dataset to analyze.

How this addresses that need:
* Adds a step in collate_ab_transforms to dedupe records based on timdex_record_id, run_date, run_type, and action

Side effects of this change:
* Analysis will only contain the most recent version of a record, even if the record is duplicated among the input files.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-371
Force-pushed from 76ec3b8 to 2b03153
def fetch_single_value(query: str) -> int:
    result = con.execute(query).fetchone()
    if result is None:
        raise RuntimeError(f"Query returned no results: {query}")  # pragma: nocover
    return int(result[0])
I think it's possible this could even be a more global abdiff.core.utils helper function... but this felt acceptable for the time being.
Using fetchone()[0] certainly makes sense logically, but type hinting doesn't like it. This was an attempt to remove a handful of typing and ruff ignores.
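For illustration, here is a minimal sketch of the difference (assuming duckdb's type stubs mark fetchone() as optional; the exact mypy error code may differ):

import duckdb

con = duckdb.connect()

# direct indexing works at runtime, but fetchone() is typed as possibly
# returning None, so type checkers flag the subscript without an ignore
count = con.execute("SELECT count(*) FROM range(5)").fetchone()[0]  # type: ignore[index]

# the helper narrows away the None case explicitly, so no ignore is needed
def fetch_single_value(query: str) -> int:
    result = con.execute(query).fetchone()
    if result is None:
        raise RuntimeError(f"Query returned no results: {query}")
    return int(result[0])

assert count == fetch_single_value("SELECT count(*) FROM range(5)") == 5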
ehanson8
left a comment
Looks good but one suggestion
if non_unique_count > 0:
    raise OutputValidationError(
        "The collated dataset contains duplicate 'timdex_record_id' records."
    )
Unclear if this is too much to ask (please push back if so), but it seems it would be helpful to know which timdex_record_id is duplicated
I hear ya, but I would push back, for reasons of scale. If this were not working correctly, it's conceivable that tens, hundreds, or thousands of records could be duplicated. I would posit it's sufficient to know that if any records are duplicated, something is intrinsically wrong with the deduplication logic; it's the logic that needs attention, not a specific record.
Counter-point: we could show a sample of, say, 10 records, and then while debugging you could look for those? But... I suppose my preference would be to skip that for now, unless we have trouble with this in the future.
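(For reference, if we ever do want that, a sketch along these lines could surface a small sample of offending IDs; the parquet path below is illustrative, not the project's actual layout.)

import duckdb

# hypothetical check: surface up to 10 duplicated timdex_record_id values
# from the collated dataset (the parquet glob below is illustrative)
con = duckdb.connect()
sample = con.execute(
    """
    SELECT timdex_record_id, count(*) AS versions
    FROM read_parquet('output/jobs/example-job/collated/**/*.parquet')
    GROUP BY timdex_record_id
    HAVING count(*) > 1
    LIMIT 10
    """
).fetchall()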
Fine with me, the scale argument makes perfect sense! Agree there's no need to add unless we find it to be a problem
Why these changes are being introduced:

It was overlooked that Transmogrifier writes a text file of records to delete as part of its output, and that this needs to be accounted for when collating and deduping records. For records where a delete was the last action, they should be removed from the dataset entirely.

How this addresses that need:
* Updates opinionated logic that assumed a .json extension
* Updates run_ab_transforms validation to look for output files that indicate Transmogrifier produced something as output

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-371
@jonavellecuerdo, @ehanson8 - an additional commit has been added that accounts for TXT files that Transmogrifier may produce if deleted records are present: ad9e732. If you would like to confirm this, you can perform this test run that includes a delete file from Alma:

pipenv run abdiff --verbose run-diff \
  -d output/jobs/dupes \
  -i s3://timdex-extract-prod-300442551476/alma/alma-2024-04-10-daily-extracted-records-to-index.xml,s3://timdex-extract-prod-300442551476/alma/alma-2024-04-11-daily-extracted-records-to-delete.xml

The run should complete successfully, and no records from the TXT files with deleted records should be present. This is somewhat difficult to quickly confirm, but I have done so locally. The various stages in the collate process now find the deleted records via the TXT files and incorporate that into the logic about whether to include each record. Apologies again for this late add!
| "transformed_file_name": transformed_file.split("/")[-1], | ||
| **base_record, | ||
| "timdex_record_id": row[1], | ||
| "record": None, |
This is None, because we don't care about a deleted record's actual record body! If it's the last instance of that record in the run, then it will be removed entirely. Otherwise, the more recent version, which will have a record, will be utilized.
# handle TXT files with records to delete
else:
    deleted_records_df = pd.read_csv(transformed_file, header=None)
Unlike a JSON file that we can iterate over, here we just have a CSV of record IDs, so we're using pandas to quickly parse and loop through those values.
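Roughly, the handling looks like this sketch (the filename and base_record values are illustrative; the real code derives them from the parsed output filename, as in the base_record snippet just below):

import pandas as pd

# illustrative values; in the real code these come from the parsed output filename
base_record = {
    "source": "alma",
    "run_date": "2024-04-11",
    "run_type": "daily",
    "action": "delete",
}

# the TXT file is read as a headerless, single-column CSV of record IDs
deleted_records_df = pd.read_csv(
    "alma-2024-04-11-daily-transformed-records-to-delete.txt", header=None
)

# each ID becomes a placeholder row with no record body
delete_records = [
    {**base_record, "timdex_record_id": row[1], "record": None}
    for row in deleted_records_df.itertuples()
]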
base_record = {
    "source": filename_details["source"],
    "run_date": filename_details["run-date"],
    "run_type": filename_details["run-type"],
    "action": filename_details["action"],
    "version": version,
    "transformed_file_name": transformed_file.split("/")[-1],
}
These fields in the collated dataset are shared between JSON and TXT files, so they are broken out here.
ehanson8
left a comment
Good addition of processing the TXT files, no concerns!
if (
    file_parts["source"] in version_file  # type: ignore[operator]
    and file_parts["run-date"] in version_file  # type: ignore[operator]
    and file_parts["run-type"] in version_file  # type: ignore[operator]
    and (not file_parts["index"] or file_parts["index"] in version_file)
):
This approach avoids finicky regex for checking whether an input file has corresponding artifacts in the output files.
For example, we'd want to know that alma-2024-10-02-full exists in at least one of the A/B output files. But... for alma it could have an index suffix like ..._01.xml, so we'd need to confirm that as well.
If we think of each output filename as containing nuggets of information like source, run-date, run-type, index, etc., then it makes sense that we can look for those pieces of information independently, but require them all in the same filename.
This is obviously kind of a loopy, naive approach, but the scale makes it inconsequential; we're looking at a max of 2-3k input files against a max of 4-6k output files, making this a 1-2 second check at most.
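As a stripped-down illustration of that idea (the filenames and parts below are made up for the example):

# hypothetical input file parts and A/B output filenames
input_file_parts = {
    "source": "alma",
    "run-date": "2024-10-02",
    "run-type": "full",
    "index": None,
}
output_files = [
    "alma-2024-10-02-full-transformed-records-to-index_01.json",
    "alma-2024-10-02-full-transformed-records-to-index_02.json",
]

# each piece of information must appear somewhere in the same output filename
matched = any(
    input_file_parts["source"] in output_file
    and input_file_parts["run-date"] in output_file
    and input_file_parts["run-type"] in output_file
    and (not input_file_parts["index"] or input_file_parts["index"] in output_file)
    for output_file in output_files
)
assert matched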
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking good! Just one question and a small change request.
@ghukill @ehanson8 Just thought I'd share a spike(ish) Confluence document I worked on to assist my review and understanding of the work in this PR. I was curious as to (a) how duplicates end up in the collated dataset and (b) the effect of "deleted" text files generated by Transmogrifier.
Purpose and background context
This PR dedupes records during the collate step, to ensure only a single version of a timdex_record_id is present in the final dataset for analysis: more specifically, the most recent and non-deleted record. See the mocked list of records that may have come from the transform step, where records were duplicated across input files.

Now see this test, which asserts that if a record's most recent action='delete', then we exclude it entirely.

Why limit ourselves to a single record in ABDiff? For some sources, it's not uncommon to see 50-100 versions of the file over a given timespan. All of these different forms would have their diffs tallied individually, incorrectly suggesting that the title field (as an arbitrary example) was modified for 75 records, when in fact it was one record.

It's not critical that we get the most recent and non-deleted version, but it's most true to the state of TIMDEX / S3, and it allows us to pass unlimited input files, knowing we get only the version of the record present in TIMDEX.
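As a rough sketch of that rule (not the project's actual implementation; a simplified DuckDB query over a toy table, ordering versions only by run_date):

import duckdb

# keep the most recent version of each timdex_record_id, then drop the
# record entirely if that most recent version was a delete
con = duckdb.connect()
con.execute(
    """
    CREATE TABLE records (
        timdex_record_id VARCHAR,
        run_date DATE,
        action VARCHAR,
        record VARCHAR
    )
    """
)
con.execute(
    """
    INSERT INTO records VALUES
        ('alma:1', DATE '2024-10-02', 'index', '{"title": "v1"}'),
        ('alma:1', DATE '2024-10-03', 'index', '{"title": "v2"}'),
        ('alma:2', DATE '2024-10-02', 'index', '{"title": "x"}'),
        ('alma:2', DATE '2024-10-03', 'delete', NULL)
    """
)
deduped = con.execute(
    """
    SELECT timdex_record_id, record
    FROM (
        SELECT *, row_number() OVER (
            PARTITION BY timdex_record_id ORDER BY run_date DESC
        ) AS rn
        FROM records
    ) AS ranked
    WHERE rn = 1 AND action != 'delete'
    """
).fetchall()
assert deduped == [("alma:1", '{"title": "v2"}')]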
NOTE: this also includes a small commit to expect and mock transformed files as always having a .json file extension.

How can a reviewer manually see the effects of these changes?
1- Set production AWS credentials in .env

2- Create a new job:
3- Run diff with ~62 input files (NOTING AGAIN: very long CLI command string. See TIMX-379 for possible improvements):
With this dataset created, enter DuckDB via this shell command:
Then run the following query and confirm that no duplicate records exist:
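The query itself is not reproduced in this excerpt, but a check along these lines captures the idea (shown via the DuckDB Python client rather than the CLI shell; the parquet path is an assumption):

import duckdb

# illustrative check: compare total rows against distinct timdex_record_id
# values; the two counts should match if no duplicates remain
con = duckdb.connect()
total, distinct = con.execute(
    """
    SELECT count(*), count(DISTINCT timdex_record_id)
    FROM read_parquet('output/jobs/example-job/collated/**/*.parquet')
    """
).fetchone()
assert total == distinct, f"{total - distinct} duplicate record id(s) found"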
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)