TIMX 371 - dedupe records #43
Conversation
Force-pushed from d944b6b to 76ec3b8
Pull Request Test Coverage Report for Build 11629970056 (Details)
💛 - Coveralls
Why these changes are being introduced:

Ideally, we could provide files to ABDiff in the way they accumulate in S3, where full runs are followed by dailies, sometimes there are deletes, etc. But in doing so, we should end up with only the most recent, non-deleted version of each record in the final dataset to analyze.

How this addresses that need:
* Adds a step in collate_ab_transforms to dedupe records based on timdex_record_id, run_date, run_type, and action

Side effects of this change:
* Analysis will only contain the most recent version of a record, even if the record is duplicated among the input files.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-371
Force-pushed from 76ec3b8 to 2b03153
def fetch_single_value(query: str) -> int:
    result = con.execute(query).fetchone()
    if result is None:
        raise RuntimeError(f"Query returned no results: {query}")  # pragma: nocover
    return int(result[0])
I think it's possible this could even be a more global abdiff.core.utils helper function... but this felt acceptable for the time being.
Using fetchone()[0] certainly makes sense logically, but type hinting doesn't like it. This was an attempt to remove a handful of typing and ruff ignores.
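For illustration, here is a minimal sketch of the difference (assuming duckdb's type stubs mark fetchone() as optional; the exact mypy error code may differ):

import duckdb

con = duckdb.connect()

# direct indexing works at runtime, but fetchone() is typed as possibly
# returning None, so type checkers flag the subscript without an ignore
count = con.execute("SELECT count(*) FROM range(5)").fetchone()[0]  # type: ignore[index]

# the helper narrows away the None case explicitly, so no ignore is needed
def fetch_single_value(query: str) -> int:
    result = con.execute(query).fetchone()
    if result is None:
        raise RuntimeError(f"Query returned no results: {query}")
    return int(result[0])

assert count == fetch_single_value("SELECT count(*) FROM range(5)") == 5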
ehanson8
left a comment
Looks good but one suggestion
if non_unique_count > 0:
    raise OutputValidationError(
        "The collated dataset contains duplicate 'timdex_record_id' records."
    )
Unclear if this is too much to ask (please push back if so), but it seems it would be helpful to know which timdex_record_id is duplicated
I hear ya, but I would push back, for reasons of scale. If this were not working correctly, it's conceivable that tens, hundreds, or thousands of records could be duplicated. I would posit it's sufficient to know that if any records are duplicated, something is intrinsically wrong with the deduplication logic; it's the logic that needs attention, not a specific record.
Counter-point: we could show a sample of, say, 10 records, and then while debugging you could look for those? But... I suppose my preference would be to skip that for now, unless we have trouble with this in the future.
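(For reference, if we ever do want that, a sketch along these lines could surface a small sample of offending IDs; the parquet path below is illustrative, not the project's actual layout.)

import duckdb

# hypothetical check: surface up to 10 duplicated timdex_record_id values
# from the collated dataset (the parquet glob below is illustrative)
con = duckdb.connect()
sample = con.execute(
    """
    SELECT timdex_record_id, count(*) AS versions
    FROM read_parquet('output/jobs/example-job/collated/**/*.parquet')
    GROUP BY timdex_record_id
    HAVING count(*) > 1
    LIMIT 10
    """
).fetchall()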
Fine with me, the scale argument makes perfect sense! Agree there's no need to add unless we find it to be a problem
Why these changes are being introduced:

It was overlooked that Transmogrifier writes a text file of records to delete as part of its output, and that this needs to be accounted for when collating and deduping records. For records where a delete was the last action, they should be removed from the dataset entirely.

How this addresses that need:
* Updates opinionated logic that assumed a .json extension
* Updates run_ab_transforms validation to look for output files that indicate Transmogrifier produced something as output

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-371
@jonavellecuerdo, @ehanson8 - an additional commit has been added that accounts for TXT files that Transmogrifier may produce if deleted records are present: ad9e732. If you would like to confirm this, you can perform this test run that includes a delete file from Alma:

pipenv run abdiff --verbose run-diff \
  -d output/jobs/dupes \
  -i s3://timdex-extract-prod-300442551476/alma/alma-2024-04-10-daily-extracted-records-to-index.xml,s3://timdex-extract-prod-300442551476/alma/alma-2024-04-11-daily-extracted-records-to-delete.xml

The run should complete successfully, and no records from the TXT files with deleted records should be present. This is somewhat difficult to quickly confirm, but I have done so locally. The various stages in the collate process now find the deleted records via the TXT files and incorporate that into the logic about whether to include each record. Apologies again for this late add!
| "transformed_file_name": transformed_file.split("/")[-1], | ||
| **base_record, | ||
| "timdex_record_id": row[1], | ||
| "record": None, |
This is None, because we don't care about a deleted record's actual record body! If it's the last instance of that record in the run, then it will be removed entirely. Otherwise, the more recent version, which will have a record, will be utilized.
# handle TXT files with records to delete
else:
    deleted_records_df = pd.read_csv(transformed_file, header=None)
Unlike a JSON file that we can iterate over, here we just have a CSV of record IDs, so we're using pandas to quickly parse and loop through those values.
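Roughly, the handling looks like this sketch (the filename and base_record values are illustrative; the real code derives them from the parsed output filename, as in the base_record snippet just below):

import pandas as pd

# illustrative values; in the real code these come from the parsed output filename
base_record = {
    "source": "alma",
    "run_date": "2024-04-11",
    "run_type": "daily",
    "action": "delete",
}

# the TXT file is read as a headerless, single-column CSV of record IDs
deleted_records_df = pd.read_csv(
    "alma-2024-04-11-daily-transformed-records-to-delete.txt", header=None
)

# each ID becomes a placeholder row with no record body
delete_records = [
    {**base_record, "timdex_record_id": row[1], "record": None}
    for row in deleted_records_df.itertuples()
]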
base_record = {
    "source": filename_details["source"],
    "run_date": filename_details["run-date"],
    "run_type": filename_details["run-type"],
    "action": filename_details["action"],
    "version": version,
    "transformed_file_name": transformed_file.split("/")[-1],
}
These fields in the collated dataset are shared between JSON and TXT files, so they are broken out here.
ehanson8
left a comment
Good addition of processing the TXT files, no concerns!
if (
    file_parts["source"] in version_file  # type: ignore[operator]
    and file_parts["run-date"] in version_file  # type: ignore[operator]
    and file_parts["run-type"] in version_file  # type: ignore[operator]
    and (not file_parts["index"] or file_parts["index"] in version_file)
):
This approach avoids finicky regex for checking whether an input file has corresponding artifacts in the output files.
For example, we'd want to know that alma-2024-10-02-full exists in at least one of the A/B output files. But... for alma it could have an index suffix like ..._01.xml, so we'd need to confirm that as well.
If we think of each output filename as containing nuggets of information like source, run-date, run-type, index, etc., then it makes sense that we can look for those pieces of information independently, but require them all in the same filename.
This is obviously kind of a loopy, naive approach, but the scale makes it inconsequential; we're looking at a max of 2-3k input files against a max of 4-6k output files, making this a 1-2 second check at most.
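As a stripped-down illustration of that idea (the filenames and parts below are made up for the example):

# hypothetical input file parts and A/B output filenames
input_file_parts = {
    "source": "alma",
    "run-date": "2024-10-02",
    "run-type": "full",
    "index": None,
}
output_files = [
    "alma-2024-10-02-full-transformed-records-to-index_01.json",
    "alma-2024-10-02-full-transformed-records-to-index_02.json",
]

# each piece of information must appear somewhere in the same output filename
matched = any(
    input_file_parts["source"] in output_file
    and input_file_parts["run-date"] in output_file
    and input_file_parts["run-type"] in output_file
    and (not input_file_parts["index"] or input_file_parts["index"] in output_file)
    for output_file in output_files
)
assert matched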
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking good! Just one question and a small change request.
@ghukill @ehanson8 Just thought I'd share a spike(ish) Confluence document I worked on to assist my review and understanding of the work in this PR. I was curious as to (a) how duplicates end up in the collated dataset and (b) the effect of "deleted" text files generated by Transmogrifier.
Purpose and background context
This PR dedupes records during the collate step, to ensure only a single version of a timdex_record_id is present in the final dataset for analysis: more specifically, the most recent and non-deleted record. See the mocked list of records that may have come from the transform step, where records were duplicated across input files.

Now see this test, which asserts that if a record's most recent action='delete', then we exclude it entirely.

Why limit ourselves to a single record in ABDiff? For some sources, it's not uncommon to see 50-100 versions of the file over a given timespan. All of these different forms would have their diffs tallied individually, incorrectly suggesting that the title field (as an arbitrary example) was modified for 75 records, when in fact it was one record.

It's not critical that we get the most recent and non-deleted version, but it's most true to the state of TIMDEX / S3, and it allows us to pass unlimited input files, knowing we get only the version of the record present in TIMDEX.
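As a rough sketch of that rule (not the project's actual implementation; a simplified DuckDB query over a toy table, ordering versions only by run_date):

import duckdb

# keep the most recent version of each timdex_record_id, then drop the
# record entirely if that most recent version was a delete
con = duckdb.connect()
con.execute(
    """
    CREATE TABLE records (
        timdex_record_id VARCHAR,
        run_date DATE,
        action VARCHAR,
        record VARCHAR
    )
    """
)
con.execute(
    """
    INSERT INTO records VALUES
        ('alma:1', DATE '2024-10-02', 'index', '{"title": "v1"}'),
        ('alma:1', DATE '2024-10-03', 'index', '{"title": "v2"}'),
        ('alma:2', DATE '2024-10-02', 'index', '{"title": "x"}'),
        ('alma:2', DATE '2024-10-03', 'delete', NULL)
    """
)
deduped = con.execute(
    """
    SELECT timdex_record_id, record
    FROM (
        SELECT *, row_number() OVER (
            PARTITION BY timdex_record_id ORDER BY run_date DESC
        ) AS rn
        FROM records
    ) AS ranked
    WHERE rn = 1 AND action != 'delete'
    """
).fetchall()
assert deduped == [("alma:1", '{"title": "v2"}')]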
NOTE: this also includes a small commit to expect and mock transformed files as always having a .json file extension.

How can a reviewer manually see the effects of these changes?
1- Set production AWS credentials in .env

2- Create a new job:
3- Run diff with ~62 input files (NOTING AGAIN: very long CLI command string. See TIMX-379 for possible improvements):
With this dataset created, enter DuckDB via this shell command:
Then run the following query and confirm that no duplicate records exist:
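The query itself is not reproduced in this excerpt, but a check along these lines captures the idea (shown via the DuckDB Python client rather than the CLI shell; the parquet path is an assumption):

import duckdb

# illustrative check: compare total rows against distinct timdex_record_id
# values; the two counts should match if no duplicates remain
con = duckdb.connect()
total, distinct = con.execute(
    """
    SELECT count(*), count(DISTINCT timdex_record_id)
    FROM read_parquet('output/jobs/example-job/collated/**/*.parquet')
    """
).fetchone()
assert total == distinct, f"{total - distinct} duplicate record id(s) found"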
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)