
Conversation

@ghukill ghukill commented Oct 29, 2024

Purpose and background context

This PR dedupes records during the collate step, ensuring that only a single version of each timdex_record_id is present in the final dataset for analysis: specifically, the most recent, non-deleted version of the record.

See the mocked list of records that may have come from the transform step, where records were duplicated across input files.

Now see this test which asserts:

  • we always get the most recent version of a record
  • if the most recent version of a record was action='delete', then we exclude it entirely

Why limit ourselves to a single record in ABDiff? For some sources, it's not uncommon to see 50-100 versions of a record over a given timespan. If all of these different forms had their diffs tallied individually, it would incorrectly suggest that the title field (as an arbitrary example) was modified for 75 records, when in fact it was one record.

It's not critical that we use the most recent, non-deleted version, but it's most true to the state of TIMDEX / S3, and it allows us to pass unlimited input files, knowing we end up with only the version of each record actually present in TIMDEX.
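
The dedupe logic described above can be sketched in plain Python; this is a simplified illustration with made-up sample records, not the actual collate implementation (which operates on the parquet dataset):

```python
# Given transformed records from multiple input files, keep only the most
# recent version of each timdex_record_id, then drop any record whose most
# recent action was a delete.
records = [
    {"timdex_record_id": "libguides:1", "run_date": "2024-08-15", "action": "index", "title": "Old"},
    {"timdex_record_id": "libguides:1", "run_date": "2024-09-01", "action": "index", "title": "New"},
    {"timdex_record_id": "libguides:2", "run_date": "2024-08-15", "action": "index", "title": "Keep"},
    {"timdex_record_id": "libguides:2", "run_date": "2024-09-01", "action": "delete", "title": None},
]

latest: dict[str, dict] = {}
for record in sorted(records, key=lambda r: r["run_date"]):
    # later run_dates overwrite earlier versions of the same record
    latest[record["timdex_record_id"]] = record

# exclude records whose final action was a delete
deduped = [r for r in latest.values() if r["action"] != "delete"]
```

Here libguides:1 survives with its newer title, and libguides:2 is excluded entirely because its last action was a delete.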

NOTE: this also includes a small commit to expect and mock transformed files as always having a .json file extension.

How can a reviewer manually see the effects of these changes?

1- Set production AWS credentials in .env

2- Create a new job:

pipenv run abdiff --verbose init-job \
-d output/jobs/dupes \
-a 008e20c -b 395e612

3- Run diff with ~62 input files (NOTING AGAIN: very long CLI command string. See TIMX-379 for possible improvements):

pipenv run abdiff --verbose run-diff \
-d output/jobs/dupes \
-i s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-15-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-16-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-17-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-19-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-20-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-21-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-22-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-23-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-24-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-25-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-26-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-27-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-28-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-29-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-30-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-08-31-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-03-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-04-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-05-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-06-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-07-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-08-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-09-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-10-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-11-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-12-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-13-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-14-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-16-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-17-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-18-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-19-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-20-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-21-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-22-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-23-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-24-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-25-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-26-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-27-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-28-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-29-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-09-30-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-01-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-02-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-03-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-04-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-05-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-07-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-08-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-09-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-10-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-11-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-12-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-13-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-15-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-16-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-17-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-18-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-19-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-21-daily-extracted-records-to-index.xml,
s3://timdex-extract-prod-300442551476/libguides/libguides-2024-10-22-daily-extracted-records-to-index.xml

With this dataset created, enter DuckDB via this shell command:

duckdb

Then run the following query and confirm that no duplicate records exist:

-- create view to work with (changing the run_directory to yours with your timestamp)
create view collated as
select * from 
read_parquet(
    'output/jobs/dupes/runs/2024-10-29_17-51-50/collated/**/*.parquet',
    hive_partitioning=true
);

-- run query
select timdex_record_id
from collated
group by timdex_record_id
having count(timdex_record_id) > 1;

/*
┌──────────────────┐
│ timdex_record_id │
│     varchar      │
├──────────────────┤
│      0 rows      │
└──────────────────┘
*/

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/TIMX-371

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@ghukill ghukill force-pushed the TIMX-371-dedupe-records branch from d944b6b to 76ec3b8 on October 29, 2024 17:49

coveralls commented Oct 29, 2024

Pull Request Test Coverage Report for Build 11629970056

Details

  • 46 of 48 (95.83%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 96.132%

Changes Missing Coverage:
  • abdiff/core/collate_ab_transforms.py: 33 of 35 changed/added lines covered (94.29%)

Files with Coverage Reduction:
  • abdiff/core/collate_ab_transforms.py: 1 new missed line (96.55%)

Totals:
  • Change from base Build 11579438090: -0.3%
  • Covered Lines: 671
  • Relevant Lines: 698

💛 - Coveralls

@ghukill ghukill marked this pull request as ready for review October 29, 2024 17:56
Why these changes are being introduced:

Ideally, we could provide files to ABDiff in the way they
accumulate in S3, where full runs are followed by dailies,
sometimes there are deletes, etc. In doing so, we would want
to end up with only the most recent, non-deleted version of
each record in the final dataset to analyze.

How this addresses that need:
* Adds step in collate_ab_transforms to dedupe records
based on timdex_record_id, run_date, run_type, and action

Side effects of this change:
* Analysis will only contain the most recent version of a record
even if the record is duplicated amongst input files.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-371
@ghukill ghukill force-pushed the TIMX-371-dedupe-records branch from 76ec3b8 to 2b03153 on October 29, 2024 18:02
Comment on lines +311 to +315
def fetch_single_value(query: str) -> int:
    result = con.execute(query).fetchone()
    if result is None:
        raise RuntimeError(f"Query returned no results: {query}")  # pragma: nocover
    return int(result[0])
Contributor Author

I think it's possible this could even be a more global abdiff.core.utils helper function... but this felt acceptable for the time being.

Using fetchone()[0] certainly makes sense logically, but type hinting doesn't like it. This was an attempt to remove a handful of typing and ruff ignores.
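
For context, the typing friction comes from DB-API cursors: fetchone() is typed as returning an optional row, so indexing the result directly trips type checkers. A minimal sqlite3 illustration of the same pattern (hypothetical table and column names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (n INTEGER)")
con.execute("INSERT INTO t VALUES (42)")

def fetch_single_value(query: str) -> int:
    # fetchone() returns Optional[tuple]; the explicit None check narrows
    # the type so result[0] is safe for mypy without ignore comments.
    result = con.execute(query).fetchone()
    if result is None:
        raise RuntimeError(f"Query returned no results: {query}")
    return int(result[0])

value = fetch_single_value("SELECT n FROM t")  # → 42
```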

Contributor

@ehanson8 ehanson8 left a comment

Looks good but one suggestion

Comment on lines +359 to +362
if non_unique_count > 0:
    raise OutputValidationError(
        "The collated dataset contains duplicate 'timdex_record_id' records."
    )
Contributor

Unclear if this is too much to ask (please push back if so), but it seems it would be helpful to know which timdex_record_id is duplicated

Contributor Author

I hear ya, but I would push back, for reasons of scale. If this were not working correctly, it's conceivable that tens, hundreds, or thousands of records could be duplicated. I would posit it's sufficient that if any records are duplicated, something is intrinsically wrong with the deduplication logic; it's that logic that needs attention, not a specific record.

Counter-point: we could show a sample of, say, 10 records, and then while debugging you could look for those. But I suppose my preference would be to skip that for now, unless we have trouble with this in the future.
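
If such a sample were ever wanted, something like this (hypothetical, not part of the PR) would surface the first few offending IDs cheaply:

```python
from collections import Counter

# Stand-in for the timdex_record_id column of the collated dataset.
ids = ["a", "b", "a", "c", "b", "a"]

# Tally occurrences, then report up to 10 IDs that appear more than once.
counts = Counter(ids)
sample = [rid for rid, n in counts.items() if n > 1][:10]
```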

Contributor

Fine with me, the scale argument makes perfect sense! Agree there's no need to add unless we find it to be a problem

Why these changes are being introduced:

It was overlooked that Transmogrifier will write a text file with
records to delete as part of its output, and that this needed to be
captured when collating and deduping records. In cases where a
record's last action was a delete, the record should be removed
from the dataset entirely.

How this addresses that need:
* Updates places where a .json extension was assumed
* Updates run_ab_transforms validation to look for output files
that indicate Transmogrifier produced something as output

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-371
Contributor Author

ghukill commented Oct 30, 2024

@jonavellecuerdo , @ehanson8 - an additional commit has been added that accounts for TXT files that Transmogrifier may produce if deleted records are present: ad9e732.

If you would like to confirm this, you can perform this test run that includes a delete file from Alma:

pipenv run abdiff --verbose run-diff \
-d output/jobs/dupes \
-i s3://timdex-extract-prod-300442551476/alma/alma-2024-04-10-daily-extracted-records-to-index.xml,s3://timdex-extract-prod-300442551476/alma/alma-2024-04-11-daily-extracted-records-to-delete.xml

The run should complete successfully, and no records listed in the TXT delete files should be present in the final dataset. This is somewhat difficult to confirm quickly, but I have done so locally. The various stages of the collate process now find the deleted records via the TXT files and incorporate that into the logic for whether to include a record or not.

Apologies again for this late add!

"transformed_file_name": transformed_file.split("/")[-1],
**base_record,
"timdex_record_id": row[1],
"record": None,
Contributor Author

@ghukill ghukill Oct 30, 2024

This is None, because we don't care about a deleted record's actual record body! If it's the last instance of that record in the run, then it will be removed entirely. Otherwise, the more recent version, which will have a record, will be utilized.


# handle TXT files with records to delete
else:
    deleted_records_df = pd.read_csv(transformed_file, header=None)
Contributor Author

Unlike a JSON file we can iterate over, here we just have a CSV of record IDs, so we use pandas to quickly parse and loop through those values.
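
The same parse can be sketched with just the stdlib, assuming the TXT file is one record ID per line with no header (the PR itself uses pandas):

```python
import csv
import io

# Stand-in for a Transmogrifier "records-to-delete" TXT file on disk.
transformed_file = io.StringIO("alma:123\nalma:456\n")

deleted_ids = [row[0] for row in csv.reader(transformed_file) if row]

# Each ID becomes a dataset row with action="delete" and no record body,
# so the dedupe step can drop the record if the delete was its last action.
delete_rows = [
    {"timdex_record_id": rid, "action": "delete", "record": None}
    for rid in deleted_ids
]
```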

Comment on lines +124 to +131
base_record = {
    "source": filename_details["source"],
    "run_date": filename_details["run-date"],
    "run_type": filename_details["run-type"],
    "action": filename_details["action"],
    "version": version,
    "transformed_file_name": transformed_file.split("/")[-1],
}
Contributor Author

@ghukill ghukill Oct 30, 2024

These fields in the collated dataset are shared between JSON and TXT files, so broken out here.

Contributor

@ehanson8 ehanson8 left a comment

Good addition of processing the TXT files, no concerns!

Comment on lines +314 to +319
if (
    file_parts["source"] in version_file  # type: ignore[operator]
    and file_parts["run-date"] in version_file  # type: ignore[operator]
    and file_parts["run-type"] in version_file  # type: ignore[operator]
    and (not file_parts["index"] or file_parts["index"] in version_file)
):
Contributor Author

@ghukill ghukill Oct 31, 2024

This approach avoids finicky regex for checking whether an input file has artifacts in the output files.

For example, we'd want to know that alma-2024-10-02-full exists in at least one file in the A/B output files. But... for alma it could have an index underscore like ..._01.xml so we'd need to confirm that as well.

If we think of each output filename as containing nuggets of information like source, run-date, run-type, index, etc., then it kind of makes sense that we could look for pieces of information independently, but required in the same filename.

This is obviously kind of a loopy, naive approach to doing this, but the scale of this makes it inconsequential; we're looking at max 2-3k input files, against max 4-6k output files, making this a 1-2 second check tops.
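
The approach described above amounts to a nested membership scan; a minimal sketch with simplified names (not the exact implementation):

```python
def input_file_has_output(file_parts: dict, output_files: list[str]) -> bool:
    """Return True if any output filename contains all of the input
    filename's nuggets of information (source, run-date, run-type, index)."""
    return any(
        file_parts["source"] in name
        and file_parts["run-date"] in name
        and file_parts["run-type"] in name
        # index is optional; when present it must also appear in the name
        and (not file_parts.get("index") or file_parts["index"] in name)
        for name in output_files
    )

# Hypothetical example filenames:
parts = {"source": "alma", "run-date": "2024-10-02", "run-type": "full", "index": "_01"}
outputs = ["alma-2024-10-02-full-transformed-records-to-index_01.json"]
```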

Contributor

@jonavellecuerdo jonavellecuerdo left a comment

I think this is looking good! Just one question and a small change request.

@jonavellecuerdo
Contributor

@ghukill @ehanson8 Just thought I'd share a spike(ish) Confluence document I worked on to assist my review and understanding of the work in this PR. I was curious as to (a) how duplicates end up in the collated dataset and (b) the effect of "deleted" text files generated by transmogrifier, and I wanted to get a better understanding of how the data from "transformed files" (which I now understand to be JSON files containing TIMDEX records and text files containing TIMDEX record IDs marked as deleted) flows from run_ab_transforms to collate_ab_transforms. Here is the document: [spike] Transmog A/B Diff: Deduping records when collating datasets. After going through this walkthrough, I developed a better understanding of why the proposed updates in this PR work. 🎉

@ghukill ghukill merged commit a4b1a13 into main Nov 1, 2024
2 checks passed
@ghukill ghukill deleted the TIMX-371-dedupe-records branch November 5, 2024 14:21