Skip to content

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented Nov 7, 2024

Purpose and background context

This PR reworks how to review individual records from a run. This is accomplished by removing the "Record Samples" section in the Run page, which pointed to a standalone, static HTML page of Record links, and replacing it with a single, filterable, browsable table of all records.

The previous approach was functional, but had some drawbacks. The "Samples" page was static HTML, and did not lend itself to large numbers of records. So, a "sample" of Records that matched the criteria was displayed, but this necessarily was not showing all records that matched that critiera. And, this approach had no inroads for viewing records with no diff, where it could still be helpful to look at the A/B versions, even if identical. The list goes on.

The approach taken here is a single table, capable of handling huge numbers of records. By layering on filtering and searching, it avoids the issue of thinking of combinations in advance of records to view. It also -- by default -- shows all records, allowing for a more browsing experience. The following is a screenshot of what it looks like:

Screenshot 2024-11-07 at 3 47 24 PM

The Javascript library DataTables is used for this table, which is highly configurable. The approach of "server-side" data preparation is used, so our Flask webapp is doing the heavy lifting of performing queries, sorting results, etc. The actual table then just makes lightweight requests to the flask data endpoint each time information is needed, or filters applied, sorting, etc. In this way, while the records dataset may contain 4-5m records, we'd only ever be sending 10-100 records (depending on "Records per Page" setting) to the table itself.

How can a reviewer manually see the effects of these changes?

1- Set production AWS credentials

2- Init job (where 4fcb617 is an fairly old commit that generates some diffs):

pipenv run abdiff --verbose init-job -d output/jobs/records -a 4fcb617 -b 3f77c7c

3- Generate a CSV of input files and perform a run:

pipenv run abdiff --verbose timdex-sources-csv -o output/jobs/records/springshare.csv -s libguides,researchdatabases
pipenv run abdiff --verbose run-diff -d output/jobs/records -i output/jobs/records/springshare.csv

4- View the job:

pipenv run abdiff --verbose view-job -d output/jobs/records

Here are some things to try with the "Records" table at the bottom of the run page. Recommended to click "Reset Filters" button between them (NOTE: the file counts are approximate, given changes in records depending on when run):

  • Without any filters, note records count is 1,290
    • click source libguides, note count goes down to 372
    • hold ctrl (or cmd?) and click researchdatabases, note count goes back up to 1,290 because effectively all records again
    • this is multi-entry selection, which works for fields as well
  • Click source researchdatabases + field publishers, note that zero records match this combo
  • Enter "Chemical Engineering" into the search box, note that two records show up
    • this was fulltext searching the actual A and B records, so not obvious where it matched, but clicking into either record will reveal it's somewhere in the record body
  • Enter "libguides:guides-176330" in search box, one result comes back, showing how it could be used for timdex_record_id as well, as it's part of the source record

Includes new or updated dependencies?

YES: but only in the webapp, the DataTables javascript and CSS files

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@ghukill ghukill force-pushed the TIMX-385-records-preview branch from 21e469c to 7b7c0e6 Compare November 7, 2024 21:23
@coveralls
Copy link

coveralls commented Nov 7, 2024

Pull Request Test Coverage Report for Build 11801095979

Details

  • 52 of 57 (91.23%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+1.4%) to 86.688%

Changes Missing Coverage Covered Lines Changed/Added Lines %
abdiff/webapp/app.py 5 10 50.0%
Files with Coverage Reduction New Missed Lines %
abdiff/webapp/app.py 1 88.73%
Totals Coverage Status
Change from base Build 11800813131: 1.4%
Covered Lines: 827
Relevant Lines: 954

💛 - Coveralls

@ghukill ghukill force-pushed the TIMX-385-records-preview branch from 7b7c0e6 to 6d28eb0 Compare November 12, 2024 15:54
@ghukill ghukill marked this pull request as ready for review November 12, 2024 15:59
Why these changes are being introduced:

One affordance of the webapp is viewing individual records from
a run, seeing a summary of A/B differences, the full A/B records,
and a side-by-side comparison.  But when runs may contain lots of
records, some kind of interface is needed to identify records for
viewing (where once the timdex_record_id is known, the Record page
only requires that).

Previously, a "Record Samples" section appeared on the Run page
that linked to a standalone table of records that met some kind
of criteria (e.g. source = X, or field Y has diffs).  This was
functional, but had drawbacks:

- this static HTML could not handle large number of records, meaning
a representative sample was used, which prevented access to all records
- the combinations of dimensions to drill down into was limited by the
static sample pages of records

How this addresses that need:
* Removes all "Record Samples" approaches
* Replaces with a single table in the Run page
  * filterable by source, whether records had modified specific modified
    fields, even full-text search of the records themselves
* This single table provides a mechanism to browse and filter records
from the run, from a single interface, with arguably simpler logic
under the hood to power it

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-385
@ghukill ghukill force-pushed the TIMX-385-records-preview branch from 6d28eb0 to b9c9b72 Compare November 12, 2024 16:05
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my review, I followed the instructions outlined in the PR description, and I cannot express how awesome these changes are! I was amazed at how quickly it was able to run queries against the record dataset and how user-friendly the interface is. Excellent job. I did pose one question + potential request for change!

@ghukill
Copy link
Contributor Author

ghukill commented Nov 12, 2024

For my review, I followed the instructions outlined in the PR description, and I cannot express how awesome these changes are! I was amazed at how quickly it was able to run queries against the record dataset and how user-friendly the interface is. Excellent job. I did pose one question + potential request for change!

Thanks! FWIW, once we get into the 3-4m records, I think we'll see a significant delay in the responsiveness of the table. At that time, we can see if we need optimizations, or if it's just 1-2 seconds, perhaps that is okay for this diagnostic tool.

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@ghukill ghukill merged commit 8f14417 into main Nov 12, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants