Include dataset hash in job search #19112

mvdbeek · 2024-11-06T10:52:00Z

This means we find equivalent jobs if an input hda either points at the same dataset id (existing behavior), or if the dataset ~~source_uri, transform and~~ hashes match. All further restrictions still apply (same metadata etc).
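As a rough illustration of the rule described above (not the actual Galaxy query, which is built in SQL), the equivalence check for a single input could be sketched as follows. `InputDataset` and `inputs_equivalent` are hypothetical names, and treating datasets without any recorded hash as non-matching is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class InputDataset:
    """Hypothetical stand-in for the dataset behind an input HDA."""

    dataset_id: int
    # (hash_function, hash_value) pairs recorded for the dataset
    hashes: FrozenSet[Tuple[str, str]] = frozenset()


def inputs_equivalent(a: InputDataset, b: InputDataset) -> bool:
    """Return True if two inputs may be treated as the same for job search."""
    # Existing behavior: both HDAs point at the same dataset id.
    if a.dataset_id == b.dataset_id:
        return True
    # New behavior: the recorded hashes match. Assumption: a dataset with
    # no recorded hashes never matches another dataset this way.
    return bool(a.hashes) and a.hashes == b.hashes
```

All other restrictions (same metadata, datatype and so on) would still have to be checked separately.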

Builds on #19108 and #19110 .

Also:

- Drop test_run_deferred_dataset and test_run_deferred_dataset_with_metadata_options_filter tests from lib/galaxy_test/api/test_tools.py, which are duplicates of test_deferred_with_cached_input and test_deferred_with_metadata_options_filter in lib/galaxy_test/api/test_tool_execute.py.
- Improve type annotations

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

bgruening · 2024-11-06T21:49:30Z

Running a Galaxy Training Academy without large computational resources - yeah, yeah!

jmchilton · 2024-11-07T18:32:33Z

We have no clue about extra files in this context, right? How can we draw this conclusion without that? It isn't a -1, but I am uncomfortable with the direction this is all heading. I feel like we should be working on tightening up job search, not loosening it more and more before we've installed the guard rails it needs, IMO.

jmchilton · 2024-11-07T18:36:27Z

Maybe to clarify this - I absolutely think we should be able to hash the input to be used with job search. But I don't think this hash is what I would use - I would implement a hash of the dataset and not of the primary file of the dataset. 95% of the time they could be the same but we should verify that before using it in this context.

lib/galaxy_test/api/test_tools.py

mvdbeek · 2024-11-08T08:33:30Z

> I would implement a hash of the dataset and not of the primary file of the dataset. 95% of the time they could be the same but we should verify that before using it in this context.

If we match the transform would it still be possible to get different extra files for uploaded content (considering we already match on the datatype) ? I added 4abc6e6 because we don't record the transform for local files, but we could of course do that.

There's of course also value in calculating the hash for datasets as they are written to the object store, but if we can get reliable equivalence using the upload hash + transform we could make nice progress in figuring out other edge cases for IWC workflows ?

jmchilton · 2024-11-08T12:18:38Z

I had the realization reading your response that you only care about upload or think these fields can all only be set during uploads. I believe any tool can produce hashes and source hashes and I believe we have an API for hashing files after the fact anyway. Could we restrict all this logic to just fetch and upload1 - then it feels pretty close to being correct? This also probably explains my unease with #19110 - I think we don't validate the hashes outside the fetch tool but we can create the hashes in other tools I think - it feels like what we need is a hash validated field if we want to act on that data in this fashion but maybe it would be sufficient to be more proactive in validating the hash field whenever it is set.

nsoranzo · 2024-11-22T00:06:37Z

You may want to rebase this now that #19181 has been merged (thanks @jmchilton !), since it contained some type annotation fixes copied from here.

> If we match the transform would it still be possible to get different extra files for uploaded content (considering we already match on the datatype) ?

Not sure it's the same problem, but in #19181 I found out that a type of transform (grooming) is not reproducible (because the BAM header contains paths as part of the samtools command used to sort the data), so you'd get different hashes if you materialize twice from the same source.

So I think I agree it's safer to match on the actual dataset hashes instead of the dataset source hashes.
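A toy illustration of why a non-reproducible transform breaks hash equality: the `groom` function below is a hypothetical stand-in (it does not produce real BAM data or call samtools) that embeds the working directory in a header line, the way the sort command line ends up in the BAM header.

```python
import hashlib


def groom(payload: bytes, working_dir: str) -> bytes:
    """Hypothetical 'grooming' step that records its command line in a header."""
    # Assumption for illustration: the header contains job-specific paths.
    header = (
        f"@PG\tID:samtools\tCL:samtools sort -o {working_dir}/out.bam "
        f"{working_dir}/in.bam\n"
    ).encode()
    return header + payload


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Same source content, materialized twice in different job directories.
first = groom(b"reads", "/tmp/job_1")
second = groom(b"reads", "/tmp/job_2")
hashes_differ = sha256_hex(first) != sha256_hex(second)
```

Here `hashes_differ` is true even though the source payload is identical, which is why matching on the hash of the materialized dataset is not the same as matching on the hash of the source.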

mvdbeek · 2024-11-22T09:36:38Z

Uhm, I'm confused, is that not what I'm doing ?

nsoranzo · 2024-11-22T11:56:28Z

> Uhm, I'm confused, is that not what I'm doing ?

Ah, yes, I guess I was confused by the discussion above and the PR title, shouldn't it be "Include dataset hash in job search"? "Source hash" hints at the DatasetSourceHash model class to me (instead of DatasetHash, which is what you are using).

nsoranzo · 2024-12-12T15:39:38Z

@mvdbeek Still draft?

mvdbeek · 2024-12-12T15:56:33Z

I think so ? I need to think about all the ways it could break, and extra files is a valid concern, I think ?

nsoranzo · 2024-12-12T16:22:38Z

> I think so ? I need to think about all the ways it could break, and extra files is a valid concern, I think ?

Right, should we only use dataset hashes if there are no extra files for the moment?

mvdbeek · 2024-12-12T16:33:07Z

That sounds good to me. I'm hacking away on something else right now, if you want to add to the PR that would be really cool, otherwise I'll come back to it later.

Fix the following test failure in galaxyproject#19112:

```
FAILED lib/galaxy_test/api/test_jobs.py::TestJobsApi::test_delete_job - assert 1 == 0
 +  where 1 = len([{'model_class': 'Job', 'id': 'ebdef174148ecc74', 'history_id': '4dbec071fa7bdbaa', 'tool_id': 'cat1', 'state': 'ok', 'exit_code': 0, 'create_time': '2025-11-03T22:13:05.770038', 'update_time': '2025-11-03T22:13:07.098024', 'galaxy_version': '26.0', 'external_id': None, 'handler': None, 'job_runner_name': None, 'command_line': None, 'user_email': None, 'user_id': 'adb5f5c93f827949', 'command_version': '', 'params': {'queries': '[]', 'chromInfo': '"/home/runner/work/galaxy/galaxy/galaxy root/tool-data/shared/ucsc/chrom/?.len"', 'dbkey': '"?"', '__input_ext': '"input"'}, 'inputs': {'input1': {'id': 'cda8ec5455a55990', 'src': 'hda', 'uuid': 'e7979ca3-5df9-4d8d-8654-a8db8291a6c4'}}, 'outputs': {'out_file1': {'id': '3d0ca2420d1410fd', 'src': 'hda', 'uuid': 'fbeff3c3-12f5-40bf-aa6a-ea536e482938'}}, 'copied_from_job_id': None, 'output_collections': {}}])
 +    where [{'model_class': 'Job', 'id': 'ebdef174148ecc74', 'history_id': '4dbec071fa7bdbaa', 'tool_id': 'cat1', 'state': 'ok', 'exit_code': 0, 'create_time': '2025-11-03T22:13:05.770038', 'update_time': '2025-11-03T22:13:07.098024', 'galaxy_version': '26.0', 'external_id': None, 'handler': None, 'job_runner_name': None, 'command_line': None, 'user_email': None, 'user_id': 'adb5f5c93f827949', 'command_version': '', 'params': {'queries': '[]', 'chromInfo': '"/home/runner/work/galaxy/galaxy/galaxy root/tool-data/shared/ucsc/chrom/?.len"', 'dbkey': '"?"', '__input_ext': '"input"'}, 'inputs': {'input1': {'id': 'cda8ec5455a55990', 'src': 'hda', 'uuid': 'e7979ca3-5df9-4d8d-8654-a8db8291a6c4'}}, 'outputs': {'out_file1': {'id': '3d0ca2420d1410fd', 'src': 'hda', 'uuid': 'fbeff3c3-12f5-40bf-aa6a-ea536e482938'}}, 'copied_from_job_id': None, 'output_collections': {}}] = json()
 +      where json = <Response [200]>.json
```

where searching for jobs has now started returning jobs with the same tool_id and input hashes but from different histories.

bgruening · 2026-01-12T14:21:08Z

A few trainers are testing the job cache with 25.1 and so far it works great. However, we see a lot of confusion: "Why can't people upload data from Zenodo etc.?", "Why do they need to use a shared history or DL?" ... I think this PR is very important to streamline the job-cache usage and make it just easier for trainers to make use of it.

nsoranzo · 2026-01-26T15:37:57Z

> Right, should we only use dataset hashes if there are no extra files for the moment?

> That sounds good to me.

After some back-and-forth, I've switched to also calculating dataset hashes for the extra files and extended has_same_hash() to match those hashes in addition to the primary file's. Claude helped with writing the SQLAlchemy code changes.

> I think this PR is very important to streamline the job-cache usage and make it just easier for trainers to make use of it.

It would indeed be great to have this merged for 26.0, I've marked it as ready for review.
Remaining test failures seem unrelated.
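The "match only when all hashes match" behaviour described above for has_same_hash() can be sketched with plain SQL in SQLite. This is not the PR's SQLAlchemy code: the `dataset_hash` table layout, the use of an empty `extra_files_path` for the primary file, and the function names are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: one row per hashed file of a dataset.
SCHEMA = """
CREATE TABLE dataset_hash (
    id INTEGER PRIMARY KEY,
    dataset_id INTEGER NOT NULL,
    hash_function TEXT NOT NULL,
    hash_value TEXT NOT NULL,
    extra_files_path TEXT NOT NULL DEFAULT '',  -- '' means the primary file
    UNIQUE (dataset_id, hash_function, extra_files_path)
)
"""

# Candidate dataset B matches dataset A only if every hash of A is found
# in B and B has no hash rows beyond those (no partial matches).
MATCH_QUERY = """
SELECT b.dataset_id
FROM dataset_hash AS b
LEFT JOIN dataset_hash AS a
  ON a.dataset_id = :a_id
 AND a.hash_function = b.hash_function
 AND a.hash_value = b.hash_value
 AND a.extra_files_path = b.extra_files_path
WHERE b.dataset_id != :a_id
GROUP BY b.dataset_id
HAVING COUNT(a.id) = (SELECT COUNT(*) FROM dataset_hash WHERE dataset_id = :a_id)
   AND COUNT(*) = COUNT(a.id)
ORDER BY b.dataset_id
"""


def find_datasets_with_same_hashes(conn: sqlite3.Connection, dataset_id: int) -> list:
    rows = conn.execute(MATCH_QUERY, {"a_id": dataset_id}).fetchall()
    return [row[0] for row in rows]


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        (1, "SHA-256", "h1", ""), (1, "SHA-256", "hA", "a.txt"),
        (2, "SHA-256", "h1", ""), (2, "SHA-256", "hA", "a.txt"),  # full match
        (3, "SHA-256", "h1", ""),                                 # subset of 1
        (4, "SHA-256", "h1", ""), (4, "SHA-256", "hA", "a.txt"),
        (4, "SHA-256", "hB", "b.txt"),                            # superset of 1
        (5, "SHA-256", "h2", ""),                                 # different
    ]
    conn.executemany(
        "INSERT INTO dataset_hash "
        "(dataset_id, hash_function, hash_value, extra_files_path) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn
```

With the demo data, only dataset 2 is reported as matching dataset 1; datasets 3 and 4 share some hashes but are rejected because the sets are not identical.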

mvdbeek · 2026-01-26T18:56:55Z

lib/galaxy/jobs/__init__.py

                        compute_dataset_hash.delay(request=request)

+                        # For composite datasets with extra files, hash each extra file individually
+                        if dataset.extra_files_path_exists():


These can be thousands of files; can we at least smash this into a single task? And maybe record a hash of hashes so the query won't have thousands of comparisons?

mvdbeek · 2026-01-26T19:21:17Z

lib/galaxy/managers/jobs.py

+                .group_by(b_hash.dataset_id)
+                .having(
+                    and_(
+                        # Number of matched hashes equals total hashes in A


I am so worried about this bringing down the database ...

What if we calculate and store in the database (also) a "combined_hash", i.e. a hash of all the hashes of the dataset primary and extra files? We could use a similar approach also for a dataset collection (whose hash would be the combined hash of the combined hashes of its components).
Then we would be able to simplify the database query.
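A minimal sketch of such a "combined_hash", assuming SHA-256 and a simple `path + NUL + hash` serialization; neither the function names nor the serialization come from the PR, and whether element order should matter for collections is also an assumption here.

```python
from __future__ import annotations

import hashlib


def combined_hash(file_hashes: dict[str, str]) -> str:
    """Hash of the hashes of a dataset's primary file and extra files.

    `file_hashes` maps a relative path ('' for the primary file, by
    assumption) to that file's hex digest.
    """
    digest = hashlib.sha256()
    # Sort by path so the result does not depend on hashing order.
    for path in sorted(file_hashes):
        digest.update(f"{path}\0{file_hashes[path]}\n".encode())
    return digest.hexdigest()


def collection_combined_hash(element_hashes: list[str]) -> str:
    """Combined hash of the combined hashes of a collection's elements.

    Assumption: element order is significant, so it is preserved.
    """
    digest = hashlib.sha256()
    for element_hash in element_hashes:
        digest.update(element_hash.encode() + b"\n")
    return digest.hexdigest()
```

With a single stored value per dataset, the job search could compare one column instead of grouping over every per-file hash row.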

Can extra files change when you change the datatype ? I think they can ? I don't know if it's wise to include the extra file stuff in a first pass, it is complex and I'm not sure that even just including the hashes isn't a large query penalty.

It's the primary file that changes for some datatypes (e.g. Rgenetics) when setting metadata, not sure about the extra files.

mvdbeek · 2026-01-26T19:26:41Z

I am uncertain that this is always correct and worried about the query overhead we're adding. It also won't work for collections which I guess is ok for a first pass.

mvdbeek · 2026-01-26T19:33:46Z

> Could we restrict all this logic to just fetch and upload1

Having thought about this for a long time, I think that's the right thing to do? Replace the dataset either up front if we have an expected checksum (this already works for materialization; we can add a button to the GTN for that) or after the download, when we have the checksum. An implementation of this was in 5cd059a.
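The two replacement points mentioned above (up front with an expected checksum, or after the download once the checksum is known) could look roughly like this. `UploadStore` is a hypothetical in-memory stand-in, not Galaxy's fetch or upload1 implementation, and SHA-256 is assumed as the checksum.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class UploadStore:
    """Hypothetical store that reuses an existing dataset when checksums match."""

    def __init__(self) -> None:
        self._by_checksum: Dict[str, int] = {}
        self._next_id = 1

    def upload(
        self, fetch: Callable[[], bytes], expected_checksum: Optional[str] = None
    ) -> Tuple[int, bool]:
        """Return (dataset_id, reused)."""
        # Up front: an expected checksum lets us skip the download entirely.
        if expected_checksum is not None and expected_checksum in self._by_checksum:
            return self._by_checksum[expected_checksum], True
        data = fetch()
        checksum = sha256_hex(data)
        if expected_checksum is not None and checksum != expected_checksum:
            raise ValueError("downloaded content does not match expected checksum")
        # After the download: reuse an existing dataset with the same checksum.
        existing = self._by_checksum.get(checksum)
        if existing is not None:
            return existing, True
        dataset_id = self._next_id
        self._next_id += 1
        self._by_checksum[checksum] = dataset_id
        return dataset_id, False
```

Because the checksum is computed or validated inside the upload step itself, this keeps the logic restricted to the place where the hash is known to be trustworthy.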


github-actions bot added the area/testing, area/database, area/datatypes and area/testing/api labels Nov 6, 2024

nsoranzo added BioHackEU24 kind/enhancement labels Nov 6, 2024


mvdbeek force-pushed the include_hash_in_job_search branch from b3ab7ac to ac4bba0 November 6, 2024 15:14

mvdbeek changed the title ~~Include source_uri, transform and hash in job search~~ Include source hash in job search Nov 7, 2024

nsoranzo reviewed Nov 8, 2024


mvdbeek changed the title ~~Include source hash in job search~~ Include dataset hash in job search Nov 22, 2024

bgruening added this to the 25.0 milestone Feb 20, 2025

mvdbeek mentioned this pull request Mar 24, 2025: Fix various mypy issues around mapped attributes #19883 (Merged)

mvdbeek modified the milestones: 25.0, 25.1 May 13, 2025

mvdbeek added this to GBCC2025 Live Planning Board May 13, 2025

afgane moved this to In Progress in GBCC2025 Live Planning Board May 28, 2025

nsoranzo force-pushed the include_hash_in_job_search branch from d1a7734 to 6e12ed1 September 22, 2025 13:08

mvdbeek force-pushed the include_hash_in_job_search branch from 177b66f to dee9c9c November 4, 2025 12:41

nsoranzo mentioned this pull request Nov 6, 2025: Allow filtering job searches by history ID #21257 (Merged)

nsoranzo force-pushed the include_hash_in_job_search branch 2 times, most recently from 0fb9a09 to 790fdd9 November 7, 2025 15:47

nsoranzo force-pushed the include_hash_in_job_search branch 2 times, most recently from 5943450 to 09f64a4 January 12, 2026 14:49

nsoranzo force-pushed the include_hash_in_job_search branch 6 times, most recently from 8cddb94 to 67a4689 January 26, 2026 12:20

nsoranzo added this to the 26.0 milestone Jan 26, 2026

nsoranzo marked this pull request as ready for review January 26, 2026 13:10

mvdbeek commented Jan 26, 2026


guerler modified the milestones: 26.0, 26.1 Jan 27, 2026

nsoranzo mentioned this pull request Jan 27, 2026: Type annotations and refactorings #21673 (Merged)

mvdbeek and others added 2 commits January 27, 2026 19:40

Include dataset hash in job search (40ab1bb)

Enable hash-based job matching for composite datasets by computing and storing hashes for all files (primary file and extra files). Enhanced ``has_same_hash()`` to match datasets only when all hashes match, preventing partial matches. (40a0820)

nsoranzo force-pushed the include_hash_in_job_search branch from 67a4689 to 40a0820 January 27, 2026 19:40

nsoranzo mentioned this pull request Jan 28, 2026: Plan: Implement hash-based job matching #21676 (Open)


6 participants
