Use REST APIs to resolve DOIs + cleanup dataverse provider #1390
Merged
Commits (20)
- `91242e5` Use the doi.org API to resolve URLs (yuvipanda)
- `52eeb8f` Stop mocking dataverse contentprovider test (yuvipanda)
- `bf40856` Fix text fixtures (yuvipanda)
- `3eab292` Merge branch 'integ' into use-api (yuvipanda)
- `172f8b0` [WIP] Cleanup dataverse contentprovider (yuvipanda)
- `1260a5a` Support fetcing single files in dataverse (yuvipanda)
- `b7050ba` Always fetch entire dataset for dataverse (yuvipanda)
- `fde74ef` Fix content_id for dataverse URLs (yuvipanda)
- `96057f9` Use List from typing (yuvipanda)
- `fda5339` Use hash for content_id (yuvipanda)
- `f6037ca` [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- `b854b77` Fix tests (yuvipanda)
- `f9e3d70` [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- `53fba84` Add note about supporting /citation (yuvipanda)
- `f4d58dc` Describe what kind of DOI is being returned (yuvipanda)
- `d71efb8` Fix figshare unit test (yuvipanda)
- `f7dfff1` Fix hydroshare tests (yuvipanda)
- `3be6ca9` Fix zenodo tests (yuvipanda)
- `60c0d70` Fix doi provider test (yuvipanda)
- `e48f5b7` Switch back to using DOI as persistent_id (yuvipanda)
Conversations
Is there a risk that a 64-character hash might make image names too long?
https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pulling-manifests says many implementations limit the hostname and name of the image, in total, to 256 chars. I think this means it may be good enough and not a problem?

Alternatively, I can go back to parsing `persistent_id` at `detect` time, instead of at `fetch` time, and set it that way. I think part of the confusion here is around `detect` semantics and when `content_id` is called. Ideally `detect` should be stateless and simply be used to, well, detect things! But we seem to treat it as also the thing that sets `content_id`, so it's a little bit of a mess. I'm happy to treat that as a different refactor though.

Choice to be made here, then. Happy to do either!
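As a quick sanity check of that 256-character budget, here is a small sketch; the registry hostname, name prefix, and digest below are all made up for illustration, not taken from repo2docker:

```python
# Per the OCI distribution spec note above, many registries cap
# len(hostname) + len(name) for an image at 256 characters in total.
hostname = "registry.example.org"       # made-up registry host
prefix = "r2d-doi-10-7910-dvn-tjclkp-"  # made-up repo2docker-style name prefix
digest = "a" * 64                       # stand-in for a 64-char sha256 hex digest

full_name = f"{hostname}/{prefix}{digest}"
print(len(full_name), len(full_name) <= 256)  # 112 True
```

Even with a full 64-character hex digest appended, a name of this shape stays well under the common limit.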
Yeah, that certainly doesn't sound right. It looks to me like we also only access `content_id` after calling `fetch`. Is it possible that the issue you are seeing is only in the tests, not how r2d actually behaves? What happens if you `raise` in `content_id` if `fetch` hasn't been called?

If it's just the tests and `persistent_id` is defined after `fetch`, then keeping `persistent_id` seems nice here, and maybe we can fix the tests to be more realistic. And make it explicit that `content_id` cannot be assumed to be available until `fetch` has been called?

A tangent I went on about hash length, which I'm not sure is relevant anymore, but already wrote down. Feel free to ignore:

Initially, I thought the content id was the full thing, but of course it's the 'ref' that goes after the doi itself. Running a test gives this 106-character image name:

Since we're in the namespace of the doi, collision probability is super low. We truncate to the short hash in git. So maybe truncate this hash, or use a natively shorter hash function like:

(blake2 has been in `hashlib.algorithms_guaranteed` since 3.6, I think)
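The shorter-hash idea from the tangent above can be sketched with the standard library; `blake2b`'s `digest_size` parameter avoids a separate truncation step. The 8-byte size and the DOI below are illustrative choices, not values from this thread:

```python
import hashlib

def short_content_id(persistent_id: str, digest_size: int = 8) -> str:
    """Return a short, stable hex digest for use in an image name.

    blake2b takes a digest_size argument, so no manual truncation is
    needed: 8 bytes -> a 16-character hex string.
    """
    return hashlib.blake2b(
        persistent_id.encode("utf-8"), digest_size=digest_size
    ).hexdigest()

# 10.5072 is the DataCite test prefix, so this DOI is deliberately fake.
print(len(short_content_id("doi:10.5072/FK2/EXAMPLE")))  # 16
```

Because the digest is keyed only by the persistent id, the same dataset always yields the same short id, which is what matters for image-name stability.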
ooooh, fixing the tests seems the right thing to do! I'll take a peek.
hey, it looks like I already fixed the tests, so it's all fine now! Back to using `persistent_id` as the identifier, but this time it's 'proper': if we get a file `persistent_id`, we resolve it to the dataset persistent id and use that. So if multiple different folks try to use different files from the same dataset, it will lead to cache reuse now!
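That cache-reuse behaviour can be illustrated with a small sketch; the lookup table stands in for the Dataverse API call the real provider makes, and all identifiers use the DataCite test prefix, so nothing here is a real endpoint or DOI:

```python
# Hypothetical sketch: two users fetching *different files* from the
# *same dataset* should end up with the same identifier, so their
# builds share a cache. This dict fakes the file -> dataset
# resolution that the real provider performs via the Dataverse API.
FILE_TO_DATASET = {
    "doi:10.5072/FK2/FILE1": "doi:10.5072/FK2/DATASET",
    "doi:10.5072/FK2/FILE2": "doi:10.5072/FK2/DATASET",
}

def normalize_persistent_id(pid: str) -> str:
    """Map a file persistent_id to its dataset's persistent_id;
    dataset-level ids pass through unchanged."""
    return FILE_TO_DATASET.get(pid, pid)

# Different files, same dataset -> same identifier -> cache reuse.
print(normalize_persistent_id("doi:10.5072/FK2/FILE1")
      == normalize_persistent_id("doi:10.5072/FK2/FILE2"))  # True
```

The design point is that normalization happens before the identifier is used as a cache key, so the key space is per-dataset rather than per-file.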
Great! All looks right to me, then.