Use REST APIs to resolve DOIs + cleanup dataverse provider #1390
Is there a risk that a 64 character hash might make image names too long?
https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pulling-manifests says many implementations limit the hostname and name of the image in total to 256 chars. I think this means it may be good enough and not a problem?
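As a quick, illustrative sanity check of that 256-character limit (the registry prefix and DOI below are made-up placeholders, not values from this PR):

```python
import hashlib

# Illustrative only: a placeholder registry/prefix plus a 64-character sha256 hex digest.
registry_and_name = "registry.example.org/r2d-" + hashlib.sha256(b"doi:10.7910/DVN/EXAMPLE").hexdigest()

# The distribution spec note above says many registries cap hostname + name at 256 chars in total.
print(len(registry_and_name), len(registry_and_name) <= 256)
```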
Alternatively, I can go back to parsing `persistent_id` at `detect` time, instead of at `fetch` time, and set it that way. I think part of the confusion here is around `detect` semantics and when `content_id` is called. Ideally `detect` should be stateless and be simply used to, well, detect things! But we seem to treat it as also the thing that sets `content_id`, so it's a little bit of a mess. I'm happy to treat that as a different refactor though.

Choice to be made: set `content_id` at `detect` time or at `fetch` time. Happy to do either!
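For illustration, a minimal sketch of the "set `content_id` at `fetch` time" option (not the actual repo2docker code; the method signatures are simplified and `_resolve_dataset_pid` is a hypothetical helper):

```python
class DataverseLikeProvider:
    """Sketch of the fetch-time option; names and signatures are simplified."""

    def detect(self, spec):
        # Stateless: only decide whether this spec looks like something we handle.
        return spec if spec.startswith("doi:") else None

    def fetch(self, spec, output_dir):
        # Resolve the persistent_id here and remember it for content_id.
        # _resolve_dataset_pid is a hypothetical stand-in for the REST lookup.
        self.persistent_id = self._resolve_dataset_pid(spec)
        yield f"Fetching {self.persistent_id} into {output_dir}\n"

    @property
    def content_id(self):
        # Make it explicit that content_id is only available after fetch().
        if not hasattr(self, "persistent_id"):
            raise RuntimeError("content_id requested before fetch() was called")
        return self.persistent_id

    def _resolve_dataset_pid(self, spec):
        # Placeholder: in the real provider this would hit the Dataverse API.
        return spec
```

The other choice would move the `_resolve_dataset_pid` call into `detect`, at the cost of making `detect` stateful.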
Yeah, that certainly doesn't sound right. It looks to me like we also only access `content_id` after calling `fetch`. Is it possible that the issue you are seeing is only in the tests, not how r2d actually behaves? What happens if you `raise` in `content_id` if `fetch` hasn't been called?

If it's just the tests and `persistent_id` is defined after `fetch`, then keeping `persistent_id` seems nice here, and maybe we can fix the tests to be more realistic. And make it explicit that `content_id` cannot be assumed to be available until `fetch` has been called?

A tangent I went on about hash length, that I'm not sure is relevant anymore, but already wrote down. Feel free to ignore:

Initially, I thought the content id was the full thing, but of course it's the 'ref' that goes after the doi itself. Running a test gives a 106-character image name. Since we're in the namespace of the doi, collision probability is super low. We truncate to the short hash in git, so maybe truncate this hash, or use a natively shorter hash function like blake2 (blake2 is in `hashlib.algorithms_guaranteed` since 3.6, I think).
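As a sketch of that shorter-hash idea (the `digest_size` and the DOI string below are just illustrative values, not what this PR uses):

```python
import hashlib

def short_content_hash(persistent_id, digest_size=8):
    # blake2b lets us choose the digest size directly, so the hex digest can be
    # much shorter than sha256's 64 characters (digest_size=8 -> 16 hex chars).
    return hashlib.blake2b(persistent_id.encode("utf-8"), digest_size=digest_size).hexdigest()

print(short_content_hash("doi:10.7910/DVN/EXAMPLE"))
```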
ooooh, fixing the tests seems the right thing to do! I'll take a peek.
Hey, it looks like I already fixed the tests, so it's all fine now! Back to using `persistent_id` as the identifier, but this time it's 'proper': if we get a file `persistent_id`, we resolve it to the dataset `persistent_id` and use that. So if multiple different folks use different files from the same dataset, it will lead to cache reuse now!
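Roughly the idea, as a sketch (the `/api/search` endpoint is part of the Dataverse API, but the exact query syntax and the `dataset_persistent_id` response field used here are assumptions to check against the Dataverse docs, and `requests` is only used for brevity):

```python
import requests

def resolve_dataset_pid(host, persistent_id):
    """Resolve a possibly file-level persistent_id to the owning dataset's persistent_id."""
    resp = requests.get(
        f"{host}/api/search",
        params={"q": f'persistentId:"{persistent_id}"'},
    )
    resp.raise_for_status()
    items = resp.json()["data"]["items"]
    if items and items[0]["type"] == "file":
        # A file hit points back to its dataset, so different files from the
        # same dataset all map to one content_id (and one cached image).
        return items[0]["dataset_persistent_id"]
    # Otherwise assume the persistent_id already identifies the dataset.
    return persistent_id
```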
Great! All looks right to me, then.