Use REST APIs to resolve DOIs + cleanup dataverse provider #1390
Changes from 11 commits
```diff
@@ -1,9 +1,11 @@
+import hashlib
 import json
 import os
 import shutil
-from urllib.parse import parse_qs, urlparse, urlunparse
+from typing import List
+from urllib.parse import parse_qs, urlparse

-from ..utils import copytree, deep_get
+from ..utils import copytree, deep_get, is_doi
 from .doi import DoiProvider
```
```diff
@@ -23,10 +25,11 @@ def __init__(self):
         self.hosts = json.load(fp)["installations"]
         super().__init__()

-    def detect(self, doi, ref=None, extra_args=None):
-        """Trigger this provider for things that resolve to a Dataverse dataset.
+    def detect(self, spec, ref=None, extra_args=None):
+        """
+        Detect if given spec is hosted on dataverse

-        Handles:
+        The spec can be:
         - DOI pointing to {siteURL}/dataset.xhtml?persistentId={persistentId}
         - DOI pointing to {siteURL}/file.xhtml?persistentId={persistentId}&...
         - URL {siteURL}/api/access/datafile/{fileId}
```
```diff
@@ -35,9 +38,11 @@ def detect(self, doi, ref=None, extra_args=None):
         - https://dataverse.harvard.edu/api/access/datafile/3323458
         - doi:10.7910/DVN/6ZXAGT
         - doi:10.7910/DVN/6ZXAGT/3YRRYJ

         """
-        url = self.doi2url(doi)
+        if is_doi(spec):
+            url = self.doi2url(spec)
+        else:
+            url = spec
         # Parse the url, to get the base for later API calls
         parsed_url = urlparse(url)
```
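The hunk above is the key behavioral change to `detect()`: it now accepts either a DOI or a plain URL as the spec. A minimal sketch of that normalization step, assuming a deliberately simplified `is_doi` check (the real helper lives in repo2docker's `utils` module and its exact pattern may differ):

```python
import re

def is_doi(spec):
    # Hypothetical simplification of repo2docker's is_doi helper: a DOI looks
    # like "10.<registrant>/<suffix>", optionally prefixed with "doi:" or a
    # doi.org resolver URL.
    return re.match(
        r"^(doi:|https?://(dx\.)?doi\.org/)?10\.\d+(\.\d+)*/.+", spec
    ) is not None

def normalize(spec, doi2url):
    # Mirrors the new detect() logic: DOIs get resolved to a landing URL,
    # plain URLs are passed through untouched.
    if is_doi(spec):
        return doi2url(spec)
    return spec

# Both spec styles from the docstring above are handled:
print(is_doi("doi:10.7910/DVN/6ZXAGT"))                                     # True
print(is_doi("https://dataverse.harvard.edu/api/access/datafile/3323458"))  # False
```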
```diff
@@ -53,54 +58,117 @@ def detect(self, doi, ref=None, extra_args=None):
         if host is None:
             return

-        query_args = parse_qs(parsed_url.query)
-        # Corner case handling
-        if parsed_url.path.startswith("/file.xhtml"):
-            # There's no way of getting file information using its persistentId, the only thing we can do is assume that doi
-            # is structured as "doi:<dataset_doi>/<file_doi>" and try to handle dataset that way.
-            new_doi = doi.rsplit("/", 1)[0]
-            if new_doi == doi:
-                # tough luck :( Avoid inifite recursion and exit.
-                return
-            return self.detect(new_doi)
-        elif parsed_url.path.startswith("/api/access/datafile"):
-            # Raw url pointing to a datafile is a typical output from an External Tool integration
-            entity_id = os.path.basename(parsed_url.path)
-            search_query = "q=entityId:" + entity_id + "&type=file"
-            # Knowing the file identifier query search api to get parent dataset
-            search_url = urlunparse(
-                parsed_url._replace(path="/api/search", query=search_query)
-            )
-            self.log.debug("Querying Dataverse: " + search_url)
-            data = self.urlopen(search_url).json()["data"]
-            if data["count_in_response"] != 1:
-                self.log.debug(
-                    f"Dataverse search query failed!\n - doi: {doi}\n - url: {url}\n - resp: {json.dump(data)}\n"
-                )
-                return
-
-            self.record_id = deep_get(data, "items.0.dataset_persistent_id")
-        elif (
-            parsed_url.path.startswith("/dataset.xhtml")
-            and "persistentId" in query_args
-        ):
-            self.record_id = deep_get(query_args, "persistentId.0")
-
-        if hasattr(self, "record_id"):
-            return {"record": self.record_id, "host": host}
+        # Used only for content_id
+        self.url = url
+
+        # At this point, we *know* this is a dataverse URL, because:
+        # 1. The DOI resolved to a particular host (if using DOI)
+        # 2. The host is in the list of known dataverse installations
+        #
+        # We don't know exactly what kind of dataverse object this is, but
+        # that can be figured out during fetch as needed
+        return {"host": host, "url": url}
+
+    def get_dataset_id_from_file_id(self, host: str, file_id: str) -> str:
+        """
+        Return the persistent_id (DOI) that a given file_id (int or doi) belongs to
+        """
+        if file_id.isdigit():
+            # the file_id is an integer, rather than a persistent id (DOI)
+            api_url = f"{host}/api/files/{file_id}?returnDatasetVersion=true"
+        else:
+            # the file_id is a doi itself
+            api_url = f"{host}/api/files/:persistentId?persistentId={file_id}&returnDatasetVersion=true"
+
+        resp = self._request(api_url)
+        if resp.status_code == 404:
+            raise ValueError(f"File with id {file_id} not found in {host}")
+
+        resp.raise_for_status()
+
+        data = resp.json()["data"]
+        return data["datasetVersion"]["datasetPersistentId"]
+
+    def get_datafiles(self, dataverse_host: str, url: str) -> List[dict]:
+        """
+        Return a list of dataFiles for given persistent_id
+
+        Supports the following *dataset* URL styles:
+        - /citation: https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/TJCLKP
+        - /dataset.xhtml: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP
+
+        Supports the following *file* URL styles:
+        - /api/access/datafile: https://dataverse.harvard.edu/api/access/datafile/3323458
+
+        Supports a subset of the following *file* URL styles:
+        - /file.xhtml: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT/3YRRYJ
+
+        If a URL can not be parsed, throw an exception
+        """
+
+        parsed_url = urlparse(url)
+        path = parsed_url.path
+        qs = parse_qs(parsed_url.query)
+        dataverse_host = f"{parsed_url.scheme}://{parsed_url.netloc}"
+        url_kind = None
+        persistent_id = None
+        is_ambiguous = False
+
+        # https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/TJCLKP
+        if path.startswith("/citation"):
+            is_ambiguous = True
+            persistent_id = qs["persistentId"][0]
+        # https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP
+        elif path.startswith("/dataset.xhtml"):
+            persistent_id = qs["persistentId"][0]
+        # https://dataverse.harvard.edu/api/access/datafile/3323458
+        elif path.startswith("/api/access/datafile"):
+            # What we have here is an entity id, which we can use to get a persistentId
+            file_id = os.path.basename(parsed_url.path)
+            persistent_id = self.get_dataset_id_from_file_id(dataverse_host, file_id)
+        elif parsed_url.path.startswith("/file.xhtml"):
+            file_persistent_id = qs["persistentId"][0]
+            persistent_id = self.get_dataset_id_from_file_id(
+                dataverse_host, file_persistent_id
+            )
+        else:
+            raise ValueError(
+                f"Could not determine persistent id for dataverse URL {url}"
+            )
+
+        dataset_api_url = (
+            f"{dataverse_host}/api/datasets/:persistentId?persistentId={persistent_id}"
+        )
+        resp = self._request(dataset_api_url, headers={"accept": "application/json"})
+        if resp.status_code == 404 and is_ambiguous:
+            # It's possible this is a *file* persistent_id, not a dataset one
+            persistent_id = self.get_dataset_id_from_file_id(
+                dataverse_host, persistent_id
+            )
+            dataset_api_url = f"{dataverse_host}/api/datasets/:persistentId?persistentId={persistent_id}"
+            resp = self._request(
+                dataset_api_url, headers={"accept": "application/json"}
+            )
+
+        if resp.status_code == 404:
+            # This persistent id is just not here
+            raise ValueError(f"{persistent_id} on {dataverse_host} is not found")
+
+        # We already handled 404, raise error for everything else
+        resp.raise_for_status()
+
+        data = resp.json()["data"]
+
+        return data["latestVersion"]["files"]

     def fetch(self, spec, output_dir, yield_output=False):
         """Fetch and unpack a Dataverse dataset."""
-        record_id = spec["record"]
+        url = spec["url"]
         host = spec["host"]

-        yield f"Fetching Dataverse record {record_id}.\n"
-        url = f'{host["url"]}/api/datasets/:persistentId?persistentId={record_id}'
-
-        resp = self.urlopen(url, headers={"accept": "application/json"})
-        record = resp.json()["data"]
+        yield f"Fetching Dataverse record {url}.\n"

-        for fobj in deep_get(record, "latestVersion.files"):
+        for fobj in self.get_datafiles(host["url"], url):
             file_url = (
                 # without format=original you get the preservation format (plain text, tab separated)
                 f'{host["url"]}/api/access/datafile/{deep_get(fobj, "dataFile.id")}?format=original'
```
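The file-to-dataset resolution that `get_dataset_id_from_file_id` performs can be reproduced against the Dataverse native API directly. A minimal standalone sketch using `requests` instead of the provider's `self._request` wrapper; `dataset_id_for_file` is a hypothetical name, and the host/file id come from the docstring examples above:

```python
import requests

def dataset_id_for_file(host: str, file_id: str) -> str:
    # Integer ids use /api/files/{id}; DOI-style ids use the :persistentId form,
    # matching the two branches in get_dataset_id_from_file_id above.
    if file_id.isdigit():
        api_url = f"{host}/api/files/{file_id}?returnDatasetVersion=true"
    else:
        api_url = (
            f"{host}/api/files/:persistentId"
            f"?persistentId={file_id}&returnDatasetVersion=true"
        )
    resp = requests.get(api_url)
    # The provider turns a 404 into ValueError; here we just raise for any error.
    resp.raise_for_status()
    # The parent dataset's persistent id (a DOI) is nested under datasetVersion.
    return resp.json()["data"]["datasetVersion"]["datasetPersistentId"]

# e.g. dataset_id_for_file("https://dataverse.harvard.edu", "3323458")
# returns the DOI of the dataset that file 3323458 belongs to.
```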
```diff
@@ -129,4 +197,4 @@ def fetch(self, spec, output_dir, yield_output=False):
     @property
     def content_id(self):
         """The Dataverse persistent identifier."""
-        return self.record_id
+        return hashlib.sha256(self.url.encode()).hexdigest()
```
Review discussion on `content_id`:

Comment: Is there a risk that a 64 character hash might make image names too long?

Reply: https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pulling-manifests says many implementations limit the hostname and name of the image in total to 256 chars. I think this means it may be good enough and not a problem? Alternatively, I can go back to parsing persistent_id at […]. Choice to be made; happy to do either!

Reply: Yeah, that certainly doesn't sound right. It looks to me like we also only access content_id after calling fetch. Is it possible that the issue you are seeing is only in the tests, not how r2d actually behaves? What happens if you […]? If it's just the tests and persistent_id is defined after fetch, then keeping persistent_id seems nice here, and maybe we can fix the tests to be more realistic. And make it explicit that […]. A tangent I went on about hash length, that I'm not sure is relevant anymore, but already wrote down. Feel free to ignore: initially, I thought the content id was the full thing, but of course it's the 'ref' that goes after the doi itself. Running a test gives this 106-character image name: […]. Since we're in the namespace of the doi, collision probability is super low. We truncate to the short hash in git. So maybe truncate this hash, or use a natively shorter hash function like: `hashlib.blake2s(self.url.encode(), digest_size=10).hexdigest()` (blake2 is in […]).

Reply: ooooh, fixing the tests seems the right thing to do! I'll take a peek.

Reply: hey, it looks like I already fixed the tests, so it's all fine now! Back to using persistent_id as the identifier, but this time it's 'proper' - so if we get a file persistent_id, we resolve it to the dataset persistent id, and use that. So if multiple different folks try to use different files from the same dataset, it will lead to cache reuse now!

Reply: Great! All looks right to me, then.
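For the hash-length question in the thread above, the size difference is easy to check with the standard library: `sha256` hexdigests are always 64 characters, while `blake2s` with `digest_size=10` yields 20. A quick sketch (the URL is just an illustrative input):

```python
import hashlib

url = "https://dataverse.harvard.edu/api/access/datafile/3323458"

full = hashlib.sha256(url.encode()).hexdigest()
short = hashlib.blake2s(url.encode(), digest_size=10).hexdigest()

print(len(full))   # 64 hex characters
print(len(short))  # 20 hex characters (2 per digest byte)
```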