@mih commented Mar 8, 2023

This change started out by merely adding the ability to recognize and process non-public dataset data-proxy URLs.

However, it was not enough to support such datasets, because the underlying fairgraph query to get a dataset's file listing returns no results.

The query is essentially this

batch = omcore.File.list(
    self.client,
    file_repository=dvr,
    size=chunk_size,
    from_index=cur_index)

and for the dataset referenced in #58 it returns an empty list when called with

  • a properly authenticated client
  • dvr: FileRepository(name='buckets/d-07ab1665-73b0-40c5-800e-557bc319109d', iri=IRI(https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d)...
  • chunk_size: 10000
  • cur_index: 0

With the same requesting account, I can open https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d in a browser and see a file listing.
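The chunked listing described above can be sketched as a small paging loop. This is only an illustration, not the code in this PR: `iter_in_chunks`, `fetch`, and `fake_fetch` are hypothetical names, and stopping on an empty or short batch is an assumption about how the listing signals its end.

```py
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_in_chunks(
    fetch: Callable[[int, int], List[T]],
    chunk_size: int = 10000,
) -> Iterator[T]:
    """Yield items from a paged listing, one batch at a time.

    `fetch(size, from_index)` is a hypothetical callable. With fairgraph it
    could wrap the query quoted above, e.g.
    lambda size, idx: omcore.File.list(
        client, file_repository=dvr, size=size, from_index=idx)
    """
    cur_index = 0
    while True:
        batch = fetch(chunk_size, cur_index)
        if not batch:
            # An empty batch ends the listing. If the very first batch is
            # empty, nothing is yielded at all (the symptom reported here).
            return
        yield from batch
        if len(batch) < chunk_size:
            # Assumption: a short batch means there is nothing left to fetch.
            return
        cur_index += len(batch)


# Stand-in data source for demonstration purposes only
files = [f"file_{i}" for i in range(25)]


def fake_fetch(size: int, from_index: int) -> List[str]:
    return files[from_index:from_index + size]


listed = list(iter_in_chunks(fake_fetch, chunk_size=10))
empty = list(iter_in_chunks(lambda size, from_index: [], chunk_size=10))
```

With a loop like this, an empty first batch is indistinguishable from a dataset without files, which is why the empty result went unnoticed until the listing was compared with the browser view.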

To work around this issue, the PR now queries the data-proxy API directly, which also leads to noticeable speed-ups.

This PR prepares for addressing #58, but does not achieve it, due to outstanding authentication issues.

mih added 8 commits July 14, 2023 11:07
datalad-next now does holistic parameter validation.

This change is merely adding the ability to recognize and
process non-public dataset data-proxy URLs.

However, it is not enough to support such datasets, because
the underlying `fairgraph` query to get a dataset's file listing
returns no results.

The query is essentially this

```py
batch = omcore.File.list(
    self.client,
    file_repository=dvr,
    size=chunk_size,
    from_index=cur_index)
```

and for the dataset referenced in
#58 it returns an empty
list with

- a properly authenticated `client`
- `dvr`: `FileRepository(name='buckets/d-07ab1665-73b0-40c5-800e-557bc319109d', iri=IRI(https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d)...`
- `chunk_size`: 10000
- `cur_index`: 0

With the same requesting account, I can open
https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d
in a browser and see a file listing.

Previously we went via the KG to get a per-dataset file listing. This
is extremely slow (#52).

With this change, we query the data-proxy directly if a data-proxy
bucket is detected to be a dataset's file repository.

This speeds up operations by about 30x.
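
A minimal sketch of how such a bucket detection might look, assuming the URL layout seen in the single example above (`/api/v1/buckets/d-<dataset UUID>` on `data-proxy.ebrains.eu`). The helper name `dataset_id_from_bucket_iri` is hypothetical, and this is not necessarily how the change implements it.

```py
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: dataset buckets are named "d-" followed by the dataset UUID,
# as in the example IRI quoted in this PR.
_BUCKET_PATH = re.compile(
    r"^/api/v1/buckets/d-"
    r"(?P<dataset_id>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})/?$"
)


def dataset_id_from_bucket_iri(iri: str) -> Optional[str]:
    """Return the dataset UUID if `iri` looks like a data-proxy dataset bucket.

    Returns None for any other URL, so a caller can fall back to a
    different way of obtaining the file listing.
    """
    parsed = urlparse(iri)
    if parsed.scheme != "https" or parsed.netloc != "data-proxy.ebrains.eu":
        return None
    match = _BUCKET_PATH.match(parsed.path)
    return match.group("dataset_id") if match else None


bucket_iri = (
    "https://data-proxy.ebrains.eu/api/v1/buckets/"
    "d-07ab1665-73b0-40c5-800e-557bc319109d"
)
detected = dataset_id_from_bucket_iri(bucket_iri)
not_a_bucket = dataset_id_from_bucket_iri("https://example.com/some/other/repo")
```

Returning `None` rather than raising keeps the decision in the caller, which can then choose between the direct data-proxy route and the previous KG-based listing.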

Otherwise the processing stops with an HTTP 400 response.

codecov bot commented Jul 14, 2023

Codecov Report

Patch coverage: 72.41% and project coverage change: -4.39 ⚠️

Comparison is base (75acaae) 94.86% compared to head (9286a61) 90.47%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
- Coverage   94.86%   90.47%   -4.39%     
==========================================
  Files           8        8              
  Lines         253      273      +20     
==========================================
+ Hits          240      247       +7     
- Misses         13       26      +13     
| Impacted Files | Coverage Δ |
|---|---|
| datalad_ebrains/tests/conftest.py | 78.57% <50.00%> (-11.43%) ⬇️ |
| datalad_ebrains/fairgraph_query.py | 89.58% <75.00%> (-7.30%) ⬇️ |
| datalad_ebrains/tests/test_clone.py | 100.00% <100.00%> (ø) |


We need to start with the `dataset/` API endpoint, so pass that URL
through to the point where the requests are made.
@mih changed the title from "Sketch to support private data-proxy dataset" to "Sketch to support (private) data-proxy dataset" on Jul 14, 2023