Sketch to support (private) data-proxy dataset #61

mih · 2023-03-08T07:40:26Z

This change started out by merely adding the ability to recognize and process non-public dataset data-proxy URLs.

However, it was not enough to support such datasets, because the underlying fairgraph query to get a dataset's file listing returns no results.

The query is essentially this:

```py
batch = omcore.File.list(
    self.client,
    file_repository=dvr,
    size=chunk_size,
    from_index=cur_index)
```

and for the dataset referenced in #58 it returns an empty list with

- a properly authenticated `client`
- `dvr`: `FileRepository(name='buckets/d-07ab1665-73b0-40c5-800e-557bc319109d', iri=IRI(https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d)...`
- `chunk_size`: 10000
- `cur_index`: 0

With the same requesting account, I can browser-visit https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d and see a file listing.
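For orientation, the chunked query quoted above can be read as one step of a paging loop. The following is a minimal sketch of such a loop, not the code in this PR: `iter_chunks` and `fetch_page` are hypothetical names, and `fetch_page(size, from_index)` stands in for a wrapper around the `omcore.File.list(...)` call.

```python
from typing import Callable, Iterator, List


def iter_chunks(
    fetch_page: Callable[[int, int], List[dict]],
    chunk_size: int = 10000,
) -> Iterator[dict]:
    """Yield records page by page until a short or empty page comes back.

    `fetch_page(size, from_index)` is a hypothetical stand-in for the
    fairgraph query; it must return at most `size` records.
    """
    cur_index = 0
    while True:
        batch = fetch_page(chunk_size, cur_index)
        yield from batch
        if len(batch) < chunk_size:
            # a short (or empty) page means there is nothing left to fetch
            break
        cur_index += len(batch)


# stand-in for the real query, serving 25 fake records
records = [{"name": f"file{i}"} for i in range(25)]


def fake_fetch(size: int, from_index: int) -> List[dict]:
    return records[from_index:from_index + size]


assert len(list(iter_chunks(fake_fetch, chunk_size=10))) == 25
```

With a loop of this shape, an empty first page (the behavior reported here) ends the iteration immediately and yields no files at all.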

To mitigate this issue, the PR now queries the data proxy API directly -- leading to noticeable speed-ups.
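A rough sketch of what a direct listing query could look like is given below. It is not the implementation in this PR. The API base URL is taken from the bucket IRI above; the `datasets/<id>` path, the `limit` and `marker` query parameters, the `objects` response key, and the Bearer token header are assumptions about the data proxy API, and `iter_dataset_objects` and `get_json` are hypothetical names.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

# base URL as it appears in the bucket IRI quoted in this PR
API_BASE = "https://data-proxy.ebrains.eu/api/v1"


def _get_json(url: str, token: Optional[str] = None) -> dict:
    """Fetch a URL and decode the JSON body (Bearer auth is an assumption)."""
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_dataset_objects(
    dataset_id: str,
    token: Optional[str] = None,
    limit: int = 1000,
    get_json: Callable[[str, Optional[str]], dict] = _get_json,
) -> Iterator[dict]:
    """Yield file records for a dataset, following marker-based paging.

    Endpoint path, `limit`/`marker` parameters, and the `objects` key are
    assumptions, not confirmed by this PR.
    """
    marker = None
    while True:
        params = {"limit": limit}
        if marker:
            params["marker"] = marker
        query = urllib.parse.urlencode(params)
        url = f"{API_BASE}/datasets/{dataset_id}?{query}"
        objects = get_json(url, token).get("objects", [])
        yield from objects
        if len(objects) < limit:
            break
        # continue after the last name seen on this page
        marker = objects[-1]["name"]
```

Passing `get_json` as a parameter keeps the paging logic testable without network access or a valid token.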

This PR prepares for addressing #58, but does not achieve it, due to outstanding authentication issues.



Previously we went via the Knowledge Graph (KG) to get a per-dataset file listing, which is extremely slow (#52). With this change, we query the data proxy directly whenever a data-proxy bucket is detected to be a dataset's file repository. This speeds up operations by about 30x.
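One way such a detection could work is to match the file repository IRI against the data-proxy bucket URL pattern. The sketch below is an illustration only: the function name is hypothetical, and the rule that a dataset bucket is named `d-<dataset ID>` is inferred from the single example in this PR rather than from API documentation.

```python
import re
from typing import Optional

# pattern modelled on the bucket IRI quoted in this PR
_BUCKET_IRI = re.compile(
    r"^https://data-proxy\.ebrains\.eu/api/v1/buckets/(?P<bucket>[^/?#]+)"
)


def dataset_id_from_bucket_iri(iri: str) -> Optional[str]:
    """Return the dataset ID if `iri` points to a dataset bucket, else None."""
    match = _BUCKET_IRI.match(iri)
    if match is None:
        return None
    bucket = match.group("bucket")
    # assumption: dataset buckets carry a "d-" prefix before the dataset ID
    return bucket[2:] if bucket.startswith("d-") else None


iri = (
    "https://data-proxy.ebrains.eu/api/v1/buckets/"
    "d-07ab1665-73b0-40c5-800e-557bc319109d"
)
assert dataset_id_from_bucket_iri(iri) == "07ab1665-73b0-40c5-800e-557bc319109d"
assert dataset_id_from_bucket_iri("https://example.org/some/file") is None
```

Any repository IRI that does not match would then fall back to the KG-based listing.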


codecov · 2023-07-14T10:40:05Z

Codecov Report

Patch coverage: 72.41% and project coverage change: -4.39 ⚠️

Comparison is base (75acaae) 94.86% compared to head (9286a61) 90.47%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
- Coverage   94.86%   90.47%   -4.39%
==========================================
  Files           8        8
  Lines         253      273      +20
==========================================
+ Hits          240      247       +7
- Misses         13       26      +13
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| datalad_ebrains/tests/conftest.py | `78.57% <50.00%> (-11.43%)` | ⬇️ |
| datalad_ebrains/fairgraph_query.py | `89.58% <75.00%> (-7.30%)` | ⬇️ |
| datalad_ebrains/tests/test_clone.py | `100.00% <100.00%> (ø)` | |



mih mentioned this pull request Mar 8, 2023

Unrecognized file repository pointer for private dataset in ebrains #58

Open

mih added 8 commits July 14, 2023 11:07

Tighten the dependency on fairgraph to avoid JSON-LD error

51144e3

'Absolute IRI confused with prefix.'

Remove deprecated and unused intersphinx setup

19542b8

Tighten dependency on datalad-next to pin error behavior

2c549df

datalad-next now does wholistic parameter validation.

RF to prepare for non-KG dataset content queries

79a6e92

Let the tests accept an existing TOKEN env var

eee45f4

Make data proxy auth conditional resource being private

3786478

Otherwise the processing stops with a HTTP400 response.
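The commit note above suggests that credentials should only be sent when the resource is private. A minimal sketch of that idea follows. It assumes a Bearer token scheme and a caller that already knows whether the dataset is private; `request_headers` is a hypothetical helper, not a function from this PR.

```python
from typing import Dict, Optional


def request_headers(token: Optional[str], is_private: bool) -> Dict[str, str]:
    """Build HTTP headers, adding credentials only for private resources."""
    if not is_private:
        # public resources are requested without any credentials
        return {}
    if not token:
        raise ValueError("a token is required to access a private dataset")
    # assumption: the data proxy accepts a Bearer token
    return {"Authorization": f"Bearer {token}"}


assert request_headers("abc", is_private=False) == {}
assert request_headers("abc", is_private=True) == {"Authorization": "Bearer abc"}
```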

mih force-pushed the privateds branch from 13591b3 to 3786478 Compare July 14, 2023 10:02

Refactor to improve alignment to data proxy API

9286a61

We need to start with the `dataset/` API endpoint, so get that URL to the point where requests need to be made.

mih changed the title ~~Sketch to support private data-proxy dataset~~ Sketch to support (private) data-proxy dataset Jul 14, 2023
