feat: fetch all existing datasets for diff #10

Draft
wants to merge 1 commit into
base: main

abulte (Collaborator) commented Dec 18, 2024

This fetches all existing datasets (including archived and private ones) from the /api/1/datasets?topic=xxx endpoint (the only one where they're available).
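For readers following along, here is a minimal sketch of what a full fetch over a paginated list endpoint could look like. It is not the code in this PR: the function names are hypothetical, and it assumes each page of the v1 response is a JSON object with a `data` list and a `next_page` URL that is empty on the last page. Authentication, which may be needed to see private datasets, is left out.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body (no auth, no retries)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_topic_datasets(
    first_page_url: str,
    get_json: Callable[[str], dict] = http_get_json,
) -> Iterator[dict]:
    """Yield every dataset of a topic, one page at a time.

    Assumption: each page looks like {"data": [...], "next_page": <url or None>}.
    """
    url = first_page_url
    while url:
        page = get_json(url)
        yield from page.get("data", [])
        # Stop when the API reports no further page.
        url = page.get("next_page")
```

The fetcher is passed in as `get_json` so the loop can be exercised without hitting the API. Each page costs one sequential request, which is where the running time goes on a large topic.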

The idea is that the diff would then be complete: we would remove the archived and private datasets from the current topic.

Thus, it's an alternative to ecolabdata/ecospheres#498.
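To make the "complete diff" idea concrete, here is a small sketch under stated assumptions: datasets are compared by id only, and `plan_topic_update` is a hypothetical helper, not a function from this repository. Because the existing set includes archived and private datasets, anything no longer wanted shows up in the removals.

```python
from typing import Iterable


def plan_topic_update(existing_ids: Iterable[str], wanted_ids: Iterable[str]) -> dict:
    """Compare what the topic holds with what it should hold.

    existing_ids: every dataset currently in the topic, archived and
                  private ones included (hypothetical input).
    wanted_ids:   datasets that should be in the topic after the update.
    """
    existing = set(existing_ids)
    wanted = set(wanted_ids)
    return {
        "add": sorted(wanted - existing),     # missing from the topic
        "remove": sorted(existing - wanted),  # no longer wanted
    }
```

For example, with `existing_ids=["a", "b", "archived"]` and `wanted_ids=["b", "c"]`, the plan adds `c` and removes `a` and `archived`.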

The problem is: it's very slow (due to the size of the topic and the slowness of the v1 API). I stopped a dry run on demo after one hour (89k datasets; it was still in the initial fetch). It might be manageable after a reset of the demo topic with ecolabdata/ecospheres#510.

streino (Contributor) commented Dec 18, 2024

I'm guessing it'll still be slow given how large the topic is... Slower than it should be anyway...

Could we imagine adding options to fetch archived/private datasets to the api v2?

abulte (Collaborator, Author) commented Dec 18, 2024

> Could we imagine adding options to fetch archived/private datasets to the api v2?

It kind of goes against the "philosophy" of how/what is indexed on data.gouv.fr (v2 is ES-only for dataset lists); this would be a major change. ecolabdata/ecospheres#498 will probably be easier.

2 participants