Abstract datagouv interactions #464

Draft
wants to merge 11 commits into
base: main
Conversation

Pierlou
Contributor

@Pierlou Pierlou commented Mar 20, 2025

New syntaxes:

  • client:
prod_client = Client(api_key=DATAGOUV_SECRET_API_KEY)
demo_client = Client(environment="demo", api_key=DEMO_DATAGOUV_SECRET_API_KEY)
visiter_client = Client()  # this one can only get data
  • common to datasets and resources:
client.Dataset(dataset_id).front_url
client.Dataset(dataset_id).get_metadata()
client.Resource(resource_id, dataset_id).get_metadata()  # dataset_id is optional, retrieved if not specified
client.Dataset(dataset_id).update_metadata({"title": "New title"})
client.Dataset(dataset_id).delete()
client.Dataset(dataset_id).update_extras(payload)
client.Dataset(dataset_id).delete_extras(payload)
  • dataset related:
client.Dataset().create(payload)
  • resource related:
client.Resource().create_remote(
    dataset_id=dataset_id,
    payload={
        'title': 'Mon titre',
        'description': 'Ma description',
        'url': 'https://url.to/ressource.csv',
        'type': 'main',
        'format': 'csv',
    },
)
client.Resource().create_static(
    file_to_upload=MyFile,  # currently a File instance from utils/filesystem.py
    dataset_id=dataset_id,
    payload={
        'title': 'Mon titre',
        'description': 'Ma description',
        'type': 'main',
        'format': 'csv',
    },
)

Ideally, I'd like the creation functions to be class or static methods, because it feels weird to have to instantiate a Dataset or Resource in order to create one. I have not found a way to do that, though, as both DatasetCreator and ResourceCreator need to be given the instantiated client.
Another syntax could be:

client.create_dataset(...)
client.create_static_resource(...)
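
A minimal sketch of how that second syntax could sidestep the instantiation problem, assuming the client simply hands itself to the creator. The class bodies here are hypothetical stand-ins, not the PR's actual code, and the API call is faked so the example runs offline:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical stand-in for the PR's Dataset class."""
    id: str
    client: "Client"
    metadata: dict = field(default_factory=dict)


class DatasetCreator:
    """Hypothetical creator that needs the instantiated client."""

    def __init__(self, client: "Client"):
        self.client = client

    def create(self, payload: dict) -> Dataset:
        # A real implementation would POST the payload to the API here;
        # this sketch fakes the server-side id to stay offline.
        fake_id = f"dataset-{len(self.client.created) + 1}"
        self.client.created.append(fake_id)
        return Dataset(id=fake_id, client=self.client, metadata=dict(payload))


class Client:
    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key
        self.created = []  # only used by the fake create above

    def create_dataset(self, payload: dict) -> Dataset:
        # The client passes itself to the creator, so callers never have
        # to build an empty Dataset just to create a real one.
        return DatasetCreator(self).create(payload)


client = Client(api_key="dummy-key")
dataset = client.create_dataset({"title": "Brand new dataset"})
print(dataset.id, dataset.metadata["title"])
```

With this shape the creator still receives the instantiated client, but the empty `Dataset()` never appears in user code.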

@hacherix
Contributor

hacherix commented Mar 27, 2025

That's amazing!

The API keys could also be provided through environment variables
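
As a rough sketch of that idea, the client could fall back to an environment variable when no key is passed. The variable name `DATAGOUV_API_KEY` and the default environment value `"prod"` are assumptions made for illustration; neither is defined in this PR:

```python
import os
from typing import Optional


class Client:
    """Sketch only: the real Client in this PR may look different."""

    # Hypothetical variable name, not taken from the PR.
    ENV_VAR = "DATAGOUV_API_KEY"

    def __init__(self, environment: str = "prod", api_key: Optional[str] = None):
        self.environment = environment  # "prod" is a placeholder default
        # An explicit argument wins; otherwise read the environment.
        if api_key is not None:
            self.api_key = api_key
        else:
            self.api_key = os.environ.get(self.ENV_VAR)

    @property
    def read_only(self) -> bool:
        # Without any key, the client can only get data.
        return self.api_key is None


os.environ["DATAGOUV_API_KEY"] = "key-from-env"
env_client = Client()                        # picks up the variable
explicit_client = Client(api_key="my-key")   # the argument takes precedence
```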

Otherwise I like your last proposal. Something like:

from datagouv import Client, Dataset, Resource

client = Client(api_key=DATAGOUV_SECRET_API_KEY)

my_dataset: Dataset = client.create_dataset(...)
my_dataset.update_metadata({...})
my_resource_1: Resource = my_dataset.create_static_resource(...)
my_resource_2 = client.create_static_resource(dataset_id=my_dataset.dataset_id, ...)

same_resource_1 = client.dataset(dataset_id=my_resource_1.dataset_id).resource(resource_id=my_resource_1.resource_id)

It makes sense to me.

But:

dataset = client.dataset.create(...)

does not feel off to me. If create is a static method we instantiate Dataset only after the creation.

@Pierlou
Contributor Author

Pierlou commented Mar 27, 2025

Thanks for the feedback 🙏 I'm not sure which I prefer between client.Dataset().create(...) (or even better client.Dataset.create(...)) and client.create_dataset(...) 🤔, but it would really make sense to be able to do Dataset(dataset_id).create_static_resource(...) too! I don't know whether we want to support both syntaxes or enforce one for internal simplicity.
Also I'll change the code so that creation returns an instance of the associated object, you're right!
And I don't know to what extent we would like to have the metadata accessible in the object, like Resource(resource_id).created_at (or Resource(resource_id)["created_at"]), or even Dataset(dataset_id).resources (to get a list of Resources) 🤔

@hacherix
Contributor

hacherix commented Mar 28, 2025

> Thanks for the feedback 🙏 I don't know what I prefer between client.Dataset().create(...) (or even better client.Dataset.create(...)) and client.create_dataset(...) 🤔 but it would really make sense to be able to do Dataset(dataset_id).create_static_resource(...) too! I don't know if we want to handle both syntaxes or force one for internal simplicity

We can go with both syntaxes imo. Since it will be a static method, one of the two can simply be a wrapper around the other, which keeps it easy to maintain.
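
To illustrate the wrapper idea, here is a small sketch in which the creation logic lives in one place on the client and the dataset-level method only forwards to it. All class bodies and the fake id generation are assumptions made so the example runs without network access:

```python
class Resource:
    """Hypothetical stand-in for the PR's Resource class."""

    def __init__(self, resource_id, dataset_id, payload):
        self.id = resource_id
        self.dataset_id = dataset_id
        self.payload = dict(payload)


class Client:
    def __init__(self):
        self._counter = 0  # used only to fake server-side ids

    def create_static_resource(self, dataset_id, payload):
        # Single place holding the creation logic (the upload is faked here).
        self._counter += 1
        return Resource(f"resource-{self._counter}", dataset_id, payload)

    def dataset(self, dataset_id):
        return Dataset(dataset_id, self)


class Dataset:
    def __init__(self, dataset_id, client):
        self.id = dataset_id
        self.client = client

    def create_static_resource(self, payload):
        # Thin wrapper: fills in dataset_id and delegates to the client.
        return self.client.create_static_resource(dataset_id=self.id, payload=payload)


client = Client()
my_dataset = client.dataset("abc123")
via_client = client.create_static_resource(dataset_id="abc123", payload={"title": "A"})
via_dataset = my_dataset.create_static_resource(payload={"title": "B"})
print(via_client.dataset_id == via_dataset.dataset_id)
```

Either entry point ends up in the same code path, so supporting both costs one forwarding method.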

> And I don't know to what extent we would like to have the metadata accessible in the object, like Resource(resource_id).created_at (or Resource(resource_id)["created_at"]), or even Dataset(dataset_id).resources (to get a list of Resources) 🤔

Looking at the API https://guides.data.gouv.fr/guide-data.gouv.fr/readme-1/reference/datasets we could make only the required elements directly accessible in the object. And the others would have to go through the dict/json so it is more flexible. What do you think?
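
One possible shape for this, as a sketch: a few fields are promoted to attributes and everything else stays reachable through dict-style access. The list of promoted fields below is a hypothetical choice, not the API's actual list of required fields, and the sample metadata is made up:

```python
class Resource:
    # Hypothetical selection of fields exposed as attributes.
    _ATTRIBUTES = ("id", "title", "url", "created_at")

    def __init__(self, metadata: dict):
        self._metadata = dict(metadata)
        # Promote the chosen fields; missing ones default to None.
        for name in self._ATTRIBUTES:
            setattr(self, name, self._metadata.get(name))

    def __getitem__(self, key):
        # Any field, promoted or not, stays available dict-style.
        return self._metadata[key]

    def get(self, key, default=None):
        return self._metadata.get(key, default)


resource = Resource({
    "id": "r-1",
    "title": "Mon titre",
    "url": "https://url.to/ressource.csv",
    "created_at": "2025-03-28T10:00:00",
    "filesize": 1024,  # not promoted, only reachable through the dict
})
print(resource.title, resource["filesize"])
```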

@Pierlou
Contributor Author

Pierlou commented Apr 3, 2025

This is what a workflow could look like with improvements made after your feedback 🙏 :

from datagouv import Client
client = Client(api_key=DATAGOUV_SECRET_API_KEY)
my_dataset = client.dataset().create({"title": "Brand new dataset"})  # this creates the dataset online, and returns an instance of Dataset
print(my_dataset.created_at)  # Datasets and Resources have some (the list can be refined) attributes set from their metadata

# let's populate our new dataset
for file in files:
    resource = client.resource().create_static(
        file_to_upload={"source_path": file["path"], "source_name": file["name"]},
        dataset_id=my_dataset.id,
        payload={"title": file["title"]},
    )
    # alternatively, it's possible to create a resource from the dataset itself, in which case you don't have to specify the dataset_id
    # resource = my_dataset.create_static(
    #     file_to_upload={"source_path": file["path"], "source_name": file["name"]},
    # )
    # both return an instance of Resource
    print(resource.url)  # url is a Resource only attribute

# and we also have a documentation online
remote_resource = my_dataset.create_remote(
    payload={"url": "http://url/to/doc.pdf", "title": "Documentation", "type": "documentation"},
)
print(remote_resource)  # print displays all the Resource's attributes in a dict

my_dataset.update({"title": "The true title"})  # the dataset's title is modified online
print(my_dataset.title)  # and the new title is directly changed in the object

We handle community resources as well, and can update the extras of objects.
Fully tested on the demo environment.
What do you think?
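
For readers wondering how `update` can change the object in place, here is a hedged sketch: the object re-reads its attributes from whatever the server returns. The `send` callable is a hypothetical transport injected for the example and stands in for the real HTTP call, which this PR may implement differently:

```python
from typing import Callable


class Dataset:
    def __init__(self, dataset_id: str, metadata: dict,
                 send: Callable[[str, dict], dict]):
        self.id = dataset_id
        # Hypothetical transport: (dataset_id, payload) -> updated metadata.
        self._send = send
        self._refresh(metadata)

    def _refresh(self, metadata: dict) -> None:
        # Keep the raw metadata and re-derive the exposed attributes from it.
        self.metadata = dict(metadata)
        self.title = self.metadata.get("title")

    def update(self, payload: dict) -> None:
        # Rely on the server's answer rather than on the payload we sent.
        self._refresh(self._send(self.id, payload))


def fake_send(dataset_id: str, payload: dict) -> dict:
    # Stand-in for the real request: echoes the payload as the new metadata.
    return {"id": dataset_id, **payload}


my_dataset = Dataset("abc", {"title": "Brand new dataset"}, fake_send)
my_dataset.update({"title": "The true title"})
print(my_dataset.title)
```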

2 participants