Improve API for run_recipe #129

@jheeffer

Let's think of how to redesign the API of run_recipe. This is still very work-in-progress ; )

The goal is a user-friendly API that lets you:

  • use a git repo as a dataset, local or remote
  • define a branch/commit for such git repo
  • use the remote head of a branch through the local repo, to minimize network traffic: either fetch and check out to another folder, or pull if it's fine to change the local repo
  • build from a local, private build environment that's safe from interference in both directions: another process can't change files during the build (e.g. via git checkout), and you don't have to pull/checkout in a shared local git repo to get the files you need for a commit

ddf run_recipe -o <output_path> --ddf_dir <dataset_path> -i <recipe_path> --show-tree -d -p --private --private_path <private_path> --remote-head

builds dataset using recipe and other base datasets

-i <recipe_path> defaults to ./etl/recipes/etl.yml or ./etl/recipes/etl.yaml
-o <output_path> defaults to ./
--ddf_dir <dataset_path> defaults to ./../../
--private [<private_path>] uses a private build environment; any base datasets are first copied to the private folder
--private_path defaults to ./etl/source/ maybe?
--remote-head uses the head of the origin remote branch for any local git repo dataset: git pull, or, when --private is set, git fetch && export to a different folder
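
A minimal sketch of the two --remote-head behaviours described above, just to make the difference concrete. The function name and the idea of exporting via git archive to a tar file are assumptions, not part of the proposal; the sketch only builds the git command lines and does not run them.

```python
from typing import List, Optional


def remote_head_commands(
    repo: str, branch: str, export_tar: Optional[str] = None
) -> List[List[str]]:
    """Return the git commands needed to get the remote head of `branch`.

    Hypothetical helper: without `export_tar` the shared local repo is
    updated in place (git pull). With `export_tar` (assumed to be an
    absolute path) the commit is fetched and written to a tar archive,
    leaving the working tree of the local repo untouched.
    """
    if export_tar is None:
        # Non-private build: changing the local repo is acceptable.
        return [["git", "-C", repo, "pull", "origin", branch]]
    return [
        # Download only the objects for the requested branch.
        ["git", "-C", repo, "fetch", "origin", branch],
        # FETCH_HEAD points at the commit that was just fetched.
        ["git", "-C", repo, "archive", "--format=tar",
         f"--output={export_tar}", "FETCH_HEAD"],
    ]
```

The resulting lists could be passed to subprocess.run one by one, and the archive unpacked into the private folder afterwards.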

This lets you build a dataset by simply running ddf run_recipe from the dataset's root.
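
To make the defaults concrete, here is a rough sketch of the proposed options using argparse. This is only an illustration of the proposal, not how ddf run_recipe is actually implemented; the dest names and the default_recipe helper are made up, and the meaning of -d and -p is not spelled out above, so they are plain flags here.

```python
import argparse
from pathlib import Path


def default_recipe(root: Path) -> Path:
    """Pick etl.yml or etl.yaml under <root>/etl/recipes (hypothetical helper)."""
    recipes = root / "etl" / "recipes"
    for name in ("etl.yml", "etl.yaml"):
        if (recipes / name).exists():
            return recipes / name
    # Neither file exists: fall back to the first documented default.
    return recipes / "etl.yml"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ddf run_recipe")
    # None means "resolve with default_recipe() at run time".
    parser.add_argument("-i", dest="recipe", default=None)
    parser.add_argument("-o", dest="outdir", default="./")
    parser.add_argument("--ddf_dir", default="./../../")
    parser.add_argument("--private", action="store_true")
    parser.add_argument("--private_path", default="./etl/source/")
    parser.add_argument("--remote-head", action="store_true")
    parser.add_argument("--show-tree", action="store_true")
    # Semantics of -d and -p are not described in this issue.
    parser.add_argument("-d", dest="d_flag", action="store_true")
    parser.add_argument("-p", dest="p_flag", action="store_true")
    return parser
```

With these defaults, build_parser().parse_args([]) already yields a complete configuration, which is what makes the no-argument invocation possible.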

Base dataset specifier format, used in recipes. Based on pip install.
We don't have an index yet, so options are either a local path or git url. If either is relative, it's relative to <dataset_path>

paths:

./open-numbers/ddf--gapminder--wdi
justonefolder

urls (resources are read from ./datapackage.json and then downloaded; this does not allow downloading assets, the etl folder, etc.):

https://raw.githubusercontent.com/open-numbers/ddf--gapminder--co2_emission/autogenerated/
http://example.com/dataset/
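
As a sketch of the url case: given the parsed content of datapackage.json, the files to download can be derived from the path field of each resource. The function names are made up for illustration, and the actual download step is left out.

```python
from urllib.parse import urljoin


def _as_dir(base_url: str) -> str:
    # urljoin only appends to the last path segment if it ends with "/".
    return base_url if base_url.endswith("/") else base_url + "/"


def datapackage_url(base_url: str) -> str:
    """URL of the datapackage.json that describes the dataset."""
    return urljoin(_as_dir(base_url), "datapackage.json")


def resource_urls(base_url: str, datapackage: dict) -> list:
    """Download URLs for every resource listed in the datapackage."""
    base = _as_dir(base_url)
    return [urljoin(base, res["path"]) for res in datapackage.get("resources", [])]
```

Only files listed under resources are reachable this way, which is why assets and the etl folder cannot be downloaded through a plain url.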

git urls (the git+ prefix allows extending this to other applications later; the usual git protocols are supported):

git+ssh://github.com/open-numbers/ddf--gapminder--wdi
git@github.com:open-numbers/ddf--gapminder--wdi (scp style syntax for ssh)
git+https://github.com/open-numbers/ddf--gapminder--wdi
git+file:///home/user/git/open-numbers/ddf--gapminder--wdi
git+open-numbers/ddf--gapminder--wdi

Adding a branch name, a commit hash, a tag name or a git ref is possible like so:

git+https://github.com/open-numbers/ddf--gapminder--wdi@master
git+https://github.com/open-numbers/ddf--gapminder--wdi@<tag_name>
git+https://github.com/open-numbers/ddf--gapminder--wdi@da39a3ee5e6b4b0d3255bfef95601890afd80709
git+https://github.com/open-numbers/ddf--gapminder--wdi@refs/pull/123/head
git+open-numbers/ddf--gapminder--wdi@master
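
A possible way to parse these specifiers, as a rough sketch only. DatasetSpec and parse_spec are hypothetical names, and the rule that the ref is whatever follows the last @ in the path part (so a user@host prefix is never mistaken for a ref) is an assumption borrowed from how pip treats VCS urls.

```python
import re
from typing import NamedTuple, Optional


class DatasetSpec(NamedTuple):
    kind: str            # "path", "url" or "git"
    location: str        # path or url without the git+ prefix and ref
    ref: Optional[str]   # branch, tag, commit hash or git ref, if given


# scp style syntax for ssh: user@host:path
_SCP_PREFIX = re.compile(r"^[\w.-]+@[\w.-]+:")


def _split_ref(path: str):
    """Split 'repo@ref' on the last '@'; the ref may itself contain '/'."""
    head, at, tail = path.rpartition("@")
    return (head, tail) if at else (tail, None)


def parse_spec(spec: str) -> DatasetSpec:
    scp = _SCP_PREFIX.match(spec)
    if scp:
        path, ref = _split_ref(spec[scp.end():])
        return DatasetSpec("git", scp.group(0) + path, ref)
    if spec.startswith("git+"):
        rest = spec[len("git+"):]
        scheme, sep, tail = rest.partition("://")
        if sep:
            # Keep the authority (possibly user@host) out of the ref search.
            authority, slash, path = tail.partition("/")
            prefix = f"{scheme}://{authority}{slash}"
        else:
            # Shorthand such as open-numbers/ddf--gapminder--wdi
            prefix, path = "", rest
        path, ref = _split_ref(path)
        return DatasetSpec("git", prefix + path, ref)
    if spec.startswith(("http://", "https://")):
        return DatasetSpec("url", spec, None)
    return DatasetSpec("path", spec, None)
```

For example, parse_spec("git+open-numbers/ddf--gapminder--wdi@master") gives kind "git", location "open-numbers/ddf--gapminder--wdi" and ref "master". Relative locations would still need to be resolved against <dataset_path> afterwards.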

Maybe we don't need --private if we define the base branches/commits in the recipe? I don't recall the exact problems we had in datakitchen that led us to think we'd need it. Should look that up in the chat logs.
