
Commit 352f3a9: Merge branch 'dev' (2 parents: 04334cc + 473c30b)


76 files changed: +4550 additions, -12785 deletions

CHANGELOG.md (18 additions, 6 deletions)

```diff
@@ -1,3 +1,21 @@
+## version 0.2.9 2017-01-03
+
+- added tests for all procedures
+- removed outdated procedures (align, accumulate)
+- more API docs, and all docs are now available on Read the Docs
+- new `--ddf_dir` option for `ddf run_recipe` #45
+- added options for `serve` procedures and the `serving` section. You should now
+  provide a list of dictionaries in the `serving` section, instead of a list of
+  ids as in the previous version
+- improvements and bug fixes
+
+## version 0.2.8 2016-12-13
+
+- new procedures: `window` (#25)
+- updated `groupby` procedure (#25)
+- updated `translate_column` procedure to include the function in `align` (#3)
+- minor improvements
+
 ## version 0.2.7 2016-12-06
 
 - use DAG to model the recipe. changes are:
@@ -9,9 +27,3 @@
 - added support for serve section
 - renamed procedure `add_concepts` to `extract_concepts` #40
 
-## version 0.28. 2016-12-13
-
-- new procedures: `window` (#25)
-- updated `groupby` procedure (#25)
-- updated `translate_column` procedure to include the function in `align` (#3)
-- minor improvements
```

README.md (16 additions, 49 deletions)

````diff
@@ -1,5 +1,18 @@
 # ddf_utils
 
+ddf_utils is a Python library and command line tool for people working with
+[Tabular Data Package][1] in the [DDF model][2]. It provides various functions for [ETL tasks][3],
+including string formatting, data transformation, generating datapackage.json,
+reading data from DDF datasets, running [recipes][4] (a declarative
+DSL designed to manipulate datasets to generate new datasets), and other
+functions we find useful in daily work at [Gapminder][5].
+
+[1]: http://specs.frictionlessdata.io/tabular-data-package
+[2]: https://github.com/open-numbers/wiki/wiki/Introduction-to-DDF
+[3]: https://en.wikipedia.org/wiki/Extract,_transform,_load
+[4]: https://ddf-utils.readthedocs.io/en/latest/recipe.html
+[5]: https://www.gapminder.org/
+
 ## Installation
 
 We are using python3 only features such as type signature in this repo.
@@ -16,53 +29,7 @@ try updating setuptools to latest version:
 
 and then reinstall ddf_utils should fix the problem.
 
-## Commandline helper
-
-we provide a commandline utility `ddf` for etl tasks. For now supported commands are:
-
-```
-$ ddf --help
-Usage: ddf [OPTIONS] COMMAND [ARGS]...
-
-Options:
-  --help  Show this message and exit.
-
-Commands:
-  cleanup             clean up ddf files or translation files
-  create_datapackage  create datapackage.json
-  merge_translation   merge all translation files from crowdin
-  new                 create a new ddf project
-  run_recipe          generate new ddf dataset with recipe
-  split_translation   split ddf files for crowdin translation
-```
-
-for each subcommand, you can run `ddf <subcommand> --help` to get help
-for that subcommand
-
-### Recipe
-
-document for recipe: [link](https://github.com/semio/ddf--gapminder--systema_globalis/blob/feature/autogenerated/etl/recipes/README.md)
-
-to run a recipe, simply run the following:
-
-```
-$ ddf run_recipe -i path_to_recipe -o outdir
-```
-
-to run a recipe without saving the result to disk, run
-
-```
-$ ddf run_recipe -i path_to_recipe -d
-```
-
-note that you should set `ddf_dir`/`recipes_dir`/`dictionary_dir` correctly so that
-the chef can find the right files. if there are includes in the recipe, only the top
-level `ddf_dir` will be used (so the ddf_dir setting in sub-recipes will be ignored).
-
-### useful API for etl tasks
-
-You can check the api documents at [readthedocs][1] or clone this repo and read them in
-docs/_html. Note that the chef module document is not complete on readthedocs due to a
-bug in their system.
+## Usage
 
-[1]: https://ddf-utils.readthedocs.io/en/latest/py-modindex.html
+Check the [documents](https://ddf-utils.readthedocs.io/en/latest/intro.html) for
+how to use ddf_utils.
````

ddf_utils/chef/cook.py (67 additions, 50 deletions)

```diff
@@ -8,9 +8,9 @@
 
 from . ingredient import *
 from . dag import DAG, IngredientNode, ProcedureNode
+from . helpers import read_opt
 from .. import config
 from . procedure import *
-from .. str import format_float_digits
 
 import logging
 
@@ -26,9 +26,30 @@ def _loadfile(f):
 
 
 # functions for reading/running recipe
-def build_recipe(recipe_file, to_disk=False):
-    """build a complete recipe file if there are includes in
-    recipe file, if no includes found then return the file as is.
+def build_recipe(recipe_file, to_disk=False, **kwargs):
+    """build a complete recipe object.
+
+    This function will check each part of the recipe, converting strings (the
+    ingredient ids, dictionary file names) into actual objects.
+
+    If there are includes in the recipe file, this function will run recursively.
+    If no includes are found, it returns the parsed object as is.
+
+    Parameters
+    ----------
+    recipe_file : `str`
+        path to recipe file
+
+    Keyword Args
+    ------------
+    to_disk : bool
+        if true, save the parsed result to a yaml file in the working dir
+
+    Other Parameters
+    ----------------
+    ddf_dir : `str`
+        path to search for DDF datasets; overwrites the config in the recipe
+
     """
     recipe = _loadfile(recipe_file)
 
@@ -64,6 +85,12 @@ def build_recipe(recipe_file, to_disk=False):
 
                 recipe['cooking'][p][i]['options']['dictionary'] = _loadfile(path)
 
+    # set the ddf search path if the option is provided
+    if 'ddf_dir' in kwargs.keys():
+        if 'config' not in recipe.keys():
+            recipe['config'] = AttrDict()
+        recipe.config.ddf_dir = kwargs['ddf_dir']
+
     if 'include' not in recipe.keys():
         return recipe
     else:  # append sub-recipe entities into main recipe
```
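The `ddf_dir` override added in the hunk above can be illustrated with a small standalone sketch. `apply_ddf_dir` is a hypothetical helper written for this example (the real code lives inline in `build_recipe` and uses an `AttrDict` rather than a plain dict):

```python
def apply_ddf_dir(recipe, **kwargs):
    # mimic build_recipe's behaviour: a ddf_dir keyword argument
    # takes precedence over the recipe's own config section
    if 'ddf_dir' in kwargs:
        recipe.setdefault('config', {})
        recipe['config']['ddf_dir'] = kwargs['ddf_dir']
    return recipe

recipe = {'ingredients': [], 'config': {'ddf_dir': '/old/path'}}
apply_ddf_dir(recipe, ddf_dir='/datasets/ddf')
print(recipe['config']['ddf_dir'])  # /datasets/ddf
```

This mirrors the new `--ddf_dir` option of `ddf run_recipe` mentioned in the changelog: the CLI value flows into `build_recipe` as a keyword argument and overwrites the recipe's config.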
```diff
@@ -156,6 +183,10 @@ def check_dataset_availability(recipe):
 
 
 def build_dag(recipe):
+    """build a DAG model for the recipe.
+
+    For more detail on the DAG model, see :py:mod:`ddf_utils.chef.dag`.
+    """
 
     def add_dependency(dag, upstream_id, downstream):
         if not dag.has_task(upstream_id):
@@ -203,9 +234,11 @@ def add_dependency(dag, upstream_id, downstream):
         if not dag.has_task(i):
             raise ValueError('Ingredient not found: ' + i)
     if 'serving' in recipe.keys():
+        if len(serving) > 0:
+            raise ValueError('can not have serve procedure and serving section at same time!')
         for i in recipe['serving']:
-            if not dag.has_task(i):
-                raise ValueError('Ingredient not found: ' + i)
+            if not dag.has_task(i['id']):
+                raise ValueError('Ingredient not found: ' + i['id'])
     # display the tree
     # dag.tree_view()
     return dag
```
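The stricter check above reflects the new `serving` format from the changelog: a list of dictionaries carrying an `id` (and optional `options`) rather than bare id strings. A minimal sketch of that validation, using only plain dicts (`validate_serving` is a hypothetical helper, not part of ddf_utils):

```python
def validate_serving(serving, known_ids):
    # each entry must be a dict carrying the ingredient id;
    # a bare id string is the old, no longer supported format
    for entry in serving:
        if not isinstance(entry, dict) or 'id' not in entry:
            raise ValueError('serving entries must be dicts with an id key')
        if entry['id'] not in known_ids:
            raise ValueError('Ingredient not found: ' + entry['id'])

serving = [{'id': 'population-datapoints', 'options': {'digits': 5}}]
validate_serving(serving, known_ids={'population-datapoints'})  # passes
```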
```diff
@@ -215,7 +248,25 @@ def run_recipe(recipe):
     """run the recipe.
 
     returns a dictionary. keys are `concepts`, `entities` and `datapoints`,
-    and values are ingredients returned by the procedures
+    and values are the ingredients defined in the `serve` procedures or `serving` section.
+    for example:
+
+    .. code-block:: python
+
+        {
+            "concepts": [{"ingredient": DataFrame1, "options": None}],
+            "datapoints": [
+                {
+                    "ingredient": DataFrame2,
+                    "options": {"digits": 5}
+                },
+                {
+                    "ingredient": DataFrame3,
+                    "options": {"digits": 1}
+                },
+            ]
+        }
+
     """
     try:
         config.DDF_SEARCH_PATH = recipe['config']['ddf_dir']
```
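The return shape documented in the new docstring can be reproduced in a few lines. This sketch uses placeholder strings where the real code carries DataFrame-backed ingredients, and `collect_dishes` is a hypothetical helper invented for illustration:

```python
from collections import defaultdict

def collect_dishes(served):
    # group (collection, ingredient, options) triples into the
    # {collection: [{'ingredient': ..., 'options': ...}, ...]} shape
    # that run_recipe returns
    dishes = defaultdict(list)
    for collection, ingredient, options in served:
        dishes[collection].append({'ingredient': ingredient, 'options': options})
    return dict(dishes)

dishes = collect_dishes([
    ('concepts', 'DataFrame1', None),
    ('datapoints', 'DataFrame2', {'digits': 5}),
    ('datapoints', 'DataFrame3', {'digits': 1}),
])
print(dishes['datapoints'][1]['options'])  # {'digits': 1}
```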
```diff
@@ -242,62 +293,28 @@ def run_recipe(recipe):
         func = p['procedure']
         if func == 'serve':
             ingredients = [dag.get_task(x).evaluate() for x in p['ingredients']]
-            [dishes[k].append(i) for i in ingredients]
+            opts = read_opt(p, 'options', default=dict())
+            [dishes[k].append({'ingredient': i, 'options': opts}) for i in ingredients]
             continue
         out = dag.get_task(p['result']).evaluate()
         # if there is no serving procedure/section, use the last output Ingredient object as the final result.
         if len(dishes[k]) == 0 and 'serving' not in recipe.keys():
             logger.warning('serving last procedure output for {}: {}'.format(k, out.ingred_id))
-            dishes[k].append(out)
+            dishes[k].append({'ingredient': out, 'options': dict()})
     # update dishes when there is a serving section
     if 'serving' in recipe.keys():
         for i in recipe['serving']:
-            ing = dag.get_task(i).evaluate()
+            opts = read_opt(i, 'options', default=dict())
+            ing = dag.get_task(i['id']).evaluate()
             if ing.dtype in dishes.keys():
-                dishes[ing.dtype].append(ing)
+                dishes[ing.dtype].append({'ingredient': ing, 'options': opts})
             else:
-                dishes[ing.dtype] = [ing]
+                dishes[ing.dtype] = [{'ingredient': ing, 'options': opts}]
     return dishes
 
 
 def dish_to_csv(dishes, outpath):
+    """save the recipe output to disk"""
     for t, ds in dishes.items():
         for dish in ds:
-            all_data = dish.get_data()
-            if isinstance(all_data, dict):
-                for k, df in all_data.items():
-                    # change boolean into string
-                    for i, v in df.dtypes.iteritems():
-                        if v == 'bool':
-                            df[i] = df[i].map(lambda x: str(x).upper())
-                    if t == 'datapoints':
-                        by = dish.key_to_list()
-                        path = os.path.join(outpath, 'ddf--{}--{}--by--{}.csv'.format(t, k, '--'.join(by)))
-                    elif t == 'concepts':
-                        path = os.path.join(outpath, 'ddf--{}.csv'.format(t))
-                    elif t == 'entities':
-                        domain = dish.key[0]
-                        if k == domain:
-                            path = os.path.join(outpath, 'ddf--{}--{}.csv'.format(t, k))
-                        else:
-                            path = os.path.join(outpath, 'ddf--{}--{}--{}.csv'.format(t, domain, k))
-                    else:
-                        raise ValueError('Not a correct collection: ' + t)
-
-                    if t == 'datapoints':
-                        df = df.set_index(by)
-                        if not np.issubdtype(df[k].dtype, np.number):
-                            try:
-                                df[k] = df[k].astype(float)
-                                # TODO: make floating precision an option
-                                df[k] = df[k].map(lambda x: format_float_digits(x, 5))
-                            except ValueError:
-                                logging.warning("data not numeric: " + k)
-                        else:
-                            df[k] = df[k].map(lambda x: format_float_digits(x, 5))
-                        df[[k]].to_csv(path, encoding='utf8')
-                    else:
-                        df.to_csv(path, index=False, encoding='utf8')
-            else:
-                path = os.path.join(outpath, 'ddf--{}.csv'.format(t))
-                all_data.to_csv(path, index=False, encoding='utf8')
+            dish['ingredient'].serve(outpath, **dish['options'])
```
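After this refactor, `dish_to_csv` delegates all file naming and number formatting to each ingredient's `serve()` method, passing the per-dish options through as keyword arguments. The toy class below shows the contract this relies on; it is a sketch under stated assumptions, not the library's actual `Ingredient` class:

```python
import os
import tempfile

class FakeIngredient:
    # stand-in for chef's Ingredient: serve() writes the data under
    # outpath and honours per-dish options such as digits
    def __init__(self, ingred_id, rows):
        self.ingred_id = ingred_id
        self.rows = rows

    def serve(self, outpath, digits=5, **options):
        path = os.path.join(outpath, 'ddf--{}.csv'.format(self.ingred_id))
        with open(path, 'w') as f:
            for key, value in self.rows:
                f.write('{},{}\n'.format(key, round(value, digits)))
        return path

outdir = tempfile.mkdtemp()
dishes = {'datapoints': [{'ingredient': FakeIngredient('pop', [('swe', 9.87654)]),
                          'options': {'digits': 2}}]}
# the new dish_to_csv body, applied to the toy class
for t, ds in dishes.items():
    for dish in ds:
        dish['ingredient'].serve(outdir, **dish['options'])
```

The design win is visible in the diff itself: roughly forty lines of type-specific path and precision logic collapse into a single dispatch, and new serving options only need changes in one place.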

ddf_utils/chef/dag.py (26 additions, 1 deletion)

```diff
@@ -1,12 +1,18 @@
 # -*- coding: utf-8 -*-
 
-"""the DAG module of chef"""
+"""the DAG model of chef
+
+The DAG consists of 2 types of nodes: IngredientNode and ProcedureNode.
+Each node has an `evaluate()` function, which returns an ingredient
+on eval.
+"""
 
 import pandas as pd
 from . import procedure as pc
 
 
 class BaseNode():
+    """The base node which IngredientNode and ProcedureNode inherit from"""
     def __init__(self, node_id, dag):
         self.node_id = node_id
         self.dag = dag
@@ -61,6 +67,10 @@ def detect_downstream_cycle(self, task=None):
 
 
 class IngredientNode(BaseNode):
+    """Node for storing dataset ingredients.
+
+    The evaluate() function of this type of node returns the ingredient as is.
+    """
     def __init__(self, node_id, ingredient, dag):
         super(IngredientNode, self).__init__(node_id, dag)
         self.ingredient = ingredient
@@ -70,6 +80,11 @@ def evaluate(self):
 
 
 class ProcedureNode(BaseNode):
+    """The node for storing procedure results.
+
+    The evaluate() function runs a procedure according to `self.procedure`, using
+    other nodes' data. Other nodes will be evaluated if needed.
+    """
     def __init__(self, node_id, procedure, dag):
         super(ProcedureNode, self).__init__(node_id, dag)
         self.procedure = procedure
@@ -110,6 +125,14 @@ def evaluate(self):
 
 
 class DAG():
+    """The DAG model.
+
+    .. note::
+
+        "task" in these functions means the same as "node". We will switch to
+        one name later.
+
+    """
     def __init__(self, task_dict=None):
         if not task_dict:
             self._task_dict = dict()
@@ -118,6 +141,7 @@ def __init__(self, task_dict=None):
 
     @property
     def roots(self):
+        """return the roots of the DAG"""
        return [t for t in self.tasks if not t.downstream_list]
 
     @property
@@ -133,6 +157,7 @@ def task_dict(self, task):
         raise AttributeError('can not set task_dict manually')
 
     def add_task(self, task):
+        """add a node to DAG"""
         if task.node_id in self.task_dict.keys():
             # only overwrite case is when procedure in ProcedureNode is None.
             if (isinstance(task, ProcedureNode) and
```
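The `evaluate()` contract described in these docstrings can be sketched without the library: an IngredientNode returns its data directly, while a ProcedureNode evaluates its upstream nodes first and runs its procedure on the results. This is a deliberately simplified model (the real classes also carry node ids, a shared DAG registry, and cycle detection):

```python
class IngredientNode:
    # leaf node: evaluate() returns the stored ingredient as is
    def __init__(self, ingredient):
        self.ingredient = ingredient

    def evaluate(self):
        return self.ingredient


class ProcedureNode:
    # inner node: evaluate() runs a procedure over the evaluated
    # upstream nodes and returns the resulting ingredient
    def __init__(self, procedure, upstream):
        self.procedure = procedure
        self.upstream = upstream

    def evaluate(self):
        return self.procedure(*[node.evaluate() for node in self.upstream])


# a two-node chain: double the values of a raw ingredient
raw = IngredientNode([1, 2, 3])
doubled = ProcedureNode(lambda xs: [x * 2 for x in xs], [raw])
print(doubled.evaluate())  # [2, 4, 6]
```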
