|
| 1 | +=========================== |
| 2 | +Functions, nodes & dataflow |
| 3 | +=========================== |
| 4 | + |
| 5 | +On this page, you'll learn how Hamilton converts your Python functions into nodes and then creates a dataflow. |
| 6 | + |
| 7 | +Functions |
| 8 | +--------- |
| 9 | + |
| 10 | +Hamilton requires you to write your code using functions. To get started, you simply need to: |
| 11 | + |
| 12 | +- `Annotate the types <https://docs.python.org/3/library/typing.html>`_ of the function parameters and return value. |
| 13 | +- Specify the function's dependencies through its parameter names. |
| 14 | +- Store your code in Python modules (``.py`` files). |
| 15 | + |
| 16 | +Since your code doesn't depend on special "Hamilton code", it can be reused any other way you want! |
| 17 | + |
| 18 | +Specifying dependencies |
| 19 | +~~~~~~~~~~~~~~~~~~~~~~~ |
| 20 | +In Hamilton, you define dependencies by matching parameter names with the names of other functions. Below, the function name and return type ``A() -> int`` match the parameter ``A: int`` found in functions ``B()`` and ``C()``. |
| 21 | + |
| 22 | +.. code-block:: python |
| 23 | + |
| 24 | +    def A() -> int: |
| 25 | +        """Constant value 35""" |
| 26 | +        return 35 |
| 27 | + |
| 28 | +    def B(A: int) -> float: |
| 29 | +        """Divide A by 3""" |
| 30 | +        return A / 3 |
| 31 | + |
| 32 | +    def C(A: int, B: float) -> float: |
| 33 | +        """Square A and multiply by B""" |
| 34 | +        return A**2 * B |
| 35 | + |
| 36 | + |
| 37 | +.. image:: ../_static/abc_basic.png |
| 38 | + :align: center |
| 39 | + |
| 40 | +The figure shows how Hamilton automatically assembled the functions ``A()``, ``B()``, and ``C()``. |
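
As a rough illustration of the name-matching idea, the sketch below reads each function's parameter names with the standard library's ``inspect`` module and treats them as its dependencies. The helper ``find_dependencies()`` is a hypothetical name introduced here; this is not Hamilton's actual implementation, only a minimal model of the rule described above.

```python
import inspect


def A() -> int:
    """Constant value 35"""
    return 35


def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3


def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A**2 * B


def find_dependencies(functions):
    """Map each function name to the parameter names it declares.

    Hypothetical helper for illustration only.
    """
    return {
        fn.__name__: list(inspect.signature(fn).parameters)
        for fn in functions
    }


dependencies = find_dependencies([A, B, C])
print(dependencies)  # {'A': [], 'B': ['A'], 'C': ['A', 'B']}
```

Each parameter name that matches another function's name becomes an edge in the figure above.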
| 41 | + |
| 42 | +Helper function |
| 43 | +~~~~~~~~~~~~~~~~ |
| 44 | + |
| 45 | +You can prefix a function name with an underscore (``_``) to prevent it from being included in a dataflow. Below, ``A()`` and ``B()`` are part of the dataflow, but ``_round_three_decimals()`` isn't. |
| 46 | + |
| 47 | +.. code-block:: python |
| 48 | + |
| 49 | +    def _round_three_decimals(value: float) -> float: |
| 50 | +        """Round value to 3 decimal places""" |
| 51 | +        return round(value, 3) |
| 52 | + |
| 53 | +    def A(external_input: int) -> int: |
| 54 | +        """Modulo 3 of input value""" |
| 55 | +        return external_input % 3 |
| 56 | + |
| 57 | +    def B(A: int) -> float: |
| 58 | +        """Divide A by 3""" |
| 59 | +        b = A / 3 |
| 60 | +        return _round_three_decimals(b) |
| 61 | + |
| 62 | + |
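
To make the underscore rule concrete, here is a minimal sketch that selects which functions would become nodes by checking the name prefix. It only assumes the convention stated above; the ``candidates`` list and the filtering expression are illustrative and do not reflect how Hamilton actually scans a module.

```python
def _round_three_decimals(value: float) -> float:
    """Round value to 3 decimal places"""
    return round(value, 3)


def A(external_input: int) -> int:
    """Modulo 3 of input value"""
    return external_input % 3


def B(A: int) -> float:
    """Divide A by 3"""
    b = A / 3
    return _round_three_decimals(b)


# Illustrative only: keep functions whose name does not start with "_".
candidates = [_round_three_decimals, A, B]
node_names = [fn.__name__ for fn in candidates if not fn.__name__.startswith("_")]

print(node_names)  # ['A', 'B']
print(B(A=2))      # 0.667, the helper is still callable from inside B()
```

The helper stays an ordinary Python function: ``B()`` can call it, but it never appears as a step of its own.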
| 63 | +Function naming tips |
| 64 | +~~~~~~~~~~~~~~~~~~~~ |
| 65 | +Hamilton strongly agrees with the `Zen of Python <https://peps.python.org/pep-0020/>`_ #2: "Explicit is better than implicit". Meaningful function names help document what functions do, so don't shy away from longer names. If you were to come across a function named ``life_time_value`` versus ``ltv`` versus ``l_t_v``, which one is most obvious? Remember, your code usually lives a lot longer than you think it will. |
| 66 | + |
| 67 | +Unlike the common practice of including meaningful verbs in function names (e.g., ``get_credentials()``, ``statistical_test()``), with Hamilton, the function name should more closely align with nouns. That's because the function name determines the node name and how data will be queried. Therefore, names that describe the node result rather than its action may be more readable (e.g., ``credentials()``, ``statistical_results()``). |
| 68 | + |
| 69 | + |
| 70 | +Nodes |
| 71 | +----- |
| 72 | + |
| 73 | +A node is a single "step" in a dataflow. Hamilton users write Python `functions` that Hamilton converts into `nodes`. Users never create nodes directly. |
| 74 | + |
| 75 | + |
| 76 | +Anatomy of a node |
| 77 | +~~~~~~~~~~~~~~~~~ |
| 78 | +The following figure and table detail how a Python function maps to a Hamilton node. |
| 79 | + |
| 80 | + |
| 81 | +.. image:: ../_static/function_anatomy.png |
| 82 | + :scale: 13% |
| 83 | + :align: center |
| 84 | + |
| 85 | + |
| 86 | +.. list-table:: |
| 87 | +   :header-rows: 1 |
| 88 | + |
| 89 | +   * - id |
| 90 | +     - Function components |
| 91 | +     - Node components |
| 92 | +   * - 1 |
| 93 | +     - Function name and return type annotation |
| 94 | +     - Node name and type |
| 95 | +   * - 2 |
| 96 | +     - Parameter names and type annotations |
| 97 | +     - Node dependencies |
| 98 | +   * - 3 |
| 99 | +     - Docstring |
| 100 | +     - Description of the node return value |
| 101 | +   * - 4 |
| 102 | +     - Function body |
| 103 | +     - Implementation of the node |
| 104 | + |
| 105 | + |
| 106 | +Since functions almost always map 1-to-1 to nodes, the two terms are used interchangeably. However, there are exceptions that we'll discuss later in this guide. |
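
The mapping in the table can be sketched with a small data structure. The class ``NodeSketch`` and the helper ``to_node()`` below are hypothetical names used only for illustration; Hamilton's real node class is not shown in this guide and may differ.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Dict, get_type_hints


@dataclass
class NodeSketch:
    """Hypothetical stand-in for a node, mirroring the table rows."""

    name: str                     # row 1: function name
    type: Any                     # row 1: return type annotation
    dependencies: Dict[str, Any]  # row 2: parameter names and types
    description: str              # row 3: docstring
    implementation: Callable      # row 4: function body


def to_node(fn: Callable) -> NodeSketch:
    """Build a NodeSketch from a fully annotated function."""
    hints = get_type_hints(fn)
    return_type = hints.pop("return")  # what remains are the parameters
    return NodeSketch(
        name=fn.__name__,
        type=return_type,
        dependencies=hints,
        description=inspect.getdoc(fn) or "",
        implementation=fn,
    )


def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3


node = to_node(B)
print(node.name, node.dependencies, node.description)
```

Everything the node needs comes from the function itself, which is why users never have to construct nodes by hand.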
| 107 | + |
| 108 | +Dataflow |
| 109 | +-------- |
| 110 | + |
| 111 | +From a collection of nodes, Hamilton automatically assembles the dataflow. For each node, it creates edges between that node and its dependencies, resulting in a `dataflow <https://en.wikipedia.org/wiki/Dataflow_programming>`_ (or a `graph <https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)>`_ in more mathematical terms). |
| 112 | + |
| 113 | +From the user perspective, you just have to give Hamilton a Python module containing your functions for it to generate your dataflow! This is a key difference from popular orchestration / pipeline / workflow frameworks (Airflow, Kedro, Prefect, VertexAI, SageMaker, etc.). |
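
To show what "assembling" can mean in practice, the sketch below derives edges from parameter names, orders the nodes with the standard library's ``graphlib.TopologicalSorter`` (Python 3.9+), and then calls each function with its dependencies' results. This is an assumption-laden toy, not Hamilton's engine: it ignores external inputs, type checking, and everything else a real driver handles.

```python
import inspect
from graphlib import TopologicalSorter


def A() -> int:
    """Constant value 35"""
    return 35


def B(A: int) -> float:
    """Divide A by 3"""
    return A / 3


def C(A: int, B: float) -> float:
    """Square A and multiply by B"""
    return A**2 * B


functions = {fn.__name__: fn for fn in (A, B, C)}

# Edges: each node points to the set of nodes it depends on.
graph = {
    name: set(inspect.signature(fn).parameters)
    for name, fn in functions.items()
}

# A valid execution order places every dependency before its dependents.
order = list(TopologicalSorter(graph).static_order())

results = {}
for name in order:
    kwargs = {dep: results[dep] for dep in graph[name]}
    results[name] = functions[name](**kwargs)

print(order)  # ['A', 'B', 'C']
```

No wiring was written by hand: the graph follows entirely from the function signatures.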
| 114 | + |
| 115 | +How other frameworks build graphs |
| 116 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 117 | +In most frameworks, you first define steps / tasks / components. Then, you need to create your dataflow by explicitly specifying the relationships between nodes. |
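
For contrast, here is a framework-agnostic sketch of that explicit style. The names ``step_a``, ``step_b``, ``steps``, and ``edges`` are all hypothetical and do not correspond to the API of Airflow, Kedro, Prefect, or any other real tool.

```python
# Hypothetical sketch of explicit wiring; not the API of any real framework.


def step_a() -> int:
    """Constant value 35"""
    return 35


def step_b(value: int) -> float:
    """Divide the upstream value by 3"""
    return value / 3


# The step registry and the wiring are declared apart from the step
# definitions, often in a different file.
steps = {"step_a": step_a, "step_b": step_b}
edges = [("step_a", "step_b")]  # (upstream, downstream)

# Toy runner for a linear chain: feed each upstream result downstream.
results = {"step_a": steps["step_a"]()}
for upstream, downstream in edges:
    results[downstream] = steps[downstream](results[upstream])

print(results["step_a"])  # 35
```

Reading ``step_b`` alone does not reveal that its input comes from ``step_a``; that information lives only in ``edges``.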
| 118 | + |
| 119 | +Readability |
| 120 | +^^^^^^^^^^^ |
| 121 | +In such frameworks, the code for ``step A`` doesn't tell you how it relates to ``step B`` or to the broader dataflow. Hamilton solves this problem by tying function, node, and dataflow definitions together in a single place. The ratio of reading to writing code can be as high as `10:1 <https://www.goodreads.com/quotes/835238-indeed-the-ratio-of-time-spent-reading-versus-writing-is>`_, especially for complex dataflows, so optimizing for readability is very high-value. |
| 122 | + |
| 123 | +Maintainability |
| 124 | +^^^^^^^^^^^^^^^ |
| 125 | +Typically, editing a dataflow (new feature, debugging, etc.) alters both what a **node** does and how the **dataflow** is structured. Consequently, changes to ``step A`` require you to manually make consistent edits to the dataflow definition, which likely lives in another file. In enterprise settings, it can become difficult to discover and track every location where ``step A`` is used (potentially tens or hundreds of pipelines), increasing the likelihood of breaking changes. Hamilton avoids this problem entirely because changes to the node definitions, and thus to the dataflow, propagate to all places where this code is used. This improves maintainability and development speed by making code changes easier. |
| 126 | + |
| 127 | +Recap |
| 128 | +-------- |
| 129 | +- Users write Python functions into modules with proper naming and typing |
| 130 | +- Helper functions use an underscore prefix (e.g., ``_helper()``) |
| 131 | +- Hamilton converts functions into nodes |
| 132 | +- Hamilton automatically assembles nodes into a dataflow |
| 133 | + |
| 134 | + |
| 135 | +Next step |
| 136 | +--------- |
| 137 | +So far, we learned how to write Hamilton code for our dataflow. Next, we'll explore how to: |
| 138 | + |
| 139 | +1. Convert a Python module into a dataflow |
| 140 | +2. Visualize a dataflow |
| 141 | +3. Execute a dataflow |
| 142 | +4. Gather and store results of a dataflow |