TreeHouse™ is an open source data product that models, transforms and flattens hierarchical data (i.e. tree structures) for data warehouse / lakehouse systems, primarily to simplify and accelerate BI and reporting use cases that require hierarchies (e.g. financial reporting).
TreeHouse is currently scoped to SAP hierarchies in particular.
The TreeHouse project is an independent open-source initiative and is not affiliated with, endorsed, or sponsored by SAP SE or its affiliates. SAP® and other SAP product names are trademarks or registered trademarks of SAP SE (or its affiliates) in Germany and other countries.
The need for a product like TreeHouse arises from the fact that SAP's pre-packaged data warehouse content, including hierarchy data models and ETL code, is only available from SAP's integrated data warehouse products, i.e. Business Warehouse and Datasphere. Thus, SAP customers integrating SAP hierarchy data into 3rd party data platforms such as Snowflake or Databricks (or even stand-alone products like SAP HANA) are forced to reverse engineer the exact same logic that's been solved for years in SAP's own solutions.
TreeHouse thus solves this particular need, providing an end-to-end, fully open source data product for modeling and transforming SAP hierarchies for any target SQL-based (or even MDX-based) data platform.
Developers implementing TreeHouse should be familiar with:
- 3NF and dimensional data modeling
- hierarchies* (and related terminology)
- intermediate SQL (including recursive common table expressions (CTEs))
While not required, it is helpful for developers to have familiarity with the SAP GUI, ABAP Data Dictionary, and helpful online resources such as se80.co.uk.
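Since recursive CTEs are the workhorse of hierarchy processing, here is a minimal, self-contained sketch of the technique using Python's bundled SQLite. The table and node names are illustrative only (not TreeHouse objects); it flattens a tiny parent-child table into root-to-leaf paths:

```python
import sqlite3

# Illustrative parent-child table (names are hypothetical, not TreeHouse objects)
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (parent TEXT, child TEXT);
    INSERT INTO edges VALUES
        (NULL, 'P0001'),                -- root node has no parent
        ('P0001', 'C1000004'),
        ('C1000004', 'C100000403');
""")

# Recurse from the root downward, accumulating each node's path and level
rows = con.execute("""
    WITH RECURSIVE hier (node, path, level) AS (
        SELECT child, child, 1 FROM edges WHERE parent IS NULL
        UNION ALL
        SELECT e.child, h.path || '/' || e.child, h.level + 1
        FROM edges e
        JOIN hier h ON e.parent = h.node
    )
    SELECT node, path, level FROM hier ORDER BY level
""").fetchall()

print(rows[-1])  # deepest node with its full root-to-leaf path
```

The same anchor-member/recursive-member shape carries over to Snowflake and most other platforms, with only dialect-level differences.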
Tip
Most LLMs have a relatively strong understanding of the SAP data model and SQL (including recursive CTEs), and thus can generally be leveraged to help answer questions related to TreeHouse. But you're also welcome to reach out to @jchesch with any questions.
Tip
Developers unfamiliar with hierarchies may benefit from this 6-part series from Snap Analytics.
The goals of TreeHouse are three-fold:
1. Semantic model: capture and share the end-to-end ETL business logic and data models required for transforming and persisting SAP hierarchies that can easily be implemented in any legacy or modern data stack.
2. Simplicity: modularize the hierarchy "meta-model", drastically simplifying the overall codebase.
3. Performance: share parallelizable orchestration logic that can maximize performance.
TreeHouse is currently implemented in Snowflake in order to provide the community with a reference implementation on one of the more widely adopted platforms (assuming a platform-specific implementation is more helpful than generic ISO/ANSI SQL code).
Note
TreeHouse is written primarily in standard SQL on Snowflake, making it easily portable to other platforms. To try out TreeHouse, spin up a free trial of Snowflake and check out the Try It Out section.
That being the case, TreeHouse is NOT particularly opinionated about:

- Data platform (Snowflake, Databricks, BigQuery, Redshift, etc.)
- Cross-cutting concerns (input validation, exception handling, logging, orchestration, namespaces, etc.)
- Implementation (UDTFs, stored procedures, dbt/SQLMesh models, language, etc.)
- Physical data model (i.e. physical tables vs. logical views)
Such decisions and concerns should be addressed via customer-specific standards.
Tip
TreeHouse provides hierarchy data models for both SQL and MDX consumption. Worth highlighting is Cube.dev's MDX API, which can help bridge the gap between MDX front-ends and SQL back-ends for "hybrid" BI stacks.
The standard ABAP data dictionary entries for SAP hierarchy tables are rather lacking. Thus, a custom data dictionary has been provided below for the core hierarchy tables from SAP (`SETHEADER`, `SETNODE` and `SETLEAF`).
Note
The terms "hierarchy type" and "master data type" are used somewhat interchangeably, since hierarchy types correspond directly with master data types. For example, a "profit center hierarchy" corresponds with "profit center master data".
Note that only those columns relevant to hierarchy processing are documented below. (See the ABAP data dictionary, or sites like se80.co.uk, for descriptions of additional columns if/as needed.)
`SETHEADER` can be thought of as master data for all root and inner nodes. In the context of SAP hierarchies, it's only really needed for retrieving root nodes. (Inner nodes need to come from `SETNODE` due to additional required attributes that are not found in `SETHEADER`.)
| Primary Key | Column Name | Alias | Description |
|---|---|---|---|
| X | `MANDT` | `CLIENT_ID` | SAP Client ID (unique per tenant) |
| X | `SETCLASS` | `HIER_TYPE_ID` | Hierarchy type ID (e.g. `'0106'` represents "Profit Center Group", `'0101'` represents "Cost Center Group", etc.). See lookup table `SETCLST` for more. |
| X | `SUBCLASS` | `SUB_KEY_ID` | The second field of standard SAP unique identifiers (compound primary keys), where relevant. It corresponds with compound key fields like `KTOPL` (Chart of Accounts), `KOKRS` (Controlling Area), etc. from various master data tables. |
| X | `SETNAME` | `NODE_ID` | Unique identifier for each SAP hierarchy node (excluding leaf nodes). Sometimes aliased as `CHILD_NODE_ID` to match other data sources when `UNION`ed. |
|  | `SETTYPE` |  | Four possible values used for filtering: `S` - single dimension, e.g. "Profit Center". `D` - data, i.e. key figure sets; out of scope for TreeHouse. `B` - basic set; points directly to leaf nodes. `M` - multidimensional; out of scope for TreeHouse. |
`SETNODE` stores parent-child relationships for inner nodes.
| Primary Key | Column Name | Alias | Description |
|---|---|---|---|
| X | `MANDT` | `CLIENT_ID` | SAP Client ID (unique per tenant) |
| X | `SETCLASS` | `HIER_TYPE_ID` | Hierarchy type ID (e.g. `'0106'` represents "Profit Center Group", `'0101'` represents "Cost Center Group", etc.). See lookup table `SETCLST` for more. |
| X | `SUBCLASS` | `SUB_KEY_ID` | The second field of standard SAP unique identifiers (compound primary keys), where relevant. It corresponds with compound key fields like `KTOPL` (Chart of Accounts), `KOKRS` (Controlling Area), etc. |
| X | `SETNAME` | `NODE_ID` | Unique identifier for each SAP hierarchy node (excluding leaf nodes). In the context of `SETNODE`, it is the "parent" node, whereas `SUBSETNAME` below is the "child" node. |
| X | `LINEID` |  | Not used in the context of hierarchies. Can cause duplicates and multi-parent hierarchies, which need to be managed carefully. See Data Quality discussion below. |
|  | `SUBSETCLS` |  | Irrelevant for TreeHouse. See Data Quality discussion below. |
|  | `SUBSETSCLS` |  | Irrelevant for TreeHouse. See Data Quality discussion below. |
|  | `SUBSETNAME` | `CHILD_NODE_ID` | Child `NODE_ID` for a given parent/child relationship in `SETNODE`, i.e. `SETNAME`/`SUBSETNAME`. |
|  | `SEQNR` |  | Sequence number capturing the sort order of a given node (i.e. amongst its siblings) |
`SETLEAF` stores relationships between "penultimate" (second-to-last) nodes and leaf nodes. The historical motivation for this design from SAP is saving on storage, i.e. storing a single record that provides wildcard values and/or ranges that can join to many right-hand records from master data tables.
| Primary Key | Column Name | Alias | Description |
|---|---|---|---|
| X | `MANDT` | `CLIENT_ID` | SAP Client ID (unique per tenant) |
| X | `SETCLASS` | `HIER_TYPE_ID` | Hierarchy type ID (e.g. `'0106'` represents "Profit Center Group", `'0101'` represents "Cost Center Group", etc.). See lookup table `SETCLST` for more. |
| X | `SUBCLASS` | `SUB_KEY_ID` | The second field of standard SAP unique identifiers (compound primary keys), where relevant. It corresponds with compound key fields like `KTOPL` (Chart of Accounts), `KOKRS` (Controlling Area), etc. |
| X | `SETNAME` | `NODE_ID` | Unique identifier for each SAP hierarchy node (excluding leaf nodes). Also aliased as `CHILD_NODE_ID` in the context of hierarchical relationships. |
| X | `LINEID` |  | Not used in the context of hierarchies. Can cause duplicates and multi-parent hierarchies, which need to be managed carefully. See Data Quality discussion below. |
|  | `VALOPTION` |  | Join condition type (`EQ`, `BT`, `CP`). |
|  | `VALFROM` | `CHILD_NODE_ID_FROM` | Leaf node (retrieved based on dynamic `INNER JOIN` logic determined by `VALOPTION`). |
|  | `VALTO` | `CHILD_NODE_ID_TO` | Used in range joins when `VALOPTION = 'BT'`, i.e. `... INNER JOIN x ON id BETWEEN CHILD_NODE_ID_FROM AND CHILD_NODE_ID_TO ...` |
|  | `SEQNR` |  | Specifies sort order of sibling nodes. |
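To make the `VALOPTION`-driven join concrete, here is a hedged sketch using Python's bundled SQLite, with simplified column names and made-up sample values. (The `CP` pattern option is omitted, since SAP's wildcard syntax would first need translating into SQL `LIKE` patterns.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE setleaf (valoption TEXT, valfrom TEXT, valto TEXT);
    CREATE TABLE master  (id TEXT);
    INSERT INTO setleaf VALUES ('EQ', 'P100', ''),       -- exact match
                               ('BT', 'P200', 'P299');   -- range ("between") match
    INSERT INTO master  VALUES ('P100'), ('P205'), ('P300');
""")

# One join whose condition is selected dynamically per record via VALOPTION
leaves = con.execute("""
    SELECT m.id, l.valoption
    FROM setleaf l
    INNER JOIN master m
        ON (l.valoption = 'EQ' AND m.id = l.valfrom)
        OR (l.valoption = 'BT' AND m.id BETWEEN l.valfrom AND l.valto)
    ORDER BY m.id
""").fetchall()

print(leaves)  # 'P300' matches no SETLEAF record, so it is not a leaf here
```

This illustrates how a single `SETLEAF` record can fan out to many master data records, which is exactly the storage-saving design described above.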
For unknown reasons, SAP's data model enforces uniqueness on a technical auto-incrementing column called `LINEID` rather than on the expected columns, i.e. the child node columns (`SETNODE.SUBSETNAME` and `SETLEAF.VALFROM`). This can lead to either duplicates or multi-parent scenarios, illustrated here with two simple examples:
(In this case, `SETNAME` is the parent node, and `SUBSETNAME` is the child node, as documented above.)
Generally speaking, the expected solution is to have data stewards in charge of SAP hierarchies fix problematic data in SAP, and then to re-process. Of course, you'll need to ensure some combination of:
- unique constraints,
- data quality checks, and/or
- exception handling
to identify such data quality issues before propagating to target tables.
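Such checks can be expressed as simple aggregate queries. A minimal sketch (Python's bundled SQLite, hypothetical sample values) that flags both duplicate edges and multi-parent children in a `SETNODE`-shaped table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE setnode (setname TEXT, subsetname TEXT, lineid INTEGER);
    INSERT INTO setnode VALUES
        ('A', 'X', 1), ('A', 'X', 2),   -- duplicate edge (same parent/child, new LINEID)
        ('B', 'Y', 1), ('C', 'Y', 1);   -- multi-parent: node Y appears under B and C
""")

# Duplicate parent/child edges that would double-count facts downstream
dupes = con.execute("""
    SELECT setname, subsetname, COUNT(*) AS n
    FROM setnode GROUP BY setname, subsetname HAVING COUNT(*) > 1
""").fetchall()

# Children claimed by more than one parent (multi-parent hierarchy)
multi_parent = con.execute("""
    SELECT subsetname, COUNT(DISTINCT setname) AS n_parents
    FROM setnode GROUP BY subsetname HAVING COUNT(DISTINCT setname) > 1
""").fetchall()

print(dupes, multi_parent)
```

Either query returning rows is a signal to halt the load (or route the offending records to an exception table) before target tables are populated.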
SAP supports the management of "heterogeneous" hierarchies in which different nodes are fundamentally distinct, i.e. represent different types of entities. For example, a "bill of materials" (BOM) hierarchy models the relationship of a product and all of its constituent components and sub-components, each of which may have drastically different attributes. As such, analytics against "heterogeneous" hierarchies can be fraught with challenges, as there's not a shared data model amongst entities represented by each node.
Such heterogeneity is represented via the following SQL pseudocode against the table `SETNODE` (note that the `WHERE` clause repeats the source columns, since most SQL dialects don't allow referencing `SELECT` aliases there):

```sql
SELECT
    ...,
    SETCLASS   AS PARENT_NODE_HIER_TYPE_ID,
    SUBCLASS   AS PARENT_NODE_SUB_KEY_ID,
    SUBSETCLS  AS CHILD_NODE_HIER_TYPE_ID,
    SUBSETSCLS AS CHILD_NODE_SUB_KEY_ID
FROM
    SETNODE
WHERE
    SETCLASS != SUBSETCLS OR  -- different hierarchy types between levels
    SUBCLASS != SUBSETSCLS    -- different sub-keys between levels
```
For most use cases that leverage SAP hierarchies, such as financial reporting, a reasonable assumption is that only "homogeneous" hierarchies are in scope (i.e. where every node corresponds with the same type of entity), though this should be confirmed with the business and profiled against the source data.
If there is a sufficient risk of heterogeneous hierarchies in your environment, then you'll likely want to introduce some combination of:
- unique constraints,
- data quality checks, and/or
- exception handling
to identify such data quality issues before propagating to target tables, and/or to load such heterogeneous hierarchies into a separate model (out of scope of TreeHouse).
Tip
If you have a need to persist such heterogeneous hierarchies in your target system and need help refactoring the code and data model accordingly, feel free to reach out to @jchesch.
The following data dictionary describes the table(s) and views available for modeling hierarchies in TreeHouse. As previously discussed, the emphasis in TreeHouse is on flattening hierarchies into a level structure, which practically is accomplished by populating the HIER_FLAT table and then parsing out level columns in the various views made available. For MDX consumption, HIER_PARENT_CHILD is available as a corresponding parent-child model.
Note
As noted previously, your needs might justify physically persisting what TreeHouse logically models (as views / UDTFs), or logically modeling what TreeHouse persists as physical tables.
Important
Throughout this documentation (and even in the code comments), the terms "flat" (or flatten) and "level" are used somewhat interchangeably, since they both represent transformations that are required to populate a target hierarchy data model that can be easily queried via SQL. However, these two terms are not exactly synonymous.
The source of the ambiguity stems from the analogy of "flattening". When we think of "flattening", we tend to think of taking something tall (and skinny), and making it shorter and wider. In the code, these are actually treated as two operations.
- The "shortening" aspect of flattening, i.e. reducing the record count, takes place when data is filtered down only to leaf-level records, i.e. the filter `WHERE NODE_TYPE = 'LEAF'` in the demo code provided in 05_RUN.
- The "widening" aspect of flattening, i.e. extending the model with more columns, takes place when ancestor nodes are parsed out into dedicated columns for each level, i.e. the array parsing logic (e.g. `PATH_WITH_ATTRIBUTES[0]:NODE::VARCHAR AS LVL_1_NODE`) of the level hierarchy views such as ACCOUNT_HIER.
Efforts have been made to clarify the use of these terms, but there may still be some latent ambiguity in how they're used.
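The two operations can be sketched end-to-end with Python's bundled SQLite against a hypothetical mini-table. (The widening step is done in Python here, rather than with Snowflake's `[N]:`/`::` path syntax, to keep the sketch self-contained.)

```python
import sqlite3, json

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hier_flat (node_type TEXT, path_with_attributes TEXT)")
path = json.dumps([
    {"LEVEL": 1, "NODE": "P0001",    "NODE_TEXT": "Utilities & Facilities"},
    {"LEVEL": 2, "NODE": "C1000004", "NODE_TEXT": "Facilities"},
])
con.execute("INSERT INTO hier_flat VALUES ('INNER', ?), ('LEAF', ?)", (path, path))

# "Shorten": keep only leaf-level records
leaf_rows = con.execute(
    "SELECT path_with_attributes FROM hier_flat WHERE node_type = 'LEAF'"
).fetchall()

# "Widen": parse each ancestor in the path array into a dedicated level column
widened = []
for (raw_path,) in leaf_rows:
    nodes = json.loads(raw_path)
    widened.append({f"LVL_{n['LEVEL']}_NODE": n["NODE"] for n in nodes})

print(widened)
```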
Following is a high-level data dictionary for the TreeHouse data model, including tables and views.
Descriptions below are short, i.e. for reference. Longer table and column descriptions can be found in the `COMMENTS` of the associated DDL scripts.
`STG_HIER` is an optional staging table that can easily be leveraged for populating either/both of the available target tables (HIER_FLAT and/or HIER_PARENT_CHILD) as demoed in the test/demo script 06_RUN in the Try It Out section.
| Primary Key | Column Name | Description | Sample Value |
|---|---|---|---|
| X | `CLIENT_ID` | Client ID, i.e. tenant identifier in multi-tenant systems (the `MANDT` field from SAP) | 100 |
| X | `HIER_TYPE_ID` | Hierarchy type ID | 0106 |
|  | `HIER_TYPE_TEXT` | Hierarchy type description | Profit Center Groups |
| X | `HIER_ID` | Hierarchy root node ID | P0001 |
|  | `HIER_TEXT` | Hierarchy description | Utilities & Facilities |
| X | `SUB_KEY_ID` | Sub-key ID | KW01 |
|  | `SUB_KEY_TEXT` | Sub-key description | Controlling Area KW01 |
|  | `PARENT_NODE_ID` | Parent Node ID | C1000004 |
| X | `CHILD_NODE_ID` | (Child) Node ID | C100000403 |
|  | `CHILD_NODE_TEXT` | (Child) Node ID text description | Quality Control Lab |
|  | `LEVEL` | Which hierarchy level this node is on | 3 |
|  | `SEQNR` | Sequence number describing sort order of this node amongst sibling nodes | 5 |
|  | `NODE_TYPE` | Categorizes nodes as ROOT, INNER or LEAF | INNER |
|  | `PATH_WITH_ATTRIBUTES` | Path (array) of nodes (JSON objects) including text, level and seqnr attributes | `[{"LEVEL":1, "NODE":"P0001", "NODE_TEXT": "Utilities & Facilities", "SEQNR": 0 }, ... ]` |
Flattened hierarchy table, intended for logical and/or physical parsing into a fully-fledged level hierarchy model, i.e. for SQL consumption. A given record represents a leaf node and its associated unique identifier columns and descriptions, along with its full path (ancestor nodes) and associated attributes in an array of JSON objects.
| Primary Key | Column Name | Description | Sample Value |
|---|---|---|---|
| X | `CLIENT_ID` | Client ID, i.e. tenant identifier in multi-tenant systems (the `MANDT` field from SAP) | 100 |
| X | `HIER_TYPE_ID` | Hierarchy type ID | 0106 |
|  | `HIER_TYPE_TEXT` | Hierarchy type text | Profit Center Groups |
| X | `HIER_ID` | Hierarchy root node ID | P0001 |
|  | `HIER_TEXT` | Hierarchy description | Utilities & Facilities |
| X | `SUB_KEY_ID` | Sub-key ID | KW01 |
|  | `SUB_KEY_TEXT` | Sub-key text | Controlling Area KW01 |
|  | `PATH_WITH_ATTRIBUTES` | JSON array of node-level attributes | `[{"LEVEL":1, "NODE":"P0001", "NODE_TEXT": "Utilities & Facilities", "SEQNR": 0 }, ... ]` |
| X | `LEAF_NODE_ID` | Leaf node ID, used as join key | P000103045 |
Parent-child hierarchy table, generally for consumption by MDX and/or related dialects.
| Primary Key | Column Name | Description | Sample Value |
|---|---|---|---|
| X | `CLIENT_ID` | Client ID, i.e. tenant identifier in multi-tenant systems (the `MANDT` field from SAP) | 100 |
| X | `HIER_TYPE_ID` | Hierarchy type ID | 0106 |
|  | `HIER_TYPE_TEXT` | Hierarchy type description | Profit Center Groups |
| X | `HIER_ID` | Hierarchy root node ID | P0001 |
|  | `HIER_TEXT` | Hierarchy description | Utilities & Facilities |
| X | `SUB_KEY_ID` | Sub-key ID | KW01 |
|  | `SUB_KEY_TEXT` | Sub-key text | Controlling Area KW01 |
|  | `PARENT_NODE_ID` | Parent Node ID |  |
| X | `CHILD_NODE_ID` | (Child) Node ID |  |
|  | `CHILD_NODE_ID_TEXT` | Node description |  |
|  | `SEQNR` | Sequence number describing sort order of this node amongst sibling nodes | 5 |
|  | `LEVEL` | Which hierarchy level this node is on | 3 |
|  | `NODE_TYPE` | Categorizes nodes as ROOT, INNER or LEAF | INNER |
Note
When MDX dialects recursively traverse a parent-child hierarchy, they only consider attributes tied to the "child" node, which is why you don't see `PARENT_NODE_ID_TEXT` in the definition of `HIER_PARENT_CHILD` (or any other attributes tied to parent nodes, such as `PARENT_LEVEL` or `PARENT_SEQNR`).
User-facing consumption views for hierarchies, intended for semantic layer / BI consumption, all follow the same structure, with different filters for different hierarchy types. Included in TreeHouse's test data and test consumption views, all easily extendable to other hierarchy types, are:
Following is a slightly generic representation of the `ACCOUNT_HIER` view to demonstrate the basic concept, i.e. show how level columns and associated attributes are parsed out from the HIER_FLAT table along with the requisite hierarchy type filter (i.e. `HIER_TYPE_ID = '0109'`).
Important
As defined below, this view would require a runtime filter to determine which account hierarchy a user wants to see. (Typically this is passed in from a UI prompt that the user responds to.) Be careful about when/how such a filter is applied, as it can lead to unexpected behavior. Further discussion here.
```sql
CREATE OR REPLACE VIEW ACCOUNT_HIER
AS
SELECT
    HIER_TYPE_ID AS HIER_TYPE_ID,
    HIER_TYPE_TEXT AS HIER_TYPE_TEXT,
    SUB_KEY_ID AS CHART_OF_ACCOUNTS, -- Alias SUB_KEY_ID back to its more meaningful business description
    SUB_KEY_TEXT AS CHART_OF_ACCOUNTS_TEXT,
    HIER_ID AS HIERARCHY_ID,
    HIER_TEXT AS HIERARCHY_TEXT,
    PATH_WITH_ATTRIBUTES[0]:NODE::VARCHAR AS LVL_1_NODE,
    PATH_WITH_ATTRIBUTES[0]:NODE_TEXT::VARCHAR AS LVL_1_TEXT,
    PATH_WITH_ATTRIBUTES[0]:SEQNR::VARCHAR AS LVL_1_SEQNR,
    ...
    PATH_WITH_ATTRIBUTES[N]:NODE::VARCHAR AS LVL_N_NODE,
    PATH_WITH_ATTRIBUTES[N]:NODE_TEXT::VARCHAR AS LVL_N_TEXT,
    PATH_WITH_ATTRIBUTES[N]:SEQNR::VARCHAR AS LVL_N_SEQNR,
    LEAF_NODE_ID AS LEAF_NODE_ID
FROM
    HIER_FLAT
WHERE
    HIER_TYPE_ID = '0109';
```
As noted in the introduction, TreeHouse's reference implementation is via Snowflake, the ETL code of which is implemented via user-defined table functions (UDTFs). However, this logic and structure can easily be ported to other platforms. The important point is to understand which business logic is encapsulated in which table functions, so that the logic can be ported in a similarly reusable way (i.e. as dbt models, for example).
Also, the general expectation for deployment is that TreeHouse is orchestrated via user-managed configuration data (similar to the `_CONFIG` temp table populated via 05_SET_CONFIG_DATA) that specifies which hierarchies should be loaded.
How exactly this process should be managed depends on your standards and preferences, but can be as simple as a well-governed CSV that's stored in S3, for example.
The following dependency graph illustrates the basic flow of business logic for transforming data from source tables, i.e. into a target hierarchy staging table*.
```
-- Recurse over hierarchies to calculate additional attributes and denormalize text descriptions.
-- Useful for staging data that can then easily populate target hierarchy table(s), i.e. parent-child *or* level hierarchy models.
UDTF_STG_HIER()
│
│ -- Return structured adjacency lists (parent-child hierarchies), i.e. by stacking "root", "inner" and "leaf" nodes
├─ UDTF_ADJACENCY_LIST()
│  │
│  │ -- Return leaf nodes (i.e. by joining SETLEAF to output of UDTF_NODES_ALL_POSSIBLE)
│  └─ UDTF_LEAF_NODES()
│     │
│     │ -- Return all possible leaf nodes (depending on hierarchy types (master data types) listed in config data)
│     └─ UDTF_NODES_ALL_POSSIBLE()
│
│ -- Get the text descriptions of the hierarchy types listed in config data
├─ UDTF_HIER_TYPE_TEXTS()
│
│ -- Get the text descriptions of the sub-key types listed in config data
├─ UDTF_SUB_KEY_TEXTS()
│
│ -- Get the text descriptions of all hierarchy nodes retrieved for hierarchies listed in config data
└─ UDTF_NODE_TEXTS()
```
*Such a staging table is optional, as `UDTF_STG_HIER()` can be leveraged to populate consumption-layer hierarchy tables directly. However, given that there is more than one potential target data model (i.e. parent-child or flattened), and that the logic to load such tables from this staging table is trivial, it seems to make the most sense to have the TreeHouse codebase ultimately return data in the structure of a staging table, and then demonstrate how to load data from that staging table into each target data model (HIER_FLAT and HIER_PARENT_CHILD) by way of example code in 05_RUN in the demo workflow.
TreeHouse originated as a custom, refactored ETL pipeline for a major movie studio that reduced load runtimes from 1 hour down to 15 seconds. While that optimization was specific to the customer's data volumes (large) and data platform (SAP HANA), the concepts that drove the optimizations are highlighted below.
One key characteristic of TreeHouse is that the logic is entirely declarative. Developers are strongly encouraged to maintain this design decision, as imperative code (i.e. loops, conditionals, etc.) can cause significant decreases in performance when it comes to SQL and related dialects.
As you'll find in the TreeHouse codebase, almost all join conditions are compound joins. Many platforms automatically concatenate and hash the columns on each side of a compound join to accelerate performance, but it's worth validating whether this is the case on your platform (and if not - doing so manually - at least in the case of equijoins).
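Where a platform doesn't do this automatically, the manual equivalent is to derive a single hashed key from the compound columns on each side of the join. A hedged sketch using Python's bundled SQLite with an MD5 user-defined function (table names are hypothetical; on Snowflake you'd use a built-in like `MD5()` or `HASH()` instead):

```python
import sqlite3, hashlib

def hash_key(*parts):
    # Join with a separator so ('AB', 'C') and ('A', 'BC') don't collide
    return hashlib.md5("|".join(parts).encode()).hexdigest()

con = sqlite3.connect(":memory:")
con.create_function("hash_key", -1, hash_key)
con.executescript("""
    CREATE TABLE a (mandt TEXT, setclass TEXT, setname TEXT);
    CREATE TABLE b (mandt TEXT, setclass TEXT, setname TEXT, txt TEXT);
    INSERT INTO a VALUES ('100', '0106', 'P0001');
    INSERT INTO b VALUES ('100', '0106', 'P0001', 'Utilities & Facilities');
""")

# Single hashed join key in place of a three-column compound equijoin
rows = con.execute("""
    SELECT a.setname, b.txt
    FROM a
    JOIN b ON hash_key(a.mandt, a.setclass, a.setname)
            = hash_key(b.mandt, b.setclass, b.setname)
""").fetchall()

print(rows)
```

Note this only applies to equijoins; range joins (like the `BT` case in `SETLEAF`) cannot be hashed this way.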
TreeHouse is designed with a strong emphasis on reusability and separation of concerns. More specifically, TreeHouse introduces and implements the concept of an interface pattern (modeled off of object-oriented interfaces) wherein a logical collection of tables and/or business logic conforms to a given output structure (the "interface") and an input parameter is used to drive conditional control flow, i.e. determine which table is queried and/or which business logic is executed.
For example:
- `UDTF_LEAF_NODES_ALL_POSSIBLE` returns any type of leaf nodes (master data) requested.
- `UDTF_LEAF_NODES` abstracts and encapsulates the complex, dynamic join conditions from `SETLEAF`.
- `UDTF_NODE_TEXTS` and `UDTF_SUB_KEY_TEXTS` return any texts based on the requested type.
Here is pseudocode, showing a much simplified version of `UDTF_NODE_TEXTS()`, demonstrating the concept of the interface pattern:
```sql
CREATE FUNCTION UDTF_NODE_TEXTS(TEXT_TYPE VARCHAR)
RETURNS
    TABLE -- the "interface" that constituent queries must subscribe to
    (
        NODE_ID VARCHAR,
        TEXT VARCHAR
    )
AS
BEGIN
    -- Conditional control flow via input parameter evaluated in WHERE clauses
    RETURN
    SELECT PRCTR AS NODE_ID, KTEXT AS TEXT FROM CEPCT WHERE :TEXT_TYPE = 'PROFIT_CENTER' UNION ALL
    SELECT KOSTL AS NODE_ID, KTEXT AS TEXT FROM CSKT WHERE :TEXT_TYPE = 'COST_CENTER';
END
```
Tip
Conditional control flow can be implemented in at least 3 ways in most SQL dialects:

- `IF/ELSE` statements (imperative)
- `WHERE` clauses (declarative, set-based)
- `CASE` statements and similar (declarative, row-based)

As a general rule, set-based declarative logic is more often than not going to provide the best performance, so in this case that means implementing conditional control flow in `WHERE` clauses.
What's important to point out is that TreeHouse, as written, actually implements the interface pattern with tabular data (i.e. the `_CONFIG` temp table populated in 05_SET_CONFIG_DATA, which is flattened into a tabular structure in many of the UDTFs), which may or may not optimize as efficiently as filtering against a "scalar" input parameter - something you may need to evaluate further.
Filtering against a scalar parameter/variable has the added benefit of aligning well with (declarative) parallel orchestration, discussed next.
The current implementation of TreeHouse on Snowflake leverages implicit parallelization. Multiple hierarchies (as defined in the config data) are submitted for processing at the same time, and Snowflake's query optimizer handles parallelization, however extensive (or limited).
In contrast, the original commercial deployment of TreeHouse orchestrated declarative parallelization, specifying to the database that each and every hierarchy should be generated independently and in parallel. (This was accomplished via the MAP_MERGE() operator on the SAP HANA platform, similar to the `map()` functionality found in many orchestration/ETL tools from the "map/reduce" paradigm.)
Now with Snowflake, theoretically (according to Snowflake support), executing table functions (i.e. the type of object TreeHouse's ETL is currently implemented in) will similarly fan out to parallel execution, when a column of values is passed in as an argument for what's specified as a scalar input parameter. See examples here. That being said, Snowflake's UDTFs have various limitations, one of which is that certain query patterns get compiled into correlated sub-queries, which are not supported. Such is the case with TreeHouse, and thus any such declarative parallelization is not currently available.
That being said, if/when/as developers find a need to improve performance via such declarative parallelization, it shouldn't take too much effort to refactor the code as something like dbt models that can then be parallelized via parallel calls in your orchestration tooling of choice. Dynamic tasks from Astronomer (and Airflow), for example, demonstrate how to orchestrate such parallelization.
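Such explicit fan-out is straightforward to sketch in orchestration code. A hedged illustration (the config tuples and the loader body are hypothetical stand-ins for real per-hierarchy database calls):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical config data: one tuple per hierarchy to load
CONFIG = [
    ("100", "0106", "KW01", "P0001"),
    ("100", "0101", "KW01", "H0001"),
]

def load_hierarchy(cfg):
    client_id, hier_type_id, sub_key_id, hier_id = cfg
    # In a real deployment this would submit a per-hierarchy query, e.g. a call
    # to UDTF_STG_HIER() followed by an INSERT into the staging table.
    return f"loaded {hier_type_id}/{hier_id}"

# Fan out: each hierarchy is processed independently and in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_hierarchy, CONFIG))

print(results)
```

The same shape maps directly onto dynamic task mapping in Airflow-style orchestrators: one mapped task per config tuple.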
Tip
Performance may further improve by refactoring the treatment of config data as tuples, i.e. scalar values, rather than tabular data (by improving the evaluation of branch pruning, for example), which pairs well with explicit parallelization, i.e. where each parallel call passes in a tuple argument.
Reach out to @jchesch if you need further assistance with this.
Following are design decisions that you may need to change when implementing TreeHouse, depending on your requirements, standards, etc.
- Namespace: Note that no specific databases/schemas have been referenced in this code. Update accordingly to align with your namespacing structure.
- Locale Settings: Presumably you have a strategy for managing SAP user & locale settings, i.e. default client, language, etc. You'll need to retrofit TreeHouse to follow your same standards (which currently are managed by the example config data you can find in the test scripts).
- Text Descriptions: TreeHouse denormalizes "short" descriptions from various text tables (e.g. `KTEXT` from `CSKT`, i.e. the short text from the Cost Center text table). If long texts (e.g. `CSKS.LTEXT`) are needed, they'll need to be added to the respective code.
- Time Dependency: Various SAP master data and text tables can support time dependency, i.e. may persist multiple versions of the same record over time, similar to SCD2s. It's rare for SAP customers to manage history of master data records related to hierarchies, but if that's the case for you, then code will need to be added to handle it according to your requirements. Similarly, if history is NOT captured in the SAP source, but changes need to be tracked in your target system, then code will also need to be added to track SCD2s from changes made between ETL loads.
- Semantic Modeling: Hierarchies are generally modeled in a semantic layer for BI consumption. Whether you have such a semantic layer, or are modeling hierarchies through some other means (i.e. database views), you'll need to decide both how to model different joins to different hierarchy types, as well as how to filter to specific hierarchies and sub-keys. Here is an example solution approach:
  - A view is defined for each hierarchy type (i.e. the view definitions provided) and joined to transaction/fact table foreign key fields.
  - Locale settings, i.e. `CLIENT_ID` and `LANG` in config data, are filtered in the physically persisted data (i.e. in STG_HIER).
  - A runtime input parameter is defined in the semantic layer / BI tool for filtering against `HIER_ID` and/or `HIER_ID_TEXT`.
  - A runtime input parameter is defined in the semantic layer / BI tool for filtering against `SUB_KEY_ID` and/or `SUB_KEY_ID_TEXT` (unless this filtering is accomplished by joining to a corresponding field in the transaction/fact table).
Tip
It's important to validate how your semantic model is defined given the risk of incorrect results. If referential integrity is not guaranteed between your transaction/fact table and hierarchy tables, then of course the respective join type should likely be `LEFT OUTER`. That being said, any filters applied to hierarchies (or dimensions) after such a join will render the join as an `INNER` join. In other words, ensure that your hierarchy models (i.e. semantic layer / views) are filtered first, before such `LEFT OUTER` joins are executed, to ensure expected results are returned. This can often be achieved with parameterized database objects (i.e. parameterized views and/or table functions) wherein runtime filters are manually pushed down (depending on your data platform / semantic layer / BI tool).
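The pitfall is easy to reproduce. A minimal sketch with Python's bundled SQLite (hypothetical fact and hierarchy tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact (acct TEXT, amount INTEGER);
    CREATE TABLE hier (leaf TEXT, hier_id TEXT);
    INSERT INTO fact VALUES ('1000', 10), ('9999', 99);  -- '9999' is not in the hierarchy
    INSERT INTO hier VALUES ('1000', 'H1'), ('1000', 'H2');
""")

# Filtering AFTER the join: the NULL-extended row for '9999' fails hier_id = 'H1',
# so the LEFT OUTER join silently behaves like an INNER join
after = con.execute("""
    SELECT f.acct FROM fact f
    LEFT JOIN hier h ON f.acct = h.leaf
    WHERE h.hier_id = 'H1'
    ORDER BY f.acct
""").fetchall()

# Filtering FIRST (pushed into the hierarchy side) preserves unmatched fact rows
before = con.execute("""
    SELECT f.acct FROM fact f
    LEFT JOIN (SELECT * FROM hier WHERE hier_id = 'H1') h ON f.acct = h.leaf
    ORDER BY f.acct
""").fetchall()

print(after, before)
```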
Important
Various other minor design decisions have been made throughout the code based on typical industry SAP usage, data engineering best practices, etc. Nonetheless, and as with adoption of any open source software, ensure you conduct your own due diligence by reviewing the codebase to ensure any/all such decisions (documented in comments) align with your requirements and standards.
Tip
If you need any help implementing any design decisions discussed above, such as modeling hierarchies as slowly changing dimensions, or any other help, feel free to reach out to @jchesch.
Here are the steps you can follow to try out TreeHouse, assuming you have a Snowflake instance available. (If not, get a free trial here.)
Tip
If you're using Snowflake in production and don't yet have DevOps capabilities established, you can automate the execution of DDL/DML scripts via Snowflake DevOps, Snowflake Terraform, or a wide variety of 3rd party open-source or commercial tools.
Run the DDLs made available in 01_SOURCE_DDL.
Note
In order to mitigate any legal risk associated with notoriously challenging license terms from SAP, sample source DDLs with fully synthetic data have been provided. They only include the columns required for TreeHouse.
Load the sample data provided by running the 02_SOURCE_DATA script.
Create the provided target tables and views via the provided DDL scripts. 03_TARGET_DDL points you to the correct location in the repo.
Create the provided table functions via the provided DDL scripts. 04_ETL points you to the correct location in the repo. Make sure to run them in the order provided.
Create a temp table to simulate user-provided config data via 05_SET_CONFIG_DATA.
Check out how the results look by querying the provided views:
Note
In many cases, SAP hierarchies are unbalanced, i.e. not all paths reach the same leaf level. This isn't necessarily a problem, as joins should always be executed against the `LEAF_NODE_ID` column, but some users may expect to see ancestor nodes coalesced down to any `NULL` inner nodes within any such unbalanced paths.
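For users who do expect that presentation, each `NULL` level can be coalesced down from its nearest populated ancestor. A sketch using Python's bundled SQLite with a hypothetical three-level table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lvl_hier (lvl_1 TEXT, lvl_2 TEXT, lvl_3 TEXT, leaf TEXT)")
# Unbalanced path: this branch stops at level 2, so LVL_3 is NULL
con.execute("INSERT INTO lvl_hier VALUES ('Root', 'Ops', NULL, 'C100')")

# Fill each missing level with its nearest populated ancestor
rows = con.execute("""
    SELECT lvl_1,
           COALESCE(lvl_2, lvl_1)        AS lvl_2,
           COALESCE(lvl_3, lvl_2, lvl_1) AS lvl_3,
           leaf
    FROM lvl_hier
""").fetchall()

print(rows)
```

The same cascading `COALESCE` pattern can be added to the level hierarchy views if your BI tool's users expect it.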
TreeHouse can easily be expanded to accommodate any type of hierarchy natively supported by SAP.
Hierarchy types map directly to master data entities. Thus, TreeHouse can easily be extended by:
- Adding additional master data tables to the CTE named `_cte_ALL_POSSIBLE_LEAF_NODES` in `UDTF_LEAF_NODES_ALL_POSSIBLE`.
- Adding additional sub-key text tables to the CTE named `_cte_SUB_KEY_ID_TEXTS` in `UDTF_SUB_KEY_TEXTS`.
- Adding corresponding text tables to the CTE named `_cte_NODE_TEXTS` in `UDTF_NODE_TEXTS`.
Also, note that TreeHouse may (possibly) need to be extended to accommodate additional join conditions specified in `SETLEAF.VALFROM`. See the comments in `UDTF_LEAF_NODES()` for more details.
Tip
Given the common patterns followed by any given master data, text, and/or sub-key table from SAP for populating hierarchies, it's worth noting that much (if not all) of the TreeHouse codebase can be abstracted into metadata-driven code, i.e. where additional hierarchy types, sub-key types, etc. can simply be added to a configuration file, and the target ETL code can thus be auto-generated.
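As a sketch of that idea, here is a hypothetical generator that emits the `UNION ALL` branches of a text-lookup function from a config mapping. The registry below holds only two illustrative entries (the SAP table/column names are examples, not a complete mapping):

```python
# Hypothetical registry: hierarchy type -> (table, key column, text column)
TEXT_SOURCES = {
    "PROFIT_CENTER": ("CEPCT", "PRCTR", "KTEXT"),
    "COST_CENTER":   ("CSKT",  "KOSTL", "KTEXT"),
}

def generate_node_texts_sql(sources):
    # Emit one UNION ALL branch per configured type, each conforming to the
    # same output "interface" (NODE_ID, TEXT)
    branches = [
        f"SELECT {key_col} AS NODE_ID, {text_col} AS TEXT FROM {table}\n"
        f"WHERE :TEXT_TYPE = '{text_type}'"
        for text_type, (table, key_col, text_col) in sources.items()
    ]
    return "\nUNION ALL\n".join(branches)

sql = generate_node_texts_sql(TEXT_SOURCES)
print(sql)
```

Adding a new hierarchy type then becomes a one-line config change rather than a code change, with the DDL regenerated and redeployed by tooling.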