TreeHouse™ is an open source data product that models, transforms and flattens hierarchical data (i.e. tree structures) for data warehouse / lakehouse systems, primarily to simplify and accelerate BI and reporting use cases that require hierarchies (e.g. financial reporting).
TreeHouse is currently scoped to SAP hierarchies in particular.
The TreeHouse project is an independent open-source initiative and is not affiliated with, endorsed, or sponsored by SAP SE or its affiliates. SAP® and other SAP product names are trademarks or registered trademarks of SAP SE (or its affiliates) in Germany and other countries.
The need for a product like TreeHouse arises from the fact that SAP's pre-packaged data warehouse content, including hierarchy data models and ETL code, is only available from SAP's integrated data warehouse products, i.e. Business Warehouse and Datasphere. Thus, SAP customers integrating SAP hierarchy data into 3rd party data platforms such as Snowflake or Databricks (or even stand-alone products like SAP HANA) are forced to reverse engineer the exact same logic that's been solved for years in SAP's own solutions.
TreeHouse thus solves this particular need, providing an end-to-end, fully open source data product for modeling and transforming SAP hierarchies for any target SQL-based (or even MDX-based) data platform.
Developers implementing TreeHouse should be familiar with:
- 3NF and dimensional data modeling
- hierarchies* (and related terminology)
- intermediate SQL (including recursive common table expressions (CTEs))
While not required, it is helpful for developers to have familiarity with the SAP GUI, ABAP Data Dictionary, and helpful online resources such as se80.co.uk.
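Since recursive CTEs are the workhorse of hierarchy processing, here is a minimal, self-contained sketch of the technique using Python's bundled SQLite. The table and node names are illustrative only (not TreeHouse objects); it flattens a tiny parent-child table into root-to-leaf paths:

```python
import sqlite3

# Illustrative parent-child table (names are hypothetical, not TreeHouse objects)
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (parent TEXT, child TEXT);
    INSERT INTO edges VALUES
        (NULL, 'P0001'),                -- root node has no parent
        ('P0001', 'C1000004'),
        ('C1000004', 'C100000403');
""")

# Recurse from the root downward, accumulating each node's path and level
rows = con.execute("""
    WITH RECURSIVE hier (node, path, level) AS (
        SELECT child, child, 1 FROM edges WHERE parent IS NULL
        UNION ALL
        SELECT e.child, h.path || '/' || e.child, h.level + 1
        FROM edges e
        JOIN hier h ON e.parent = h.node
    )
    SELECT node, path, level FROM hier ORDER BY level
""").fetchall()

print(rows[-1])  # deepest node with its full root-to-leaf path
```

The same anchor-member/recursive-member shape carries over to Snowflake and most other platforms, with only dialect-level differences.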
Tip
Most LLMs have a relatively strong understanding of the SAP data model and SQL (including recursive CTEs), and thus can generally be leveraged to help answer questions related to TreeHouse. But you're also welcome to reach out to @jchesch with any questions.
Tip
Developers unfamiliar with hierarchies may benefit from this 6-part series from Snap Analytics.
The goals of TreeHouse are three-fold:
1. Semantic model: capture and share the end-to-end ETL business logic and data models required for transforming and persisting SAP hierarchies that can easily be implemented in any legacy or modern data stack.
2. Simplicity: modularize the hierarchy "meta-model", drastically simplifying the overall codebase.
3. Performance: share parallelizable orchestration logic that can maximize performance.
TreeHouse is currently implemented in Snowflake in order to provide the community with a reference implementation on one of the more widely adopted platforms (assuming a platform-specific implementation is more helpful than generic ISO/ANSI SQL code).
Note
TreeHouse is written primarily in standard SQL on Snowflake, making it easily portable to other platforms. To try out TreeHouse, spin up a free trial of Snowflake and check out the Try It Out section.
That being the case, TreeHouse is NOT particularly opinionated about:

- Data platform (Snowflake, Databricks, BigQuery, Redshift, etc.)
- Cross-cutting concerns (input validation, exception handling, logging, orchestration, namespaces, etc.)
- Implementation (UDTFs, stored procedures, dbt/SQLMesh models, language, etc.)
- Physical data model (i.e. physical tables vs. logical views)
Such decisions and concerns should be addressed via customer-specific standards.
Tip
TreeHouse provides hierarchy data models for both SQL and MDX consumption. Worth highlighting is Cube.dev's MDX API, which can help bridge the gap between MDX front-ends and SQL back-ends for "hybrid" BI stacks.
The standard ABAP data dictionary entries for SAP hierarchy tables are rather lacking. Thus, a custom data dictionary has been provided below for the core hierarchy tables from SAP (`SETHEADER`, `SETNODE` and `SETLEAF`).
Note
The terms "hierarchy type" and "master data type" are used somewhat interchangeably, since hierarchy types correspond directly with master data types. For example, a "profit center hierarchy" corresponds with "profit center master data".
Note that only those columns relevant to hierarchy processing are documented below. (See the ABAP data dictionary, or sites like se80.co.uk, for descriptions of additional columns if/as needed.)
`SETHEADER` can be thought of as master data for all root and inner nodes. In the context of SAP hierarchies, it's only really needed for retrieving root nodes. (Inner nodes need to come from `SETNODE` due to additional required attributes that are not found in `SETHEADER`.)
| Primary Key | Column Name | Alias | Description |
|---|---|---|---|
| X | `MANDT` | `CLIENT_ID` | SAP Client ID (unique per tenant) |
| X | `SETCLASS` | `HIER_TYPE_ID` | Hierarchy type ID (e.g. `'0106'` represents "Profit Center Group", `'0101'` represents "Cost Center Group", etc.). See lookup table `SETCLST` for more. |
| X | `SUBCLASS` | `SUB_KEY_ID` | The second field of standard SAP unique identifiers (compound primary keys), where relevant. It corresponds with compound key fields like `KTOPL` (Chart of Accounts), `KOKRS` (Controlling Area), etc. from various master data tables. |
| X | `SETNAME` | `NODE_ID` | Unique identifier for each SAP hierarchy node (excluding leaf nodes). Sometimes aliased as `CHILD_NODE_ID` to match other data sources when `UNION`ed. |
|  | `SETTYPE` |  | Four possible values used for filtering: `S` - single dimension, e.g. "Profit Center". `D` - data, i.e. key figure sets; out of scope for TreeHouse. `B` - basic set; points directly to leaf nodes. `M` - multidimensional; out of scope for TreeHouse. |
`SETNODE` stores parent-child relationships for inner nodes.
| Primary Key | Column Name | Alias | Description |
|---|---|---|---|
| X | `MANDT` | `CLIENT_ID` | SAP Client ID (unique per tenant) |
| X | `SETCLASS` | `HIER_TYPE_ID` | Hierarchy type ID (e.g. `'0106'` represents "Profit Center Group", `'0101'` represents "Cost Center Group", etc.). See lookup table `SETCLST` for more. |
| X | `SUBCLASS` | `SUB_KEY_ID` | The second field of standard SAP unique identifiers (compound primary keys), where relevant. It corresponds with compound key fields like `KTOPL` (Chart of Accounts), `KOKRS` (Controlling Area), etc. |
| X | `SETNAME` | `NODE_ID` | Unique identifier for each SAP hierarchy node (excluding leaf nodes). In the context of `SETNODE`, it is the "parent" node, whereas `SUBSETNAME` below is the "child" node. |
| X | `LINEID` |  | Not used in the context of hierarchies. Can cause duplicates and multi-parent hierarchies, which need to be managed carefully. See Data Quality discussion below. |
|  | `SUBSETCLS` |  | Irrelevant for TreeHouse. See Data Quality discussion below. |
|  | `SUBSETSCLS` |  | Irrelevant for TreeHouse. See Data Quality discussion below. |
|  | `SUBSETNAME` | `CHILD_NODE_ID` | Child `NODE_ID` for a given parent/child relationship in `SETNODE`, i.e. `SETNAME`/`SUBSETNAME`. |
|  | `SEQNR` |  | Sequence number capturing the sort order of a given node (i.e. amongst its siblings) |
`SETLEAF` stores relationships between "penultimate" (second-to-last) nodes and leaf nodes. The historical motivation for this design from SAP is saving on storage, i.e. storing a single record that provides wildcard values and/or ranges that can join to many right-hand records from master data tables.
| Primary Key | Column Name | Alias | Description |
|---|---|---|---|
| X | `MANDT` | `CLIENT_ID` | SAP Client ID (unique per tenant) |
| X | `SETCLASS` | `HIER_TYPE_ID` | Hierarchy type ID (e.g. `'0106'` represents "Profit Center Group", `'0101'` represents "Cost Center Group", etc.). See lookup table `SETCLST` for more. |
| X | `SUBCLASS` | `SUB_KEY_ID` | The second field of standard SAP unique identifiers (compound primary keys), where relevant. It corresponds with compound key fields like `KTOPL` (Chart of Accounts), `KOKRS` (Controlling Area), etc. |
| X | `SETNAME` | `NODE_ID` | Unique identifier for each SAP hierarchy node (excluding leaf nodes). Also aliased as `CHILD_NODE_ID` in the context of hierarchical relationships. |
| X | `LINEID` |  | Not used in the context of hierarchies. Can cause duplicates and multi-parent hierarchies, which need to be managed carefully. See Data Quality discussion below. |
|  | `VALOPTION` |  | Join condition type (`EQ`, `BT`, `CP`). |
|  | `VALFROM` | `CHILD_NODE_ID_FROM` | Leaf node (retrieved based on dynamic `INNER JOIN` logic determined by `VALOPTION`). |
|  | `VALTO` | `CHILD_NODE_ID_TO` | Used in range joins when `VALOPTION = 'BT'`, i.e. `... INNER JOIN x ON id BETWEEN CHILD_NODE_ID_FROM AND CHILD_NODE_ID_TO ...` |
|  | `SEQNR` |  | Specifies sort order of sibling nodes. |
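To make the `VALOPTION`-driven join concrete, here is a hedged sketch using Python's bundled SQLite, with simplified column names and made-up sample values. (The `CP` pattern option is omitted, since SAP's wildcard syntax would first need translating into SQL `LIKE` patterns.)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE setleaf (valoption TEXT, valfrom TEXT, valto TEXT);
    CREATE TABLE master  (id TEXT);
    INSERT INTO setleaf VALUES ('EQ', 'P100', ''),       -- exact match
                               ('BT', 'P200', 'P299');   -- range ("between") match
    INSERT INTO master  VALUES ('P100'), ('P205'), ('P300');
""")

# One join whose condition is selected dynamically per record via VALOPTION
leaves = con.execute("""
    SELECT m.id, l.valoption
    FROM setleaf l
    INNER JOIN master m
        ON (l.valoption = 'EQ' AND m.id = l.valfrom)
        OR (l.valoption = 'BT' AND m.id BETWEEN l.valfrom AND l.valto)
    ORDER BY m.id
""").fetchall()

print(leaves)  # 'P300' matches no SETLEAF record, so it is not a leaf here
```

This illustrates how a single `SETLEAF` record can fan out to many master data records, which is exactly the storage-saving design described above.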
For unknown reasons, SAP's data model enforces uniqueness on a technical auto-incrementing column called `LINEID` rather than on the expected columns, i.e. the child node columns (`SETNODE.SUBSETNAME` and `SETLEAF.VALFROM`). This can lead to either duplicates or multi-parent scenarios, illustrated here with two simple examples:
(In this case, `SETNAME` is the parent node, and `SUBSETNAME` is the child node, as documented above.)
Generally speaking, the expected solution is to have data stewards in charge of SAP hierarchies fix problematic data in SAP, and then to re-process. Of course, you'll need to ensure some combination of:
- unique constraints,
- data quality checks, and/or
- exception handling
to identify such data quality issues before propagating to target tables.
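Such checks can be expressed as simple aggregate queries. A minimal sketch (Python's bundled SQLite, hypothetical sample values) that flags both duplicate edges and multi-parent children in a `SETNODE`-shaped table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE setnode (setname TEXT, subsetname TEXT, lineid INTEGER);
    INSERT INTO setnode VALUES
        ('A', 'X', 1), ('A', 'X', 2),   -- duplicate edge (same parent/child, new LINEID)
        ('B', 'Y', 1), ('C', 'Y', 1);   -- multi-parent: node Y appears under B and C
""")

# Duplicate parent/child edges that would double-count facts downstream
dupes = con.execute("""
    SELECT setname, subsetname, COUNT(*) AS n
    FROM setnode GROUP BY setname, subsetname HAVING COUNT(*) > 1
""").fetchall()

# Children claimed by more than one parent (multi-parent hierarchy)
multi_parent = con.execute("""
    SELECT subsetname, COUNT(DISTINCT setname) AS n_parents
    FROM setnode GROUP BY subsetname HAVING COUNT(DISTINCT setname) > 1
""").fetchall()

print(dupes, multi_parent)
```

Either query returning rows is a signal to halt the load (or route the offending records to an exception table) before target tables are populated.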
SAP supports the management of "heterogeneous" hierarchies in which different nodes are fundamentally distinct, i.e. represent different types of entities. For example, a "bill of materials" (BOM) hierarchy models the relationship of a product and all of its constituent components and sub-components, each of which may have drastically different attributes. As such, analytics against "heterogeneous" hierarchies can be fraught with challenges, as there's not a shared data model amongst entities represented by each node.
Such heterogeneity is represented via the following SQL pseudocode against the table `SETNODE` (note that the `WHERE` clause repeats the source columns, since most SQL dialects don't allow referencing `SELECT` aliases there):

```sql
SELECT
    ...,
    SETCLASS   AS PARENT_NODE_HIER_TYPE_ID,
    SUBCLASS   AS PARENT_NODE_SUB_KEY_ID,
    SUBSETCLS  AS CHILD_NODE_HIER_TYPE_ID,
    SUBSETSCLS AS CHILD_NODE_SUB_KEY_ID
FROM
    SETNODE
WHERE
    SETCLASS != SUBSETCLS OR  -- different hierarchy types between levels
    SUBCLASS != SUBSETSCLS    -- different sub-keys between levels
```
For most use cases that leverage SAP hierarchies, such as financial reporting, a reasonable assumption is that only "homogeneous" hierarchies are in scope (i.e. where every node corresponds with the same type of entity), though this should be confirmed with the business and profiled against the source data.
If there is a sufficient risk of heterogeneous hierarchies in your environment, then you'll likely want to introduce some combination of:
- unique constraints,
- data quality checks, and/or
- exception handling
to identify such data quality issues before propagating to target tables, and/or to load such heterogeneous hierarchies into a separate model (out of scope of TreeHouse).
Tip
If you have a need to persist such heterogeneous hierarchies in your target system and need help refactoring the code and data model accordingly, feel free to reach out to @jchesch.
The following data dictionary describes the table(s) and views available for modeling hierarchies in TreeHouse. As previously discussed, the emphasis in TreeHouse is on flattening hierarchies into a level structure, which practically is accomplished by populating the HIER_FLAT table and then parsing out level columns in the various views made available. For MDX consumption, HIER_PARENT_CHILD is available as a corresponding parent-child model.
Note
As noted previously, your needs might justify physically persisting what TreeHouse logically models (as views / UDTFs), or logically modeling what TreeHouse persists as physical tables.
Important
Throughout this documentation (and even in the code comments), the terms "flat" (or flatten) and "level" are used somewhat interchangeably, since they both represent transformations that are required to populate a target hierarchy data model that can be easily queried via SQL. However, these two terms are not exactly synonymous.
The source of the ambiguity stems from the analogy of "flattening". When we think of "flattening", we tend to think of taking something tall (and skinny), and making it shorter and wider. In the code, these are actually treated as two operations.
- The "shortening" aspect of flattening, i.e. reducing the record count, takes place when data is filtered down only to leaf-level records, i.e. the filter `WHERE NODE_TYPE = 'LEAF'` in the demo code provided in 05_RUN.
- The "widening" aspect of flattening, i.e. extending the model with more columns, takes place when ancestor nodes are parsed out into dedicated columns for each level, i.e. the array parsing logic (e.g. `PATH_WITH_ATTRIBUTES[0]:NODE::VARCHAR AS LVL_1_NODE`) of the level hierarchy views such as ACCOUNT_HIER.
Efforts have been made to clarify the use of these terms, but there may still be some latent ambiguity in how they're used.
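The two operations can be sketched end-to-end with Python's bundled SQLite against a hypothetical mini-table. (The widening step is done in Python here, rather than with Snowflake's `[N]:`/`::` path syntax, to keep the sketch self-contained.)

```python
import sqlite3, json

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hier_flat (node_type TEXT, path_with_attributes TEXT)")
path = json.dumps([
    {"LEVEL": 1, "NODE": "P0001",    "NODE_TEXT": "Utilities & Facilities"},
    {"LEVEL": 2, "NODE": "C1000004", "NODE_TEXT": "Facilities"},
])
con.execute("INSERT INTO hier_flat VALUES ('INNER', ?), ('LEAF', ?)", (path, path))

# "Shorten": keep only leaf-level records
leaf_rows = con.execute(
    "SELECT path_with_attributes FROM hier_flat WHERE node_type = 'LEAF'"
).fetchall()

# "Widen": parse each ancestor in the path array into a dedicated level column
widened = []
for (raw_path,) in leaf_rows:
    nodes = json.loads(raw_path)
    widened.append({f"LVL_{n['LEVEL']}_NODE": n["NODE"] for n in nodes})

print(widened)
```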
Following is a high-level data dictionary for the TreeHouse data model, including tables and views.
Descriptions below are short, i.e. for reference. Longer table and column descriptions can be found in the `COMMENTS` of the associated DDL scripts.
`STG_HIER` is an optional staging table that can easily be leveraged for populating either/both of the available target tables (HIER_FLAT and/or HIER_PARENT_CHILD) as demoed in the test/demo script 06_RUN in the Try It Out section.
| Primary Key | Column Name | Description | Sample Value |
|---|---|---|---|
| X | `CLIENT_ID` | Client ID, i.e. tenant identifier in multi-tenant systems (the `MANDT` field from SAP) | 100 |
| X | `HIER_TYPE_ID` | Hierarchy type ID | 0106 |
|  | `HIER_TYPE_TEXT` | Hierarchy type description | Profit Center Groups |
| X | `HIER_ID` | Hierarchy root node ID | P0001 |
|  | `HIER_TEXT` | Hierarchy description | Utilities & Facilities |
| X | `SUB_KEY_ID` | Sub-key ID | KW01 |
|  | `SUB_KEY_TEXT` | Sub-key description | Controlling Area KW01 |
|  | `PARENT_NODE_ID` | Parent Node ID | C1000004 |
| X | `CHILD_NODE_ID` | (Child) Node ID | C100000403 |
|  | `CHILD_NODE_TEXT` | (Child) Node ID text description | Quality Control Lab |
|  | `LEVEL` | Which hierarchy level this node is on | 3 |
|  | `SEQNR` | Sequence number describing sort order of this node amongst sibling nodes | 5 |
|  | `NODE_TYPE` | Categorizes nodes as ROOT, INNER or LEAF | INNER |
|  | `PATH_WITH_ATTRIBUTES` | Path (array) of nodes (JSON objects) including text, level and seqnr attributes | `[{"LEVEL":1, "NODE":"P0001", "NODE_TEXT": "Utilities & Facilities", "SEQNR": 0 }, ... ]` |
Flattened hierarchy table, intended for logical and/or physical parsing into a fully-fledged level hierarchy model, i.e. for SQL consumption. A given record represents a leaf node and its associated unique identifier columns and descriptions, along with its full path (ancestor nodes) and associated attributes in an array of JSON objects.
| Primary Key | Column Name | Description | Sample Value |
|---|---|---|---|
| X | `CLIENT_ID` | Client ID, i.e. tenant identifier in multi-tenant systems (the `MANDT` field from SAP) | 100 |
| X | `HIER_TYPE_ID` | Hierarchy type ID | 0106 |
|  | `HIER_TYPE_TEXT` | Hierarchy type text | Profit Center Groups |
| X | `HIER_ID` | Hierarchy root node ID | P0001 |
|  | `HIER_TEXT` | Hierarchy description | Utilities & Facilities |
| X | `SUB_KEY_ID` | Sub-key ID | KW01 |
|  | `SUB_KEY_TEXT` | Sub-key text | Controlling Area KW01 |
|  | `PATH_WITH_ATTRIBUTES` | JSON array of node-level attributes | `[{"LEVEL":1, "NODE":"P0001", "NODE_TEXT": "Utilities & Facilities", "SEQNR": 0 }, ... ]` |
| X | `LEAF_NODE_ID` | Leaf node ID, used as join key | P000103045 |
Parent-child hierarchy table, generally for consumption by MDX and/or related dialects.
| Primary Key | Column Name | Description | Sample Value |
|---|---|---|---|
| X | `CLIENT_ID` | Client ID, i.e. tenant identifier in multi-tenant systems (the `MANDT` field from SAP) | 100 |
| X | `HIER_TYPE_ID` | Hierarchy type ID | 0106 |
|  | `HIER_TYPE_TEXT` | Hierarchy type description | Profit Center Groups |
| X | `HIER_ID` | Hierarchy root node ID | P0001 |
|  | `HIER_TEXT` | Hierarchy description | Utilities & Facilities |
| X | `SUB_KEY_ID` | Sub-key ID | KW01 |
|  | `SUB_KEY_TEXT` | Sub-key text | Controlling Area KW01 |
|  | `PARENT_NODE_ID` | Parent Node ID |  |
| X | `CHILD_NODE_ID` | (Child) Node ID |  |
|  | `CHILD_NODE_ID_TEXT` | Node description |  |
|  | `SEQNR` | Sequence number describing sort order of this node amongst sibling nodes | 5 |
|  | `LEVEL` | Which hierarchy level this node is on | 3 |
|  | `NODE_TYPE` | Categorizes nodes as ROOT, INNER or LEAF | INNER |
Note
When MDX dialects recursively traverse a parent-child hierarchy, they only consider attributes tied to the "child" node, which is why you don't see `PARENT_NODE_ID_TEXT` in the definition of `HIER_PARENT_CHILD` (or any other attributes tied to parent nodes, such as `PARENT_LEVEL` or `PARENT_SEQNR`).
User-facing consumption views for hierarchies, intended for semantic layer / BI consumption, all follow the same structure, with different filters for different hierarchy types. Included in TreeHouse's test data and test consumption views, all easily extendable to other hierarchy types, are:
Following is a slightly generic representation of the `ACCOUNT_HIER` view to demonstrate the basic concept, i.e. show how level columns and associated attributes are parsed out from the HIER_FLAT table along with the requisite hierarchy type filter (i.e. `HIER_TYPE_ID = '0109'`).
Important
As defined below, this view would require a runtime filter to determine which account hierarchy a user wants to see. (Typically this is passed in from a UI prompt that the user responds to.) Be careful about when/how such a filter is applied, as it can lead to unexpected behavior. Further discussion here.
```sql
CREATE OR REPLACE VIEW ACCOUNT_HIER
AS
SELECT
    HIER_TYPE_ID AS HIER_TYPE_ID,
    HIER_TYPE_TEXT AS HIER_TYPE_TEXT,
    SUB_KEY_ID AS CHART_OF_ACCOUNTS, -- Alias SUB_KEY_ID back to its more meaningful business description
    SUB_KEY_TEXT AS CHART_OF_ACCOUNTS_TEXT,
    HIER_ID AS HIERARCHY_ID,
    HIER_TEXT AS HIERARCHY_TEXT,
    PATH_WITH_ATTRIBUTES[0]:NODE::VARCHAR AS LVL_1_NODE,
    PATH_WITH_ATTRIBUTES[0]:NODE_TEXT::VARCHAR AS LVL_1_TEXT,
    PATH_WITH_ATTRIBUTES[0]:SEQNR::VARCHAR AS LVL_1_SEQNR,
    ...
    PATH_WITH_ATTRIBUTES[N]:NODE::VARCHAR AS LVL_N_NODE,
    PATH_WITH_ATTRIBUTES[N]:NODE_TEXT::VARCHAR AS LVL_N_TEXT,
    PATH_WITH_ATTRIBUTES[N]:SEQNR::VARCHAR AS LVL_N_SEQNR,
    LEAF_NODE_ID AS LEAF_NODE_ID
FROM
    HIER_FLAT
WHERE
    HIER_TYPE_ID = '0109';
```
As noted in the introduction, TreeHouse's reference implementation is via Snowflake, the ETL code of which is implemented via user-defined table functions (UDTFs). However, this logic and structure can easily be ported to other platforms. The important point is to understand which business logic is encapsulated in which table functions, so that the logic can be ported in a similarly reusable way (i.e. as dbt models, for example).
Also, the general expectation for deployment is that TreeHouse is orchestrated via user-managed configuration data (similar to the `_CONFIG` temp table populated via 05_SET_CONFIG_DATA) that specifies which hierarchies should be loaded.
How exactly this process should be managed depends on your standards and preferences, but can be as simple as a well-governed CSV that's stored in S3, for example.
The following dependency graph illustrates the basic flow of business logic for transforming data from source tables, i.e. into a target hierarchy staging table*.
```
-- Recurse over hierarchies to calculate additional attributes and denormalize text descriptions.
-- Useful for staging data that can then easily populate target hierarchy table(s), i.e. parent-child *or* level hierarchy models.
UDTF_STG_HIER()
│
│ -- Return structured adjacency lists (parent-child hierarchies), i.e. by stacking "root", "inner" and "leaf" nodes
├─ UDTF_ADJACENCY_LIST()
│  │
│  │ -- Return leaf nodes (i.e. by joining SETLEAF to output of UDTF_NODES_ALL_POSSIBLE)
│  └─ UDTF_LEAF_NODES()
│     │
│     │ -- Return all possible leaf nodes (depending on hierarchy types (master data types) listed in config data)
│     └─ UDTF_NODES_ALL_POSSIBLE()
│
│ -- Get the text descriptions of the hierarchy types listed in config data
├─ UDTF_HIER_TYPE_TEXTS()
│
│ -- Get the text descriptions of the sub-key types listed in config data
├─ UDTF_SUB_KEY_TEXTS()
│
│ -- Get the text descriptions of all hierarchy nodes retrieved for hierarchies listed in config data
└─ UDTF_NODE_TEXTS()
```
*Such a staging table is optional, as `UDTF_STG_HIER()` can be leveraged to populate consumption-layer hierarchy tables directly. However, given that there is more than one potential target data model (i.e. parent-child or flattened), and that the logic to load such tables from this staging table is trivial, it seems to make the most sense to have the TreeHouse codebase ultimately return data in the structure of a staging table, and then demonstrate how to load data from that staging table into each target data model (HIER_FLAT and HIER_PARENT_CHILD) by way of example code in 05_RUN in the demo workflow.
TreeHouse originated as a custom, refactored ETL pipeline for a major movie studio that reduced load runtimes from 1 hour down to 15 seconds. While that optimization was specific to the customer's data volumes (large) and data platform (SAP HANA), the concepts that drove the optimizations are highlighted below.
One key characteristic of TreeHouse is that the logic is entirely declarative. Developers are strongly encouraged to maintain this design decision, as imperative code (i.e. loops, conditionals, etc.) can cause significant decreases in performance when it comes to SQL and related dialects.
As you'll find in the TreeHouse codebase, almost all join conditions are compound joins. Many platforms automatically concatenate and hash the columns on each side of a compound join to accelerate performance, but it's worth validating whether this is the case on your platform (and if not - doing so manually - at least in the case of equijoins).
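Where a platform doesn't do this automatically, the manual equivalent is to derive a single hashed key from the compound columns on each side of the join. A hedged sketch using Python's bundled SQLite with an MD5 user-defined function (table names are hypothetical; on Snowflake you'd use a built-in like `MD5()` or `HASH()` instead):

```python
import sqlite3, hashlib

def hash_key(*parts):
    # Join with a separator so ('AB', 'C') and ('A', 'BC') don't collide
    return hashlib.md5("|".join(parts).encode()).hexdigest()

con = sqlite3.connect(":memory:")
con.create_function("hash_key", -1, hash_key)
con.executescript("""
    CREATE TABLE a (mandt TEXT, setclass TEXT, setname TEXT);
    CREATE TABLE b (mandt TEXT, setclass TEXT, setname TEXT, txt TEXT);
    INSERT INTO a VALUES ('100', '0106', 'P0001');
    INSERT INTO b VALUES ('100', '0106', 'P0001', 'Utilities & Facilities');
""")

# Single hashed join key in place of a three-column compound equijoin
rows = con.execute("""
    SELECT a.setname, b.txt
    FROM a
    JOIN b ON hash_key(a.mandt, a.setclass, a.setname)
            = hash_key(b.mandt, b.setclass, b.setname)
""").fetchall()

print(rows)
```

Note this only applies to equijoins; range joins (like the `BT` case in `SETLEAF`) cannot be hashed this way.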
TreeHouse is designed with a strong emphasis on reusability and separation of concerns. More specifically, TreeHouse introduces and implements the concept of an interface pattern (modeled off of object-oriented interfaces) wherein a logical collection of tables and/or business logic conforms to a given output structure (the "interface") and an input parameter is used to drive conditional control flow, i.e. determine which table is queried and/or which business logic is executed.
For example:
- `UDTF_LEAF_NODES_ALL_POSSIBLE` returns any type of leaf nodes (master data) requested.
- `UDTF_LEAF_NODES` abstracts and encapsulates the complex, dynamic join conditions from `SETLEAF`.
- `UDTF_NODE_TEXTS` and `UDTF_SUB_KEY_TEXTS` return any texts based on the requested type.
Here is pseudocode, showing a much simplified version of `UDTF_NODE_TEXTS()`, demonstrating the concept of the interface pattern:
```sql
CREATE FUNCTION UDTF_NODE_TEXTS(TEXT_TYPE VARCHAR)
RETURNS
    TABLE -- the "interface" that constituent queries must subscribe to
    (
        NODE_ID VARCHAR,
        TEXT VARCHAR
    )
AS
BEGIN
    -- Conditional control flow via input parameter evaluated in WHERE clauses
    RETURN
    SELECT PRCTR AS NODE_ID, KTEXT AS TEXT FROM CEPCT WHERE :TEXT_TYPE = 'PROFIT_CENTER' UNION ALL
    SELECT KOSTL AS NODE_ID, KTEXT AS TEXT FROM CSKT WHERE :TEXT_TYPE = 'COST_CENTER';
END
```
Tip
Conditional control flow can be implemented in at least 3 ways in most SQL dialects:

- `IF/ELSE` statements (imperative)
- `WHERE` clauses (declarative, set-based)
- `CASE` statements and similar (declarative, row-based)

As a general rule, set-based declarative logic is more often than not going to provide the best performance, so in this case that means implementing conditional control flow in `WHERE` clauses.
What's important to point out is that TreeHouse, as written, actually implements the interface pattern with tabular data (i.e. the `_CONFIG` temp table populated in 05_SET_CONFIG_DATA, which is flattened into a tabular structure in many of the UDTFs), which may or may not optimize as efficiently as filtering against a "scalar" input parameter - something you may need to evaluate further.
Filtering against a scalar parameter/variable has the added benefit of aligning well with (declarative) parallel orchestration, discussed next.
The current implementation of TreeHouse on Snowflake leverages implicit parallelization. Multiple hierarchies (as defined in the config data) are submitted for processing at the same time, and Snowflake's query optimizer handles parallelization, however extensive (or limited).
In contrast, the original commercial deployment of TreeHouse orchestrated declarative parallelization, specifying to the database that each and every hierarchy should be generated independently and in parallel. (This was accomplished via the MAP_MERGE() operator on the SAP HANA platform, similar to the `map()` functionality found in many orchestration/ETL tools from the "map/reduce" paradigm.)
Now with Snowflake, theoretically (according to Snowflake support), executing table functions (i.e. the type of object TreeHouse's ETL is currently implemented in) will similarly fan out to parallel execution, when a column of values is passed in as an argument for what's specified as a scalar input parameter. See examples here. That being said, Snowflake's UDTFs have various limitations, one of which is that certain query patterns get compiled into correlated sub-queries, which are not supported. Such is the case with TreeHouse, and thus any such declarative parallelization is not currently available.
That being said, if/when/as developers find a need to improve performance via such declarative parallelization, it shouldn't take too much effort to refactor the code as something like dbt models that can then be parallelized via parallel calls in your orchestration tooling of choice. Dynamic tasks from Astronomer (and Airflow), for example, demonstrate how to orchestrate such parallelization.
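Such explicit fan-out is straightforward to sketch in orchestration code. A hedged illustration (the config tuples and the loader body are hypothetical stand-ins for real per-hierarchy database calls):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical config data: one tuple per hierarchy to load
CONFIG = [
    ("100", "0106", "KW01", "P0001"),
    ("100", "0101", "KW01", "H0001"),
]

def load_hierarchy(cfg):
    client_id, hier_type_id, sub_key_id, hier_id = cfg
    # In a real deployment this would submit a per-hierarchy query, e.g. a call
    # to UDTF_STG_HIER() followed by an INSERT into the staging table.
    return f"loaded {hier_type_id}/{hier_id}"

# Fan out: each hierarchy is processed independently and in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_hierarchy, CONFIG))

print(results)
```

The same shape maps directly onto dynamic task mapping in Airflow-style orchestrators: one mapped task per config tuple.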
Tip
Performance may further improve by refactoring the treatment of config data as tuples, i.e. scalar values, rather than tabular data (by improving the evaluation of branch pruning, for example), which pairs well with explicit parallelization, i.e. where each parallel call passes in a tuple argument.
Reach out to @jchesch if you need further assistance with this.
Following are design decisions that you may need to change when implementing TreeHouse, depending on your requirements, standards, etc.
- Namespace: Note that no specific databases/schemas have been referenced in this code. Update accordingly to align with your namespacing structure.
- Locale Settings: Presumably you have a strategy for managing SAP user & locale settings, i.e. default client, language, etc. You'll need to retrofit TreeHouse to follow your same standards (which currently are managed by the example config data you can find in the test scripts).
- Text Descriptions: TreeHouse denormalizes "short" descriptions from various text tables (e.g. `KTEXT` from `CSKT`, i.e. the short text from the Cost Center text table). If long texts (e.g. `CSKS.LTEXT`) are needed, they'll need to be added to the respective code.
- Time Dependency: Various SAP master data and text tables can support time dependency, i.e. may persist multiple versions of the same record over time, similar to SCD2s. It's rare for SAP customers to manage history of master data records related to hierarchies, but if that's the case for you, then code will need to be added to handle it according to your requirements. Similarly, if history is NOT captured in the SAP source, but changes need to be tracked in your target system, then code will also need to be added to track SCD2s from changes made between ETL loads.
- Semantic Modeling: Hierarchies are generally modeled in a semantic layer for BI consumption. Whether you have such a semantic layer, or are modeling hierarchies through some other means (i.e. database views), you'll need to decide both how to model different joins to different hierarchy types, as well as how to filter to specific hierarchies and sub-keys. Here is an example solution approach:
  - A view is defined for each hierarchy type (i.e. the view definitions provided) and joined to transaction/fact table foreign key fields.
  - Locale settings, i.e. `CLIENT_ID` and `LANG` in config data, are filtered in the physically persisted data (i.e. in STG_HIER).
  - A runtime input parameter is defined in the semantic layer / BI tool for filtering against `HIER_ID` and/or `HIER_ID_TEXT`.
  - A runtime input parameter is defined in the semantic layer / BI tool for filtering against `SUB_KEY_ID` and/or `SUB_KEY_ID_TEXT` (unless this filtering is accomplished by joining to a corresponding field in the transaction/fact table).
Tip
It's important to validate how your semantic model is defined given the risk of incorrect results. If referential integrity is not guaranteed between your transaction/fact table and hierarchy tables, then of course the respective join type should likely be `LEFT OUTER`. That being said, any filters applied to hierarchies (or dimensions) after such a join will render the join as an `INNER` join. In other words, ensure that your hierarchy models (i.e. semantic layer / views) are filtered first, before such `LEFT OUTER` joins are executed, to ensure expected results are returned. This can often be achieved with parameterized database objects (i.e. parameterized views and/or table functions) wherein runtime filters are manually pushed down (depending on your data platform / semantic layer / BI tool).
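The pitfall is easy to reproduce. A minimal sketch with Python's bundled SQLite (hypothetical fact and hierarchy tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact (acct TEXT, amount INTEGER);
    CREATE TABLE hier (leaf TEXT, hier_id TEXT);
    INSERT INTO fact VALUES ('1000', 10), ('9999', 99);  -- '9999' is not in the hierarchy
    INSERT INTO hier VALUES ('1000', 'H1'), ('1000', 'H2');
""")

# Filtering AFTER the join: the NULL-extended row for '9999' fails hier_id = 'H1',
# so the LEFT OUTER join silently behaves like an INNER join
after = con.execute("""
    SELECT f.acct FROM fact f
    LEFT JOIN hier h ON f.acct = h.leaf
    WHERE h.hier_id = 'H1'
    ORDER BY f.acct
""").fetchall()

# Filtering FIRST (pushed into the hierarchy side) preserves unmatched fact rows
before = con.execute("""
    SELECT f.acct FROM fact f
    LEFT JOIN (SELECT * FROM hier WHERE hier_id = 'H1') h ON f.acct = h.leaf
    ORDER BY f.acct
""").fetchall()

print(after, before)
```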
Important
Various other minor design decisions have been made throughout the code based on typical industry SAP usage, data engineering best practices, etc. Nonetheless, and as with adoption of any open source software, ensure you conduct your own due diligence by reviewing the codebase to ensure any/all such decisions (documented in comments) align with your requirements and standards.
Tip
If you need any help implementing any design decisions discussed above, such as modeling hierarchies as slowly changing dimensions, or any other help, feel free to reach out to @jchesch.
Here are the steps you can follow to try out TreeHouse, assuming you have a Snowflake instance available. (If not, get a free trial here.)
Tip
If you're using Snowflake in production and don't yet have DevOps capabilities established, you can automate the execution of DDL/DML scripts via Snowflake DevOps, Snowflake Terraform, or a wide variety of 3rd party open-source or commercial tools.
Run the DDLs made available in 01_SOURCE_DDL.
Note
In order to mitigate any legal risk associated with notoriously challenging license terms from SAP, sample source DDLs with fully synthetic data have been provided. They only include the columns required for TreeHouse.
Load the sample data provided by running the 02_SOURCE_DATA script.
Create the provided target tables and views via the provided DDL scripts. 03_TARGET_DDL points you to the correct location in the repo.
Create the provided table functions via the provided DDL scripts. 04_ETL points you to the correct location in the repo. Make sure to run them in the order provided.
Create a temp table to simulate user-provided config data via 05_SET_CONFIG_DATA.
Check out how the results look by querying the provided views:
Note
In many cases, SAP hierarchies are unbalanced, i.e. not all paths reach the same leaf level. This isn't necessarily a problem, as joins should always be executed against the `LEAF_NODE_ID` column, but some users may expect to see ancestor nodes coalesced down to any `NULL` inner nodes within any such unbalanced paths.
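For users who do expect that presentation, each `NULL` level can be coalesced down from its nearest populated ancestor. A sketch using Python's bundled SQLite with a hypothetical three-level table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE lvl_hier (lvl_1 TEXT, lvl_2 TEXT, lvl_3 TEXT, leaf TEXT)")
# Unbalanced path: this branch stops at level 2, so LVL_3 is NULL
con.execute("INSERT INTO lvl_hier VALUES ('Root', 'Ops', NULL, 'C100')")

# Fill each missing level with its nearest populated ancestor
rows = con.execute("""
    SELECT lvl_1,
           COALESCE(lvl_2, lvl_1)        AS lvl_2,
           COALESCE(lvl_3, lvl_2, lvl_1) AS lvl_3,
           leaf
    FROM lvl_hier
""").fetchall()

print(rows)
```

The same cascading `COALESCE` pattern can be added to the level hierarchy views if your BI tool's users expect it.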
TreeHouse can easily be expanded to accommodate any type of hierarchy natively supported by SAP.
Hierarchy types map directly to master data entities. Thus, TreeHouse can easily be extended by:
- Adding additional master data tables to the CTE named `_cte_ALL_POSSIBLE_LEAF_NODES` in `UDTF_LEAF_NODES_ALL_POSSIBLE`.
- Adding additional sub-key text tables to the CTE named `_cte_SUB_KEY_ID_TEXTS` in `UDTF_SUB_KEY_TEXTS`.
- Adding corresponding text tables to the CTE named `_cte_NODE_TEXTS` in `UDTF_NODE_TEXTS`.
Also, note that TreeHouse may (possibly) need to be extended to accommodate additional join conditions specified in `SETLEAF.VALFROM`. See the comments in `UDTF_LEAF_NODES()` for more details.
Tip
Given the common patterns followed by any given master data, text, and/or sub-key table from SAP for populating hierarchies, it's worth noting that much (if not all) of the TreeHouse codebase can be abstracted into metadata-driven code, i.e. where additional hierarchy types, sub-key types, etc. can simply be added to a configuration file, and the target ETL code can thus be auto-generated.
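As a sketch of that idea, here is a hypothetical generator that emits the `UNION ALL` branches of a text-lookup function from a config mapping. The registry below holds only two illustrative entries (the SAP table/column names are examples, not a complete mapping):

```python
# Hypothetical registry: hierarchy type -> (table, key column, text column)
TEXT_SOURCES = {
    "PROFIT_CENTER": ("CEPCT", "PRCTR", "KTEXT"),
    "COST_CENTER":   ("CSKT",  "KOSTL", "KTEXT"),
}

def generate_node_texts_sql(sources):
    # Emit one UNION ALL branch per configured type, each conforming to the
    # same output "interface" (NODE_ID, TEXT)
    branches = [
        f"SELECT {key_col} AS NODE_ID, {text_col} AS TEXT FROM {table}\n"
        f"WHERE :TEXT_TYPE = '{text_type}'"
        for text_type, (table, key_col, text_col) in sources.items()
    ]
    return "\nUNION ALL\n".join(branches)

sql = generate_node_texts_sql(TEXT_SOURCES)
print(sql)
```

Adding a new hierarchy type then becomes a one-line config change rather than a code change, with the DDL regenerated and redeployed by tooling.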