|
| 1 | +- Feature Name: lineage_stage_0 |
| 2 | +- Start Date: 2021-02-22 |
| 3 | +- RFC PR: [amundsen-io/rfcs#25](https://github.com/amundsen-io/rfcs/pull/25) |
| 4 | +- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) |
| 5 | +# Amundsen Lineage - Stage 0 |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | + |
| 10 | +Currently Amundsen doesn't have a way of surfacing lineage information for tables and columns. The idea for this first iteration is to have a way to show upstream and downstream tables and columns to users through the Table Details page so they can explore the current resource's lineage as well as navigate to related resources in Amundsen. |
| 11 | +The first iteration is meant to be a fast implementation of the feature that we can get feedback on and improve in future iterations. |
| 12 | + |
| 13 | +## Motivation |
| 14 | + |
| 15 | +Lineage is essential to improving data discovery in Amundsen because it allows users to know where the data for a given resource is coming from as well as where this data is used downstream. |
| 16 | + |
| 17 | + |
| 18 | +## Guide-level Explanation (aka Product Details) |
| 19 | + |
| 20 | +### New Concepts |
| 21 | +- Lineage: Lineage is a term that describes the flow of data from one entity to another. While this term can broadly include everything from services, events, ETLs, and dashboards, we will focus on table-to-table and column-to-column data lineage in this RFC. |
| 22 | +- Upstream: Upstream is a relative term that describes data sources from which we inherit. Data flows from upstream to downstream. |
| 23 | +- Downstream: Downstream is a relative term that describes data entities which consume our data. |
| 24 | + |
| 25 | +This feature will expose upstream and downstream tables and columns within the `Table Details` page. |
| 26 | + |
| 27 | +Those implementing Amundsen should keep in mind that this feature is meant to give them a way to surface their existing lineage data by calling the service that holds that data from the metadata service. This iteration won't provide a model to persist lineage in neo4j, but rather a gateway to lineage data so it can be included in lineage API responses and displayed in the frontend. It is also important to understand that the feature will be disabled by default and can be enabled through configuration. |
| 28 | + |
| 29 | + |
| 30 | +## UI/UX-level Explanation |
| 31 | + |
| 32 | + |
| 33 | + |
| 34 | +We will add two additional tabs to the `Table Details` page, `Upstream` and `Downstream`. The `Upstream` tab will list the tables from which data is inherited, and the `Downstream` tab will list the tables that consume this table's data. This allows users to view a table's lineage in a very simple manner. |
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | +Additionally, we will add lineage information at the column level, viewable by expanding column metadata. |
| 39 | + |
| 40 | +These features will only appear when the lineage feature is enabled. |
| 41 | + |
| 42 | +## Reference-level Explanation (aka Technical Details) |
| 43 | +### Architecture |
| 44 | + |
| 45 | + |
| 46 | + |
| 47 | +Implementing this feature will require defining a Lineage API on the metadata service for Tables and Columns. When the API is called, it will make calls to neo4j and to whichever service holds the implementer's lineage data. An interface needs to be created to interact with an implementer's lineage service in a generic way. The data from the calls to these services will be put together to form the lineage response as defined below. |
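
As a rough illustration of the generic interface described above, the sketch below defines an abstract client and a toy in-memory implementation. Everything here is an assumption made for illustration: the names `BaseLineageClient`, `InMemoryLineageClient`, and `get_lineage` are hypothetical and are not part of this RFC or of Amundsen, and a real implementation would call the implementer's lineage service instead of walking a dict.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Lineage = Dict[str, List[Dict[str, Any]]]


class BaseLineageClient(ABC):
    """Hypothetical gateway to an implementer's lineage service."""

    @abstractmethod
    def get_lineage(self, key: str, direction: str, depth: int) -> Lineage:
        """Return upstream and/or downstream entities for a resource key."""


class InMemoryLineageClient(BaseLineageClient):
    """Toy implementation backed by a dict, for illustration only."""

    def __init__(self, edges: Dict[str, List[str]]) -> None:
        # edges maps a key to the keys of its direct downstream resources.
        self._down = edges
        # Build the reverse mapping so upstream lookups are possible.
        self._up: Dict[str, List[str]] = {}
        for parent, children in edges.items():
            for child in children:
                self._up.setdefault(child, []).append(parent)

    @staticmethod
    def _walk(graph: Dict[str, List[str]], start: str, depth: int) -> List[Dict[str, Any]]:
        # Breadth-first walk that records the level at which each key is first seen.
        entities: List[Dict[str, Any]] = []
        frontier, seen = [start], {start}
        for level in range(1, depth + 1):
            next_frontier = []
            for node in frontier:
                for neighbor in graph.get(node, []):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        entities.append({"key": neighbor, "level": level})
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return entities

    def get_lineage(self, key: str, direction: str, depth: int) -> Lineage:
        want_up = direction in ("upstream", "both")
        want_down = direction in ("downstream", "both")
        return {
            "upstream_entities": self._walk(self._up, key, depth) if want_up else [],
            "downstream_entities": self._walk(self._down, key, depth) if want_down else [],
        }
```

With a separation like this, the metadata service could depend only on the abstract class, and each installation would supply its own subclass through configuration.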
| 48 | + |
| 49 | +### Backend Implementation |
| 50 | + |
| 51 | +#### Table Lineage API |
| 52 | + |
| 53 | +_The table details page must list X levels of upstream and downstream datasets on the UPSTREAM and DOWNSTREAM tabs, showing each dataset's name, level, source (database), and badges. These datasets should also be sortable by usage._ |
| 54 | + |
| 55 | +When the user clicks the DOWNSTREAM or UPSTREAM tab on the table details page, one of the following two requests is sent to the metadata service, specifying the lineage direction (upstream/downstream) and depth (levels): |
| 56 | + |
| 57 | +```https://amundsenmetadata.com/table/current_table_key/lineage?direction=upstream&depth=1``` |
| 58 | +OR |
| 59 | +```https://amundsenmetadata.com/table/current_table_key/lineage?direction=downstream&depth=1``` |
| 60 | +The lineage call will return the corresponding response: |
| 61 | +``` |
| 62 | +{ |
| 63 | +  "key": "current_table_key", |
| 64 | +  "direction": "upstream", |
| 65 | +  "upstream_entities": [ |
| 66 | +    { |
| 67 | +      "table": "table_key1", |
| 68 | +      "level": 1, |
| 69 | +      "source": "hive", |
| 70 | +      "badges": ["core", "beta"], |
| 71 | +      "usage": 234 |
| 72 | +    }, |
| 73 | +    ... |
| 74 | +  ], |
| 75 | +  "downstream_entities": [] |
| 76 | +} |
| 77 | +``` |
| 78 | +OR |
| 79 | +``` |
| 80 | +{ |
| 81 | +  "key": "current_table_key", |
| 82 | +  "direction": "downstream", |
| 83 | +  "upstream_entities": [], |
| 84 | +  "downstream_entities": [ |
| 85 | +    { |
| 86 | +      "table": "table_key2", |
| 87 | +      "level": 1, |
| 88 | +      "source": "hive", |
| 89 | +      "badges": [], |
| 90 | +      "usage": 45 |
| 91 | +    }, |
| 92 | +    ... |
| 93 | +  ] |
| 94 | +} |
| 95 | +``` |
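
To make the request and the "sortable by usage" requirement concrete, here is a minimal sketch of how a caller might build the table lineage URL and order the returned entities. The helper names `table_lineage_url` and `sort_by_usage` are hypothetical, the host is taken from the example URLs above, and percent-encoding the table key is an assumption rather than something this RFC specifies.

```python
from typing import Any, Dict, List
from urllib.parse import quote, urlencode

# Host taken from the example URLs in this RFC; a real deployment would configure it.
METADATA_BASE = "https://amundsenmetadata.com"


def table_lineage_url(table_key: str, direction: str, depth: int = 1) -> str:
    """Build the table lineage request URL (hypothetical helper)."""
    if direction not in ("upstream", "downstream"):
        raise ValueError(f"unsupported direction: {direction}")
    query = urlencode({"direction": direction, "depth": depth})
    # Assumption: the key is percent-encoded so it fits in a single path segment.
    return f"{METADATA_BASE}/table/{quote(table_key, safe='')}/lineage?{query}"


def sort_by_usage(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the entities for the response's direction, most used first."""
    entities = response[f"{response['direction']}_entities"]
    return sorted(entities, key=lambda entity: entity["usage"], reverse=True)


example_response = {
    "key": "current_table_key",
    "direction": "upstream",
    "upstream_entities": [
        {"table": "table_key2", "level": 1, "source": "hive", "badges": [], "usage": 45},
        {"table": "table_key1", "level": 1, "source": "hive", "badges": ["core", "beta"], "usage": 234},
    ],
    "downstream_entities": [],
}

print(table_lineage_url("current_table_key", "upstream"))
print([entity["table"] for entity in sort_by_usage(example_response)])
```

Sorting on the client side is only one option; the service could equally return entities already ordered by usage.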
| 96 | +#### Column Lineage API |
| 97 | +_The expanded view of a column in the table details page must display lists of upstream and downstream columns at the same time._ |
| 98 | +When the user expands the column to see more details, a single request covering both directions will be sent to the metadata service as follows: |
| 99 | +```https://amundsenmetadata.com/table/current_table_key/column/column_name/lineage?direction=both&depth=1``` |
| 100 | +and the lineage call will return a response: |
| 101 | +``` |
| 102 | +{ |
| 103 | +  "key": "current_table_key/current_column_name", |
| 104 | +  "direction": "both", |
| 105 | +  "upstream_entities": [ |
| 106 | +    { |
| 107 | +      "key": "table_key1/column_name1", |
| 108 | +      "level": 1, |
| 109 | +      "source": "hive", |
| 110 | +      "usage": 234 |
| 111 | +    }, |
| 112 | +    ... |
| 113 | +  ], |
| 114 | +  "downstream_entities": [ |
| 115 | +    { |
| 116 | +      "key": "table_key2/column_name2", |
| 117 | +      "level": 1, |
| 118 | +      "source": "hive", |
| 119 | +      "usage": 45 |
| 120 | +    }, |
| 121 | +    ... |
| 122 | +  ] |
| 123 | +} |
| 124 | +``` |
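
Since column entities are identified by a combined `table_key/column_name` key, a consumer needs to split that key to link back to the owning table. The sketch below shows one way to do this; the helper names are hypothetical, and splitting on the last `/` is an assumption made here because table keys may themselves contain slashes.

```python
from typing import Any, Dict, List, Tuple


def split_column_key(key: str) -> Tuple[str, str]:
    """Split 'table_key/column_name' on the last slash (hypothetical helper)."""
    table_key, _, column_name = key.rpartition("/")
    if not table_key or not column_name:
        raise ValueError(f"not a column key: {key!r}")
    return table_key, column_name


def column_lineage_links(response: Dict[str, Any]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each direction to (table_key, column_name) pairs for display."""
    return {
        "upstream": [split_column_key(e["key"]) for e in response["upstream_entities"]],
        "downstream": [split_column_key(e["key"]) for e in response["downstream_entities"]],
    }


example_response = {
    "key": "current_table_key/current_column_name",
    "upstream_entities": [
        {"key": "table_key1/column_name1", "level": 1, "source": "hive", "usage": 234},
    ],
    "downstream_entities": [
        {"key": "table_key2/column_name2", "level": 1, "source": "hive", "usage": 45},
    ],
}

print(column_lineage_links(example_response))
```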
| 125 | +## Drawbacks |
| 126 | +> Why should we _not_ do this? |
| 127 | +> Please consider: |
| 128 | +> Implementation cost, both in terms of code size and complexity |
| 129 | +> Integration of this feature with other existing and planned features |
| 130 | +> The impact on onboarding and learning about Amundsen |
| 131 | +> Cost of migrating existing Amundsen installations (is it a breaking change?) |
| 132 | +> If there are tradeoffs to choosing any path. Attempt to identify them here. |
| 133 | +## Alternatives |
| 134 | +> Why is this design the best in the space of possible designs? |
| 135 | +> What other designs have been considered and what is the rationale for not choosing them? |
| 136 | +> What is the impact of not doing this? |
| 137 | +## Prior art |
| 138 | +> Discuss prior art, both the good and the bad, in relation to this proposal. A few examples of what this can include are: |
| 139 | +> Does this feature exist in other data search applications and what experience has their community had? |
| 140 | +> For community proposals: Is this done by some other community and what were their experiences with it? |
| 141 | +> Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background. |
| 142 | +> This section is intended to encourage you as an author to think about the lessons from other projects and to provide readers of your RFC with a fuller picture. If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or an adaptation from other projects. |
| 143 | +## Unresolved questions |
| 144 | +> What parts of the design do you expect to resolve through the RFC process before this gets merged? |
| 145 | +> What parts of the design do you expect to resolve through the implementation of this feature before stabilization? |
| 146 | +> What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC? |
| 147 | +## Future possibilities |
| 148 | +> Think about what the natural extension and evolution of your proposal would be and how it would affect the project as a whole in a holistic way. Also consider how this all fits into the roadmap for the project and of the relevant sub-team. |
| 149 | +> This is also a good place to "dump ideas", if they are out of scope for the RFC you are writing but otherwise related. |
| 150 | +> If you have tried and cannot think of any future possibilities, you may simply state that you cannot think of anything. |
| 151 | +- Persist lineage data in neo4j: create extractors for the databuilder library to extract the data and publish it |
| 152 | +- Implement a lineage graph view for a better discovery experience |
| 153 | +- Introduce a Task entity in Amundsen to surface pipeline tasks between tables and columns and to understand which tasks are responsible for generating tables |