Commit e52fb0d

allisonsuarez and Daniel Won authored
Lineage Stage 0 (#25)
* started writing rfc and added dir
  Signed-off-by: Allison Suarez Miranda <[email protected]>
* made small change on README and added more to RFC will create PR to get number now
  Signed-off-by: Allison Suarez Miranda <[email protected]>
* had to redo this
  Signed-off-by: Allison Suarez Miranda <[email protected]>
* reverted weird changes
  Signed-off-by: Allison Suarez Miranda <[email protected]>
* changed naming again
  Signed-off-by: Allison Suarez Miranda <[email protected]>
* Added details to rfc 025 lineage.
  - Added photos into a nested assets/ folder.
  Signed-off-by: Daniel Won <[email protected]>
* Fixed relative image links in rfc 025
  Signed-off-by: Daniel Won <[email protected]>

Co-authored-by: Daniel Won <[email protected]>
1 parent e7c09cd commit e52fb0d

4 files changed, +153 -0 lines changed

assets/025/column-lineage-preview.png (116 KB)
assets/025/lineage-arch.png (30.2 KB)
assets/025/table-lineage-preview.png (327 KB)

rfcs/025-lineage-stage-0.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
- Feature Name: lineage_stage_0
- Start Date: 2021-02-22
- RFC PR: [amundsen-io/rfcs#25](https://github.com/amundsen-io/rfcs/pull/25)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000)
# Amundsen Lineage - Stage 0

## Summary

Amundsen currently has no way of surfacing lineage information for tables and columns. This first iteration shows users the upstream and downstream tables and columns of a resource on the `Table Details` page, so they can explore the current resource's lineage and navigate to related resources in Amundsen.
It is meant to be a fast implementation of the feature that we can gather feedback on and improve in future iterations.

## Motivation

Lineage is essential to improving data discovery in Amundsen because it allows users to know where the data for a given resource is coming from as well as where this data is used downstream.


## Guide-level Explanation (aka Product Details)

### New Concepts
- Lineage: Lineage is a term that describes the flow of data from one entity to another. While this term can broadly include everything from services, events, ETLs, and dashboards, we will focus on table-to-table and column-to-column data lineage in this RFC.
- Upstream: Upstream is a relative term that describes the data sources a given entity inherits from. Data flows from upstream to downstream.
- Downstream: Downstream is a relative term that describes the data entities that consume a given entity's data.

This feature will expose upstream and downstream tables and columns within the `Table Details` page.

Those implementing Amundsen should keep in mind that this feature gives them a way to surface their existing lineage data: the metadata service calls the service that holds that data. This iteration does not provide a model for persisting lineage in neo4j; instead it provides a gateway to lineage data so that the data can be included in lineage API responses and displayed in the frontend. The feature is disabled by default and can be enabled through configuration.


## UI/UX-level Explanation

![Table Lineage Preview](../assets/025/table-lineage-preview.png)

We will add two tabs to the `Table Details` page, `Upstream` and `Downstream`. The `Upstream` tab lists the tables the current table inherits data from, and the `Downstream` tab lists the tables that consume its data. This lets users view a table's lineage in a very simple manner.

![Column Lineage Preview](../assets/025/column-lineage-preview.png)

Additionally, we will add lineage information at the column level, viewable by expanding column metadata.

These features will only appear when the lineage feature is enabled.

## Reference-level Explanation (aka Technical Details)
### Architecture

![Lineage Stage 0 Architecture](../assets/025/lineage-arch.png)

Implementing this feature requires defining a Lineage API on the metadata service for tables and columns. When the API is called, it makes calls to neo4j and to whatever service is the source of lineage data. An interface needs to be created so the metadata service can interact with an implementer's lineage service in a generic way. The data returned by these calls is combined to form the lineage response defined below.
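
As a rough illustration of the generic interface described above, the following Python sketch shows one possible shape for it. Every name here (`BaseLineageClient`, `StaticLineageClient`, `build_lineage_response`, `lookup_metadata`) is hypothetical and not part of Amundsen; `lookup_metadata` merely stands in for whatever neo4j query supplies source, badges, and usage.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List

Entity = Dict[str, Any]


class BaseLineageClient(ABC):
    """Hypothetical interface to an implementer's lineage service."""

    @abstractmethod
    def get_related(self, key: str, direction: str, depth: int) -> List[Entity]:
        """Return entities such as {"table": ..., "level": ...} for one direction."""


class StaticLineageClient(BaseLineageClient):
    """Toy client backed by in-memory dicts; a real one would call a service."""

    def __init__(self, upstream: Dict[str, List[str]], downstream: Dict[str, List[str]]) -> None:
        self._edges = {"upstream": upstream, "downstream": downstream}

    def get_related(self, key: str, direction: str, depth: int) -> List[Entity]:
        # Only direct neighbours (level 1) are modelled in this sketch.
        if depth < 1:
            return []
        return [{"table": k, "level": 1} for k in self._edges[direction].get(key, [])]


def build_lineage_response(
    key: str,
    direction: str,
    depth: int,
    client: BaseLineageClient,
    lookup_metadata: Callable[[str], Entity],
) -> Dict[str, Any]:
    """Merge lineage-service results with metadata (assumed to come from neo4j)."""
    response: Dict[str, Any] = {
        "key": key,
        "direction": direction,
        "upstream_entities": [],
        "downstream_entities": [],
    }
    directions = ["upstream", "downstream"] if direction == "both" else [direction]
    for d in directions:
        for entity in client.get_related(key, d, depth):
            # Enrich each lineage entry with source, badges, usage, etc.
            response[f"{d}_entities"].append({**entity, **lookup_metadata(entity["table"])})
    return response
```

Keeping the lineage source behind an abstract class means an implementer only has to supply `get_related` for their own service, while the response assembly stays the same.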

### Backend Implementation

#### Table Lineage API

_The table details page must list X levels of downstream and upstream datasets on the DOWNSTREAM and UPSTREAM tabs, showing each dataset's name, level, source (database), and badges. These datasets should also be sortable by usage._
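
To make "level" and "depth" concrete, here is a minimal Python sketch of how a lineage source might assign levels with a breadth-first walk. It assumes lineage is available as a simple adjacency mapping, which this RFC does not specify; `lineage_levels` and the `edges` structure are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple


def lineage_levels(edges: Dict[str, List[str]], start: str, depth: int) -> List[Tuple[str, int]]:
    """Breadth-first walk returning (table_key, level) pairs up to `depth` levels."""
    seen = {start}
    queue = deque([(start, 0)])
    found: List[Tuple[str, int]] = []
    while queue:
        key, level = queue.popleft()
        if level == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in edges.get(key, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append((neighbour, level + 1))
                queue.append((neighbour, level + 1))
    return found
```

With `depth=1` only direct neighbours are returned; each extra level of depth adds the neighbours of the previous level, and a table reachable by several paths is reported once, at its shortest distance.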

When the user clicks the DOWNSTREAM or UPSTREAM tab on the table details page, one of two requests is sent to the metadata service, carrying the lineage direction (upstream/downstream) and depth (number of levels):

```https://amundsenmetadata.com/table/current_table_key/lineage?direction=upstream&depth=1```
OR
```https://amundsenmetadata.com/table/current_table_key/lineage?direction=downstream&depth=1```
The lineage call then returns a response:
```
{
    "key": "current_table_key",
    "direction": "upstream",
    "upstream_entities": [
        {
            "table": "table_key1",
            "level": 1,
            "source": "hive",
            "badges": ["core", "beta"],
            "usage": 234
        },
        ...
    ],
    "downstream_entities": []
}
```
OR
```
{
    "key": "current_table_key",
    "direction": "downstream",
    "upstream_entities": [],
    "downstream_entities": [
        {
            "table": "table_key2",
            "level": 1,
            "source": "hive",
            "badges": [],
            "usage": 45
        },
        ...
    ]
}
```
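
The request and response above could be handled on the client side roughly as in the Python sketch below. The helper names `table_lineage_url` and `entities_by_usage` are hypothetical, the host is the example one used in this RFC, and percent-encoding the table key is an assumption about how keys would be passed in the path.

```python
from typing import Any, Dict, List
from urllib.parse import quote, urlencode

METADATA_BASE = "https://amundsenmetadata.com"  # example host used in this RFC


def table_lineage_url(table_key: str, direction: str, depth: int = 1) -> str:
    """Build the table lineage request URL for one direction."""
    if direction not in ("upstream", "downstream"):
        raise ValueError(f"unsupported direction: {direction!r}")
    query = urlencode({"direction": direction, "depth": depth})
    # Assumption: the key is percent-encoded so it stays a single path segment.
    return f"{METADATA_BASE}/table/{quote(table_key, safe='')}/lineage?{query}"


def entities_by_usage(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the entities for the response's direction, most used first."""
    entities = response[f"{response['direction']}_entities"]
    return sorted(entities, key=lambda e: e.get("usage", 0), reverse=True)
```

Sorting on the client keeps the API response order-agnostic; whether sorting by usage belongs in the frontend or the metadata service is left open here.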
#### Column Lineage API
_The expanded view of a column in the table details page must display lists of upstream and downstream columns at the same time._
When the user expands a column to see more details, a single request covering both directions is sent to the metadata service:
```https://amundsenmetadata.com/table/current_table_key/column/column_name/lineage?direction=both&depth=1```
and the lineage call will return a response:
```
{
    "key": "current_table_key/current_column_name",
    "direction": "both",
    "upstream_entities": [
        {
            "key": "table_key1/column_name1",
            "level": 1,
            "source": "hive",
            "usage": 234
        },
        ...
    ],
    "downstream_entities": [
        {
            "key": "table_key2/column_name2",
            "level": 1,
            "source": "hive",
            "usage": 45
        },
        ...
    ]
}
```
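
A consumer of the column response needs to split each `table_key/column_name` key back into its parts. The Python sketch below shows one way to do that; `ColumnRef`, `split_column_key`, and `flatten_column_lineage` are hypothetical names, and splitting on the last `/` is an assumption made in case table keys themselves contain slashes.

```python
from typing import Any, Dict, List, NamedTuple, Tuple


class ColumnRef(NamedTuple):
    table_key: str
    column_name: str
    level: int
    direction: str


def split_column_key(key: str) -> Tuple[str, str]:
    """Split "table_key/column_name" on the last slash."""
    table_key, sep, column_name = key.rpartition("/")
    if not sep or not table_key or not column_name:
        raise ValueError(f"not a column key: {key!r}")
    return table_key, column_name


def flatten_column_lineage(response: Dict[str, Any]) -> List[ColumnRef]:
    """Collect upstream and downstream columns from one response into a flat list."""
    refs: List[ColumnRef] = []
    for direction in ("upstream", "downstream"):
        for entity in response.get(f"{direction}_entities", []):
            table_key, column_name = split_column_key(entity["key"])
            refs.append(ColumnRef(table_key, column_name, entity["level"], direction))
    return refs
```

Each `ColumnRef` carries the table key needed to link the expanded column view to the related table's details page.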
## Drawbacks
> Why should we _not_ do this?
> Please consider:
> Implementation cost, both in terms of code size and complexity
> Integration of this feature with other existing and planned features
> The impact on onboarding and learning about Amundsen
> Cost of migrating existing Amundsen installations (is it a breaking change?)
> If there are tradeoffs to choosing any path, attempt to identify them here.
## Alternatives
> Why is this design the best in the space of possible designs?
> What other designs have been considered and what is the rationale for not choosing them?
> What is the impact of not doing this?
## Prior art
> Discuss prior art, both the good and the bad, in relation to this proposal. A few examples of what this can include are:
> Does this feature exist in other data search applications and what experience has their community had?
> For community proposals: Is this done by some other community and what were their experiences with it?
> Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.
> This section is intended to encourage you as an author to think about the lessons from other projects and to provide readers of your RFC with a fuller picture. If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or an adaptation from other projects.
## Unresolved questions
> What parts of the design do you expect to resolve through the RFC process before this gets merged?
> What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
> What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
## Future possibilities
> Think about what the natural extension and evolution of your proposal would be and how it would affect the project as a whole in a holistic way. Also consider how this all fits into the roadmap for the project and of the relevant sub-team.
> This is also a good place to "dump ideas", if they are out of scope for the RFC you are writing but otherwise related.
> If you have tried and cannot think of any future possibilities, you may simply state that you cannot think of anything.
- Persist lineage data on neo4j: create extractors for the databuilder library to extract the data and publish it
- Implement a lineage graph view for a better discovery experience
- Introduce a Task entity in Amundsen to surface pipeline tasks between tables and columns and to understand which tasks are responsible for generating tables
