- Feature Name: Add MySQL as a data sink in the databuilder
- Start Date: 2021-02-17
- RFC PR: [amundsen-io/rfcs#23](https://github.com/amundsen-io/rfcs/pull/23)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now)

# Add MySQL as a data sink in the databuilder

## Summary

This RFC proposes adding several new modules to the databuilder, namely a serializer, a data loader, and a publisher, to load metadata into MySQL.

## Motivation

Currently, the ETL flow only works for graph databases. In order to support MySQL as a backend metadata store, we need to add a new data loader and publisher to push extracted metadata into MySQL.

## Guide-level Explanation (aka Product Details)

With the new serializer, data loader, and publisher, users can invoke them in the databuilder ETL job if they are using MySQL as the backend store.

## UI/UX-level Explanation

N/A

## Reference-level Explanation (aka Technical Details)

1. `mysql_serializer`: serializes record instances, which are in the format of the MySQL ORM [models](https://github.com/amundsen-io/amundsenrds), to dictionaries. It mainly converts all metadata attributes of a record to key-value pairs in a dictionary and excludes the SQLAlchemy internal attribute, `_sa_instance_state`.
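
A minimal sketch of this serialization idea (the function name is illustrative, not the final module API):

```python
from typing import Any, Dict


def serialize_record(record: Any) -> Dict[str, Any]:
    """Copy an ORM record's attributes into a plain dictionary.

    vars() exposes the instance __dict__, which for SQLAlchemy ORM
    objects also contains the internal '_sa_instance_state' entry,
    so that key is dropped before returning.
    """
    record_dict = dict(vars(record))
    record_dict.pop('_sa_instance_state', None)
    return record_dict
```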

2. `file_system_mysql_csv_loader`: generates csv files from the extracted and serialized metadata record instances. This data loader invokes `mysql_serializer` first to get each record in dictionary format, and gets the table name from the record instance by reading its `__tablename__` attribute.

We force the record_iterator to yield record instances in topological order (e.g., database -> cluster -> schema -> table -> column), and the data loader will output csv files in the same order. In `mysql_publisher`, however, the Python function `listdir()` cannot list files in that order. My plan is to add an index with a separator to each file name (e.g., 1-table_name) so that all files can be sorted by the index when listed in `mysql_publisher`. The index will be set from the length of the `file_mapping` dictionary, which is used for the csv writers in the new `file_system_mysql_csv_loader`; see the sketch below.
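
A hedged sketch of the indexed file naming (the `file_mapping` name comes from this RFC; the helper and its signature are assumptions):

```python
import os


def csv_path_for(record, file_mapping, dir_path):
    """Return the csv path for a record's table, assigning an indexed
    file name (e.g. '1-table_name.csv') the first time a table is seen."""
    table_name = record.__tablename__
    if table_name not in file_mapping:
        # file_mapping grows by one entry per new table, and records arrive
        # in topological order, so the index preserves that order.
        index = len(file_mapping) + 1
        file_mapping[table_name] = os.path.join(dir_path, f'{index}-{table_name}.csv')
    return file_mapping[table_name]
```

On the publisher side, the files can then be ordered with something like `sorted(os.listdir(dir_path), key=lambda name: int(name.split('-', 1)[0]))`.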

The format of the generated csv files will be the same as the MySQL table records, namely field names on the header row and field values on the subsequent rows.
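
For example, a generated file might look like the following (the column names here are purely illustrative, not the actual model schema):

```
name,description,is_view
test_table,an example table,false
```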

3. `mysql_csv_publisher`: reads csv files from the given directory and gets the ORM model class from the table name embedded in each file name. The publisher will also set extra attributes (published_tag, publisher_last_updated_epoch_ms) on the record instances before calling SQLAlchemy ORM methods to upsert the csv records into MySQL with the given transaction size.
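
A rough sketch of that publishing loop, assuming the indexed file names described above (the helper name, the `model_by_table` lookup, and the use of `Session.merge()` as the upsert mechanism are assumptions, not the final implementation):

```python
import csv
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker


def publish_csv_files(dir_path, model_by_table, conn_string, transaction_size=500):
    """Read csv files in index order and upsert their rows via the ORM."""
    session = sessionmaker(bind=create_engine(conn_string))()
    # '1-table_metadata.csv' -> sort key 1, table name 'table_metadata'
    for file_name in sorted(os.listdir(dir_path),
                            key=lambda name: int(name.split('-', 1)[0])):
        table_name = file_name.split('-', 1)[1].rsplit('.', 1)[0]
        model_class = model_by_table[table_name]
        with open(os.path.join(dir_path, file_name), newline='') as f:
            for count, row in enumerate(csv.DictReader(f), start=1):
                # merge() inserts or updates by primary key within the session
                session.merge(model_class(**row))
                if count % transaction_size == 0:
                    session.commit()
        session.commit()
```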

4. `mysql_search_data_extractor`: extracts ORM objects together with their SQLAlchemy relationships from MySQL and converts them to the current Elasticsearch document objects. It can then work with the existing `FSElasticsearchJSONLoader` and `ElasticsearchPublisher` to execute the subsequent es_publish job. It will support `table`, `dashboard`, and `user` data search, and the extraction from MySQL will happen at the method level instead of the query-language level. That is, there will be methods like `search_tables()`, `search_dashboards()`, and `search_users()`, dispatched based on the config. The functionality of each method will mirror the Cypher queries in the current [neo4j_search_data_extractor](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/neo4j_search_data_extractor.py#L23-L115).
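
A rough sketch of the method-level dispatch (the config key and class shape are assumptions based on this description, not a settled interface):

```python
class MySQLSearchDataExtractor:
    """Dispatches to a per-entity search method chosen via config."""

    def __init__(self, conf: dict):
        # e.g. the config could map 'entity_type' to 'table', 'dashboard' or 'user'
        entity_type = conf.get('entity_type', 'table')
        self._search = {
            'table': self._search_tables,
            'dashboard': self._search_dashboards,
            'user': self._search_users,
        }[entity_type]

    def extract(self):
        # Yields Elasticsearch document objects built from ORM query results.
        yield from self._search()

    def _search_tables(self):
        return iter(())  # query table models plus relationships, build table documents

    def _search_dashboards(self):
        return iter(())  # likewise for dashboard documents

    def _search_users(self):
        return iter(())  # likewise for user documents
```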

Note: the new data extractor may not initially support custom SQL for search.

## Drawbacks

`mysql_search_data_extractor` would involve joins among multiple tables when fetching MySQL data, which cannot perform as fast as the equivalent traversals in graph databases. On the other hand, it may not initially support custom SQL from the config for generating es document objects, and will call the default table/dashboard/user search methods instead.

## Alternatives

1. Instead of adding an index to the csv file name, we could list files based on their create_time in `mysql_csv_publisher`. However, the methods for getting a file's create time, like `os.path.getctime()` or `stat().st_ctime`, either return only the change time of the file metadata or only work on a specific OS. Another possible solution is to maintain a constant reflecting the topological order of table names, which could be used to read the files, but that would need extra maintenance.

2. Instead of reading the table name from the csv file name, we could put the table name into the serialized dictionary and then into the csv files, but that would need a little extra work: converting the table name to a model class for each record, and excluding the table-name field when the publisher dumps csv content into the MySQL ORM models.

## Prior art

N/A

## Unresolved questions

N/A

## Future possibilities

According to an offline talk with Tao Feng, we will consider calling the metadata API to load extracted metadata into the MySQL sink in the future.
