- Feature Name: Add MySQL as a data sink in the databuilder
- Start Date: 2021-02-17
- RFC PR: [amundsen-io/rfcs#23](https://github.com/amundsen-io/rfcs/pull/23)
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now)

# Add MySQL as a data sink in the databuilder

## Summary

This RFC proposes adding several modules, namely a serializer, a data loader, and a publisher, to load metadata into MySQL.

## Motivation

Currently, the ETL flow only works for graph databases. In order to support MySQL as a backend metadata store, we need to add a new data loader and publisher to push extracted metadata into MySQL.

## Guide-level Explanation (aka Product Details)

With the new serializer, data loader, and publisher, users can invoke them in the databuilder ETL job if they use MySQL as the backend store.

## UI/UX-level Explanation

N/A

## Reference-level Explanation (aka Technical Details)

1. `mysql_serializer`: serializes record instances of the MySQL ORM
[models](https://github.com/amundsen-io/amundsenrds) to dictionaries. It converts all metadata attributes of a record to
key-value pairs in a dictionary and excludes the internal SQLAlchemy attribute `_sa_instance_state`.
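
As a rough illustration, the serialization step boils down to something like the following sketch; the function name `serialize_record` is illustrative only and not the actual databuilder API:

```python
from typing import Any, Dict


def serialize_record(record: Any) -> Dict[str, Any]:
    """Convert a SQLAlchemy ORM record (an amundsenrds model) into a plain dict."""
    result = dict(vars(record))              # all instance attributes as key/value pairs
    result.pop('_sa_instance_state', None)   # drop SQLAlchemy's internal bookkeeping attribute
    return result
```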

2. `file_system_mysql_csv_loader`: generates CSV files from the extracted and serialized metadata record instances.
This data loader invokes `mysql_serializer` first to get each record in dictionary format and obtains the table name from
the record instance via its `__tablename__` attribute.

The `record_iterator` is forced to yield record instances in topological order (e.g., database -> cluster -> schema -> table -> column),
and the data loader will output CSV files in the same order. In `mysql_csv_publisher`, however, the Python function `listdir()` cannot list files
in that order. The plan is to prefix the file name with an index and a separator (e.g., `1-table_name`) so that all files can be sorted by index when listed in `mysql_csv_publisher`; a loader sketch is shown below.
The index will be derived from the length of the `file_mapping` dictionary, which is used for the CSV writers in the new `file_system_mysql_csv_loader`.

The format of the generated CSV files will be the same as the MySQL table records: field names on the header row, field values on the subsequent rows.
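
A minimal sketch of how the loader could write one indexed CSV file per table follows; the class name and constructor arguments are assumptions for illustration, not the final databuilder interface:

```python
import csv
import os
from typing import Any, Dict, List


class FileSystemMySQLCSVLoader:
    """Illustrative only: writes each ORM record to a per-table CSV file."""

    def __init__(self, directory: str) -> None:
        self._directory = directory
        self._file_mapping: Dict[str, csv.DictWriter] = {}   # table name -> CSV writer
        self._handles: List[Any] = []

    def load(self, record: Any) -> None:
        row = dict(vars(record))
        row.pop('_sa_instance_state', None)                  # same exclusion as mysql_serializer
        table_name = record.__tablename__                     # table name from the ORM model
        if table_name not in self._file_mapping:
            # Index prefix = current size of file_mapping, so files sort back
            # into the original topological load order in the publisher.
            index = len(self._file_mapping)
            path = os.path.join(self._directory, f'{index}-{table_name}.csv')
            handle = open(path, 'w', newline='')
            writer = csv.DictWriter(handle, fieldnames=list(row.keys()))
            writer.writeheader()
            self._file_mapping[table_name] = writer
            self._handles.append(handle)
        self._file_mapping[table_name].writerow(row)

    def close(self) -> None:
        for handle in self._handles:
            handle.close()
```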

3. `mysql_csv_publisher`: reads the CSV files from the given directory and derives the ORM model class from the table name encoded in each file name.
The publisher also sets extra attributes (`published_tag`, `publisher_last_updated_epoch_ms`) on the record instances before calling SQLAlchemy ORM methods to upsert the CSV records into MySQL with the given transaction size.
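
The core publishing loop might look roughly like the sketch below; the model registry, function name, and connection handling are assumptions made here for illustration:

```python
import csv
import os
import time
from typing import Dict, Type

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Hypothetical mapping from __tablename__ to the amundsenrds ORM class.
MODEL_REGISTRY: Dict[str, Type] = {}


def publish(directory: str, conn_string: str, published_tag: str,
            transaction_size: int = 500) -> None:
    session = sessionmaker(bind=create_engine(conn_string))()
    # Sort "1-table_name.csv"-style files by their numeric prefix to restore
    # the topological order in which the loader wrote them.
    file_names = sorted(os.listdir(directory), key=lambda name: int(name.split('-', 1)[0]))
    record_count = 0
    for file_name in file_names:
        table_name = file_name.split('-', 1)[1].rsplit('.', 1)[0]
        model_class = MODEL_REGISTRY[table_name]
        with open(os.path.join(directory, file_name), newline='') as csv_file:
            for row in csv.DictReader(csv_file):
                row['published_tag'] = published_tag
                row['publisher_last_updated_epoch_ms'] = int(time.time() * 1000)
                session.merge(model_class(**row))          # upsert by primary key
                record_count += 1
                if record_count % transaction_size == 0:
                    session.commit()
    session.commit()
    session.close()
```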

4. `mysql_search_data_extractor`: extracts ORM objects along with their SQLAlchemy relationships from MySQL and converts them to the current Elasticsearch document objects. It can then work with the existing `FSElasticsearchJSONLoader` and `ElasticsearchPublisher`
to execute the subsequent es_publish job. It will support `table`, `dashboard`, and `user` search data, and the extraction from MySQL will happen at the method level instead of the query-language level,
i.e., there will be methods such as `search_tables()`, `search_dashboards()`, and `search_users()`, dispatched based on the config. The functionality of each method will mirror
the Cypher queries in the current [neo4j\_search\_data\_extractor](https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/extractor/neo4j_search_data_extractor.py#L23-L115).

Note: the new data extractor may not support custom SQL for search initially.
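
For instance, the config-driven dispatch could be organized roughly as follows; the class, method, and config names here are assumptions based on the description above, not the final API:

```python
from typing import Any, Callable, Dict, Iterator


class MySQLSearchDataExtractor:
    """Illustrative dispatch only; the real extractor would query MySQL via SQLAlchemy."""

    def __init__(self, entity_type: str) -> None:
        # A config value selects which search method runs, mirroring the
        # per-entity Cypher blocks in neo4j_search_data_extractor.
        dispatch: Dict[str, Callable[[], Iterator[Any]]] = {
            'table': self._search_tables,
            'dashboard': self._search_dashboards,
            'user': self._search_users,
        }
        self._search = dispatch[entity_type]

    def extract(self) -> Iterator[Any]:
        # Each method would join the related ORM tables and yield
        # Elasticsearch document objects for the downstream loader/publisher.
        yield from self._search()

    def _search_tables(self) -> Iterator[Any]:
        return iter(())      # placeholder

    def _search_dashboards(self) -> Iterator[Any]:
        return iter(())      # placeholder

    def _search_users(self) -> Iterator[Any]:
        return iter(())      # placeholder
```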

## Drawbacks

`mysql_search_data_extractor` needs joins across multiple tables when fetching MySQL data, which cannot perform as fast as the equivalent traversal in a graph database.
On the other hand, it may not initially support custom SQL from the config to generate ES document objects and will call the default table/dashboard/user search methods instead.

## Alternatives

1. Instead of adding an index to the CSV file name, we could list files by their creation time in `mysql_csv_publisher`. However, the functions for getting a file's creation time, such as `os.path.getctime()` or `stat().st_ctime`, either return only the change time of the file metadata or only work on a specific OS.
Another possible solution is to maintain a constant reflecting the topological order of table names and use it to read the files, but that would require extra maintenance.

2. Instead of reading the table name from the CSV file name, we could put the table name into the serialized dictionary and thus into the CSV files, but that requires a little extra work:
converting the table name to a model class for each record and excluding the table-name field when the publisher dumps CSV content into the MySQL ORM models.

## Prior art

N/A

## Unresolved questions

N/A

## Future possibilities

Based on an offline discussion with Tao Feng, we will consider calling the metadata API to load extracted metadata into the MySQL sink in the future.