[GEN-2025] table schema update staging mode #201
Conversation
Future consideration, for scripts that haven't been linted before: I suggest opening a PR that only lints, then branching off of that PR, so that we can easily review any critical changes that are unrelated to the linting changes.
🔥 Great work here. I didn't look super closely at the logic, but it is much better to have non-production runs update only the staging tables, so it more closely resembles what actually happens when we run production.
```python
checkbox_index = df_choices.index[df_choices.type == "checkbox"]
df.loc[non_checkbox_index, 'synColSize'] = df_choices.loc[non_checkbox_index, 'max_len']
# for non-checkbox type, update synColSize only since no new columns will be added
if len(non_checkbox_index) > 0:
```
Add conditions to update non_checkbox column size to align with checkbox columns.
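A minimal, self-contained sketch of this non-checkbox update, with made-up toy data (the column names mirror the script; the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; shared index aligns df with df_choices
df = pd.DataFrame({"variable": ["age", "race"], "synColSize": [10, 5]})
df_choices = pd.DataFrame({"type": ["text", "checkbox"], "max_len": [50, 20]})

# Select non-checkbox rows and copy max_len into synColSize for those rows only
non_checkbox_index = df_choices.index[df_choices.type != "checkbox"]
if len(non_checkbox_index) > 0:
    df.loc[non_checkbox_index, "synColSize"] = df_choices.loc[non_checkbox_index, "max_len"]

print(df["synColSize"].tolist())  # [50, 5]
```

The checkbox row is left untouched because checkbox variables get handled separately (new columns may be added for them).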
@danlu1 One of the acceptance criteria for this ticket was to document what the scripts are doing. Did you apply this by updating the boilerplate, the docstrings of the scripts, and the
```python
vars_to_update["synColSize"] = vars_to_update["max_len"]

# Update numCols and colLabels for checkbox type if any changes
vars_checkbox_update = vars_with_choices.query(
```
Simplify this section.
```python
vars_checkbox_update = vars_with_choices.query(
    'type == "checkbox" and ((choices_num > numCols) | (numCols.isna()))'
)
vars_checkbox_update["synColSize"] = vars_checkbox_update["max_len"]
```
Also update synColSize for checkbox variables for later subsetting.
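For reference, a toy illustration of this checkbox selection and the synColSize carry-over (the data is made up; only the query string matches the script):

```python
import pandas as pd

# Hypothetical variables: one checkbox whose choice count grew, one unchanged
# checkbox, and one non-checkbox variable
vars_with_choices = pd.DataFrame(
    {
        "variable": ["drugs", "sites", "age"],
        "type": ["checkbox", "checkbox", "text"],
        "choices_num": [5, 3, None],
        "numCols": [3, 3, None],
        "max_len": [12, 8, 4],
    }
)

# Keep checkbox variables that gained choices or have no column count yet
vars_checkbox_update = vars_with_choices.query(
    'type == "checkbox" and ((choices_num > numCols) | (numCols.isna()))'
)
# Carry over max_len as synColSize for later subsetting
vars_checkbox_update = vars_checkbox_update.assign(
    synColSize=vars_checkbox_update["max_len"]
)
print(vars_checkbox_update["variable"].tolist())  # ['drugs']
```

Note that `numCols.isna()` inside `query` relies on pandas falling back to the python engine when numexpr cannot evaluate a method call.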
```python
vars_checkbox_update["numCols"] = vars_checkbox_update["choices_num"]
vars_checkbox_update["colLabels"] = vars_checkbox_update["choices_key"]

# Combine variables for update, avoiding duplicates
```
Changed the logic of combining checkbox and non-checkbox variables so it's more straightforward.
```python
# update: variable, synColSize, numCols
# add: variable, instrument, dataType='curated', type, label, cohort-dd, synColType, synColSize, numCols
if not dry_run:
```
Updated this section to adopt the new table manipulation methods described in the synapseclient tutorial.
```python
)
# TODO: variables to update; waiting for stats team
vars_to_update_df = pandas.DataFrame()
```
Set this to an empty DataFrame to resolve the type issue.
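A small sketch of why the empty-DataFrame placeholder helps (the variable name matches the script; the downstream checks are illustrative assumptions):

```python
import pandas as pd

# Placeholder for "no updates yet": an empty DataFrame instead of the old ""
vars_to_update_df = pd.DataFrame()

# Downstream code can use one code path regardless of whether updates exist
assert isinstance(vars_to_update_df, pd.DataFrame)
assert vars_to_update_df.empty  # no rows yet; safe to skip the update step

# A string sentinel ("") would break DataFrame-only operations,
# e.g. "".empty raises AttributeError
```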
@rxu17 To achieve this, I extended the docstrings, the comments in the code, and the README. Also, I'm working on a Confluence page and am halfway through. Will tag you when it's ready.
```python
    return pandas.DataFrame(temp_df_list)


def _get_latest_table(form) -> str:
```
This function was added to extract the latest table for a form, which will be modified if any new columns need to be added.
```python
tbl_with_least_cols = current_cols_df['table_id'].value_counts()
tbl_with_least_cols_id = tbl_with_least_cols.idxmin()
tbl_with_least_cols_ct = tbl_with_least_cols.min()
# get the table id for the newest table
```
Modified to get the newest table instead.
```python
    syn, table_id=TABLE_INFO["bpc"][0], condition=TABLE_INFO["bpc"][1]
)
bpc_table_view = bpc_table_view[["id", "name"]]
## comment out irr for now in case it is needed for phase 3
```
Commented out for now because we are not sure whether we will need to update the irr tables in phase 3.
Thanks again for doing this! A lot of hefty and complex code that you had to trudge through and modify/test! Left comments
```python
Get the synapse id of the latest table for each form
Args:
    form (tuple): the tuple of the form and its corresponding tables output by the
        pandas.DataFrame.groupby function, such as master_table_view.groupby("form").
```
So this form tuple could contain a variable number of tables, based on this description?
Hmm, could you put this example in the docstring or Confluence doc (and maybe link to it)? It seems like a more complex data structure.
I will put this in the confluence page.
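For context, a minimal hypothetical illustration of that tuple structure: iterating `pandas.DataFrame.groupby` yields `(group_name, sub-DataFrame)` pairs, so each form tuple can carry a variable number of table rows (the data below is made up):

```python
import pandas as pd

# Hypothetical master table view: two tables for one form, one for another
master_table_view = pd.DataFrame(
    {
        "form": ["pathology", "pathology", "imaging"],
        "id": ["syn1", "syn2", "syn3"],
        "name": ["Prissmm Pathology Part 1", "Prissmm Pathology Part 2", "Imaging"],
    }
)

# form[0] is the form name, form[1] is the sub-DataFrame of its tables
forms = {form[0]: len(form[1]) for form in master_table_view.groupby("form")}
print(forms)  # {'imaging': 1, 'pathology': 2}
```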
```python
    return form[1]["id"].values[0]
else:
    # get the table with the latest numeric suffix
    latest_part_number = max(form[1]["name"].str.extract(r"(\d+)$")[0].astype(int))
```
I'm not too familiar with this. How exactly does this get the latest table? What is latest_part_number? Is this a Synapse table attribute?
This is the number in the table name. For example, for the Prissmm Pathology tables, it gives 6 because Prissmm Pathology Part 6 has the largest part number.
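A toy version of this suffix logic, using made-up table names: the regex captures the trailing digits of each name, and the maximum identifies the newest table.

```python
import pandas as pd

# Hypothetical table names for one form
names = pd.Series(
    ["Prissmm Pathology Part 1", "Prissmm Pathology Part 5", "Prissmm Pathology Part 6"]
)

# r"(\d+)$" captures the digits at the end of each name; column 0 holds the match
latest_part_number = max(names.str.extract(r"(\d+)$")[0].astype(int))
print(latest_part_number)  # 6
```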
I added the example in the code comment.
Right, so you're saying Prissmm Pathology Part 6 is the latest table over Prissmm Pathology Part 5?
Logically speaking, yes. I confirmed with Chelsea here. I asked: "Judging simply from the table name, I assume we will only save new columns and rows to a newer table (part 3) if the older tables (parts 1 and 2) are full (due to either row or column number restrictions)," and Chelsea confirmed this is true.
Good to know!
🔥 Great work and reviews here. I'm going to leave final review to @rxu17
@danlu1 Not sure if you see this on your end, but GitHub will hide PR comments (if the thread becomes too long) on the main PR page, versus when you go into the Files Changed tab.
@rxu17 Thanks for catching this. Most of the comments have been addressed. However, a few comments are hidden and I'm working on them.
LGTM!


Problem:
Currently, schema updates in the BPC tables are being applied directly to production whenever changes are required. This approach introduces unnecessary risk, as any issues with the updates—such as mismatched schemas, dropped columns, or data integrity problems—immediately impact production data.
In addition, the pipeline scripts need to be updated to reflect recent changes:
Solution:
`synapse_login` functions.
Testing: