Conversation


@danlu1 danlu1 commented Aug 21, 2025

Problem:

Currently, schema updates to the BPC tables are applied directly to production whenever changes are required. This approach introduces unnecessary risk: any issue with an update (a mismatched schema, a dropped column, a data integrity problem) immediately impacts production data.

In addition, the pipeline scripts need to be updated to reflect recent changes:

  1. Remove all Synapse configuration references, since credentials are now passed through the SYNAPSE_AUTH_TOKEN environment variable.
  2. Introduce a table version to indicate which data dictionary was used when updating the data element catalog table.
  3. Comment out all IRR-related sections, as we no longer intend to update the IRR table schema.
  4. Fix the logic in update_table_schema.py for selecting the target table to which new columns are appended, since it does not currently align with the intended usage.
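For context on point 1, login no longer reads a Synapse config file; the token comes from the environment. A minimal sketch of that pattern (the helper name is hypothetical, and the actual synapseclient call is only indicated in a comment):

```python
import os

# Hypothetical sketch: read the Synapse personal access token from the
# SYNAPSE_AUTH_TOKEN environment variable, failing fast if it is missing.
def get_auth_token() -> str:
    token = os.environ.get("SYNAPSE_AUTH_TOKEN")
    if not token:
        raise RuntimeError("SYNAPSE_AUTH_TOKEN is not set")
    return token

os.environ["SYNAPSE_AUTH_TOKEN"] = "example-token"  # demo value only
token = get_auth_token()
# a real script would then call syn.login(authToken=token) via synapseclient
print(token)
```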

Solution:

  1. Updated the synapse_login functions in the utilities module and updated the corresponding calls in each impacted script.
  2. Updated update_data_element_catalog.py, update_table_schema.py, and update_data_table.py to adopt the new synapse_login functions.
  3. Tweaked functions and logic to address the problems described in the section above.
  4. Added a staging mode to update_table_schema.py and update_data_element_catalog.py.
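A staging mode like the one in point 4 is typically just a command-line switch; a sketch under assumed flag names (not necessarily the ones the scripts use):

```python
import argparse

# Hypothetical sketch of a staging/dry-run switch for the update scripts.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Update BPC table schemas")
    parser.add_argument("--staging", action="store_true",
                        help="write to staging tables instead of production")
    parser.add_argument("--dry-run", action="store_true",
                        help="log intended changes without writing anything")
    return parser

args = build_parser().parse_args(["--staging"])
print(args.staging, args.dry_run)
```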

Testing:

  1. Successfully finished testing update_data_element_catalog.py.
  2. Successfully finished testing update_table_schema.py.

@danlu1 danlu1 requested a review from a team as a code owner August 21, 2025 20:25
@danlu1 danlu1 marked this pull request as draft August 21, 2025 20:25

Future consideration, for scripts that haven't been linted before. I suggest having a PR to lint, then branch off of that PR so that we can easily review any critical changes that are unrelated to linting changes.


@thomasyu888 thomasyu888 left a comment


🔥 Great work here. I didn't look super closely at the logic, but it is much better to have non-production runs update just the staging tables, so the run more closely resembles what actually happens in production.

@rxu17 rxu17 self-requested a review August 26, 2025 05:22
@danlu1 danlu1 marked this pull request as ready for review August 26, 2025 20:45
checkbox_index = df_choices.index[df_choices.type == "checkbox"]
df.loc[non_checkbox_index,'synColSize'] = df_choices.loc[non_checkbox_index,'max_len']
# for non-checkbox type, update synColSize only since no new columns will be added
if len(non_checkbox_index) > 0:
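The partition the snippet above relies on can be exercised standalone; column names follow the snippet, but the data is made up:

```python
import pandas as pd

# Toy stand-ins for the real data dictionary frames
df_choices = pd.DataFrame({
    "type": ["checkbox", "text", "text"],
    "max_len": [12, 50, 8],
})
df = pd.DataFrame({"synColSize": [0, 0, 0]})

# split rows by variable type using boolean indexing
checkbox_index = df_choices.index[df_choices.type == "checkbox"]
non_checkbox_index = df_choices.index[df_choices.type != "checkbox"]

# non-checkbox variables only need a column-size update; no new columns are added
if len(non_checkbox_index) > 0:
    df.loc[non_checkbox_index, "synColSize"] = df_choices.loc[non_checkbox_index, "max_len"]

print(df["synColSize"].tolist())  # the two non-checkbox rows pick up max_len
```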

Added conditions to update the non-checkbox column sizes so they align with the checkbox columns.


rxu17 commented Aug 26, 2025

@danlu1 One of the acceptance criteria for this ticket was to document what the scripts are doing. Did you apply this in the way of updating the boilerplate, docstrings of the scripts and the table_updates README to contain that documentation or is it meant to be elsewhere?

vars_to_update["synColSize"] = vars_to_update["max_len"]

# Update numCols and colLabels for checkbox type if any changes
vars_checkbox_update = vars_with_choices.query(

Simplified this section.

vars_checkbox_update = vars_with_choices.query(
'type == "checkbox" and ((choices_num > numCols) | (numCols.isna()))'
)
vars_checkbox_update["synColSize"] = vars_checkbox_update["max_len"]

Also updated synColSize for checkbox variables for later subsetting.

vars_checkbox_update["numCols"] = vars_checkbox_update["choices_num"]
vars_checkbox_update["colLabels"] = vars_checkbox_update["choices_key"]
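The query plus the follow-up assignments above can be reproduced standalone with boolean masks (equivalent to `DataFrame.query`); the data is made up, but the column names follow the snippet:

```python
import pandas as pd

# Made-up data dictionary rows
vars_with_choices = pd.DataFrame({
    "variable": ["v1", "v2", "v3"],
    "type": ["checkbox", "checkbox", "text"],
    "choices_num": [4.0, 2.0, None],
    "numCols": [2.0, 3.0, None],
    "max_len": [10, 8, 5],
    "choices_key": ["a|b|c|d", "a|b", None],
})

# checkbox variables whose stored column count is stale or missing
mask = (vars_with_choices["type"] == "checkbox") & (
    (vars_with_choices["choices_num"] > vars_with_choices["numCols"])
    | vars_with_choices["numCols"].isna()
)
vars_checkbox_update = vars_with_choices[mask].copy()
vars_checkbox_update["synColSize"] = vars_checkbox_update["max_len"]
vars_checkbox_update["numCols"] = vars_checkbox_update["choices_num"]
vars_checkbox_update["colLabels"] = vars_checkbox_update["choices_key"]
print(vars_checkbox_update["variable"].tolist())  # only v1 needs more columns
```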

# Combine variables for update, avoiding duplicates

Changed the logic for combining checkbox and non-checkbox variables so it is more straightforward.

# update: variable, synColSize, numCols
# add: variable, instrument, dataType='curated', type, label, cohort-dd, synColType, synColSize, numCols
# update: variable, synColSize, numCols
if not dry_run:
@danlu1 danlu1 Aug 26, 2025


Updated this section to adopt the new table-manipulation methods described in the synapseclient tutorial.
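The dry_run gate in the fragment above can be sketched end to end; the helper name and messages here are hypothetical, and the real Synapse write is only indicated in a comment:

```python
import pandas as pd

# Hypothetical dry-run gate for table writes
def apply_updates(updates: pd.DataFrame, dry_run: bool = True) -> str:
    if dry_run:
        # report what would happen without touching any Synapse table
        return f"dry run: would update {len(updates)} rows"
    # a real run would call the synapseclient table-update API here
    return f"updated {len(updates)} rows"

print(apply_updates(pd.DataFrame({"variable": ["v1", "v2"]})))
```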

)
# TODO: variables to update; waiting for stats team
vars_to_update_df = ""
vars_to_update_df = pandas.DataFrame()

Set this to an empty DataFrame to resolve the type issue.


danlu1 commented Aug 26, 2025

> @danlu1 One of the acceptance criteria for this ticket was to document what the scripts are doing. Did you apply this in the way of updating the boilerplate, docstrings of the scripts and the table_updates README to contain that documentation or is it meant to be elsewhere?

@rxu17 To achieve this, I extended the docstrings, the in-code comments, and the README. I'm also working on a Confluence page and am about halfway through; I'll tag you when it's ready.

return pandas.DataFrame(temp_df_list)


def _get_latest_table(form) -> str:

This function was added to extract the latest table for a form, i.e. the table to be modified if any new columns need to be added.

tbl_with_least_cols = current_cols_df['table_id'].value_counts()
tbl_with_least_cols_id = tbl_with_least_cols.idxmin()
tbl_with_least_cols_ct = tbl_with_least_cols.min()
# get the table id for the newest table

Modified to get the newest table instead.

syn, table_id=TABLE_INFO["bpc"][0], condition=TABLE_INFO["bpc"][1]
)
bpc_table_view = bpc_table_view[["id", "name"]]
## comment out irr for now in case it is needed for phase 3

Commented out for now because we are not sure whether we will need to update the IRR tables in phase 3.


@rxu17 rxu17 left a comment


Thanks again for doing this! A lot of hefty and complex code that you had to trudge through and modify/test! Left comments.

Get the synapse id of the latest table for each form
Args:
form (tuple): the (form, tables) tuple produced by the pandas.DataFrame.groupby function, e.g. master_table_view.groupby("form").
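For reference, `groupby` yields (key, sub-frame) tuples, which is what this docstring describes; a toy illustration with made-up data:

```python
import pandas as pd

# Toy master_table_view; the real one comes from a Synapse table view
master_table_view = pd.DataFrame({
    "form": ["labs", "pathology", "pathology"],
    "id": ["syn3", "syn1", "syn2"],
    "name": ["Labs Part 1", "Prissmm Pathology Part 1", "Prissmm Pathology Part 2"],
})

for form in master_table_view.groupby("form"):
    # form[0] is the form name; form[1] is the sub-DataFrame of its tables
    print(form[0], form[1]["id"].tolist())
```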

So this form tuple could have a variable number of tables in it, based on this description?


It looks like this:
[screenshot of the groupby output omitted]

@rxu17 rxu17 Sep 3, 2025


Hmm could you put this example in the docstring or confluence doc (and maybe link to it)? It seems like a more complex data struct.


I will put this in the confluence page.

return form[1]["id"].values[0]
else:
# get the table with latest numeric suffix
latest_part_number = max(form[1]["name"].str.extract(r"(\d+)$")[0].astype(int))

I'm not too familiar with this. How exactly does this get the latest table? What is latest_part_number? Is it a Synapse table attribute?


This is the number in the table name. For example, for the Prissmm Pathology tables it gives 6, because Prissmm Pathology Part 6 has the largest part number.
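That extraction can be reproduced standalone; the table names are made up but follow the "Part N" pattern discussed here:

```python
import pandas as pd

# Example table names with a trailing "Part N" suffix
names = pd.Series([
    "Prissmm Pathology Part 1",
    "Prissmm Pathology Part 5",
    "Prissmm Pathology Part 6",
])

# pull the trailing number out of each name, then take the largest
latest_part_number = names.str.extract(r"(\d+)$")[0].astype(int).max()
print(latest_part_number)  # 6
```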


I added the example in the code comment.


Right, so you're saying Prissmm Pathology Part 6 is the latest table over Prissmm Pathology Part 5?

@danlu1 danlu1 Sep 3, 2025


Logically speaking yes. I confirmed with Chelsea here. "Simply tell from the table name, I assume we will only save new columns and rows to a newer table (part 3) if older tables (part 1 and 2) are full (either due to row or column number restrictions)." and Chelsea confirmed "True"


Good to know!


@thomasyu888 thomasyu888 left a comment


🔥 Great work and reviews here. I'm going to leave final review to @rxu17

@danlu1 danlu1 requested a review from rxu17 September 3, 2025 17:41

rxu17 commented Sep 3, 2025

@danlu1 Not sure if you see this on your end, but GitHub will "hide" PR comments (if the thread becomes too long) when you look at the main PR page, versus when you go into the Files Changed tab.

[screenshot of the hidden-comments notice omitted]


danlu1 commented Sep 3, 2025

@rxu17 Thanks for catching this. Most of the comments have been addressed; however, a few are hidden and I'm still working on them.
The majority of the remaining comments concern updating the data element catalog table with the SOR. Since we deprecated this subcommand, I didn't polish it comprehensively but just debugged a data loading issue. Do you still think we should address them?


@rxu17 rxu17 left a comment


LGTM!

@danlu1 danlu1 merged commit 835ac6d into develop Sep 5, 2025
5 checks passed
@rxu17 rxu17 deleted the gen-2025-table-schema-update-staging-mode branch September 5, 2025 22:03