Conversation


@danlu1 danlu1 commented Aug 21, 2025

Problem:

Currently, schema updates to the BPC tables are applied directly to production whenever changes are required. This approach introduces unnecessary risk: any issue with an update (a mismatched schema, a dropped column, a data integrity problem) immediately impacts production data.

In addition, the pipeline scripts need to be updated to reflect recent changes:

  1. Remove all Synapse configuration references, since credentials are now passed through the SYNAPSE_AUTH_TOKEN environment variable.
  2. Introduce a table version to indicate which data dictionary was used when updating the data element catalog table.
  3. Comment out all IRR-related sections, as we no longer intend to update the IRR table schema.
  4. Fix the logic in update_table_schema.py for selecting the target table to which new columns are appended, since it does not currently align with the intended usage.
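For context on point 1, login no longer reads a Synapse config file; the token comes from the environment. A minimal sketch of that pattern (the helper name is hypothetical, and the actual synapseclient call is only indicated in a comment):

```python
import os

# Hypothetical sketch: read the Synapse personal access token from the
# SYNAPSE_AUTH_TOKEN environment variable, failing fast if it is missing.
def get_auth_token() -> str:
    token = os.environ.get("SYNAPSE_AUTH_TOKEN")
    if not token:
        raise RuntimeError("SYNAPSE_AUTH_TOKEN is not set")
    return token

os.environ["SYNAPSE_AUTH_TOKEN"] = "example-token"  # demo value only
token = get_auth_token()
# a real script would then call syn.login(authToken=token) via synapseclient
print(token)
```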

Solution:

  1. Updated the synapse_login functions in the utilities module and updated the corresponding calls in each impacted script.
  2. Updated update_data_element_catalog.py, update_table_schema.py, and update_data_table.py to adopt the new synapse_login functions.
  3. Tweaked functions and logic to address the problems described in the section above.
  4. Added a staging mode to update_table_schema.py and update_data_element_catalog.py.
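A staging mode like the one in point 4 is typically just a command-line switch; a sketch under assumed flag names (not necessarily the ones the scripts use):

```python
import argparse

# Hypothetical sketch of a staging/dry-run switch for the update scripts.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Update BPC table schemas")
    parser.add_argument("--staging", action="store_true",
                        help="write to staging tables instead of production")
    parser.add_argument("--dry-run", action="store_true",
                        help="log intended changes without writing anything")
    return parser

args = build_parser().parse_args(["--staging"])
print(args.staging, args.dry_run)
```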

Testing:

  1. Successfully finished testing update_data_element_catalog.py.
  2. Successfully finished testing update_table_schema.py.

@danlu1 danlu1 requested a review from a team as a code owner August 21, 2025 20:25
@danlu1 danlu1 marked this pull request as draft August 21, 2025 20:25

Future consideration, for scripts that haven't been linted before. I suggest having a PR to lint, then branch off of that PR so that we can easily review any critical changes that are unrelated to linting changes.


@thomasyu888 thomasyu888 left a comment


🔥 Great work here. I didn't look super closely at the logic, but it is much better to have non-production runs update just the staging tables, so the run more closely resembles what actually happens in production.

@rxu17 rxu17 self-requested a review August 26, 2025 05:22
@danlu1 danlu1 marked this pull request as ready for review August 26, 2025 20:45
checkbox_index = df_choices.index[df_choices.type == "checkbox"]
df.loc[non_checkbox_index,'synColSize'] = df_choices.loc[non_checkbox_index,'max_len']
# for non-checkbox type, update synColSize only since no new columns will be added
if len(non_checkbox_index) > 0:
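The partition the snippet above relies on can be exercised standalone; column names follow the snippet, but the data is made up:

```python
import pandas as pd

# Toy stand-ins for the real data dictionary frames
df_choices = pd.DataFrame({
    "type": ["checkbox", "text", "text"],
    "max_len": [12, 50, 8],
})
df = pd.DataFrame({"synColSize": [0, 0, 0]})

# split rows by variable type using boolean indexing
checkbox_index = df_choices.index[df_choices.type == "checkbox"]
non_checkbox_index = df_choices.index[df_choices.type != "checkbox"]

# non-checkbox variables only need a column-size update; no new columns are added
if len(non_checkbox_index) > 0:
    df.loc[non_checkbox_index, "synColSize"] = df_choices.loc[non_checkbox_index, "max_len"]

print(df["synColSize"].tolist())  # the two non-checkbox rows pick up max_len
```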

Added conditions to update the non-checkbox column sizes so they align with the checkbox columns.


rxu17 commented Aug 26, 2025

@danlu1 One of the acceptance criteria for this ticket was to document what the scripts are doing. Did you apply this in the way of updating the boilerplate, docstrings of the scripts and the table_updates README to contain that documentation or is it meant to be elsewhere?

vars_to_update["synColSize"] = vars_to_update["max_len"]

# Update numCols and colLabels for checkbox type if any changes
vars_checkbox_update = vars_with_choices.query(

Simplified this section.

vars_checkbox_update = vars_with_choices.query(
'type == "checkbox" and ((choices_num > numCols) | (numCols.isna()))'
)
vars_checkbox_update["synColSize"] = vars_checkbox_update["max_len"]

Also updated synColSize for checkbox variables for later subsetting.

vars_checkbox_update["numCols"] = vars_checkbox_update["choices_num"]
vars_checkbox_update["colLabels"] = vars_checkbox_update["choices_key"]
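The query plus the follow-up assignments above can be reproduced standalone with boolean masks (equivalent to `DataFrame.query`); the data is made up, but the column names follow the snippet:

```python
import pandas as pd

# Made-up data dictionary rows
vars_with_choices = pd.DataFrame({
    "variable": ["v1", "v2", "v3"],
    "type": ["checkbox", "checkbox", "text"],
    "choices_num": [4.0, 2.0, None],
    "numCols": [2.0, 3.0, None],
    "max_len": [10, 8, 5],
    "choices_key": ["a|b|c|d", "a|b", None],
})

# checkbox variables whose stored column count is stale or missing
mask = (vars_with_choices["type"] == "checkbox") & (
    (vars_with_choices["choices_num"] > vars_with_choices["numCols"])
    | vars_with_choices["numCols"].isna()
)
vars_checkbox_update = vars_with_choices[mask].copy()
vars_checkbox_update["synColSize"] = vars_checkbox_update["max_len"]
vars_checkbox_update["numCols"] = vars_checkbox_update["choices_num"]
vars_checkbox_update["colLabels"] = vars_checkbox_update["choices_key"]
print(vars_checkbox_update["variable"].tolist())  # only v1 needs more columns
```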

# Combine variables for update, avoiding duplicates

Changed the logic for combining checkbox and non-checkbox variables so it is more straightforward.

# update: variable, synColSize, numCols
# add: variable, instrument, dataType='curated', type, label, cohort-dd, synColType, synColSize, numCols
# update: variable, synColSize, numCols
if not dry_run:
@danlu1 danlu1 Aug 26, 2025


Updated this section to adopt the new table-manipulation methods described in the synapseclient tutorial.
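The dry_run gate in the fragment above can be sketched end to end; the helper name and messages here are hypothetical, and the real Synapse write is only indicated in a comment:

```python
import pandas as pd

# Hypothetical dry-run gate for table writes
def apply_updates(updates: pd.DataFrame, dry_run: bool = True) -> str:
    if dry_run:
        # report what would happen without touching any Synapse table
        return f"dry run: would update {len(updates)} rows"
    # a real run would call the synapseclient table-update API here
    return f"updated {len(updates)} rows"

print(apply_updates(pd.DataFrame({"variable": ["v1", "v2"]})))
```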

)
# TODO: variables to update; waiting for stats team
vars_to_update_df = ""
vars_to_update_df = pandas.DataFrame()

Set this to an empty DataFrame to resolve the type issue.


danlu1 commented Aug 26, 2025

> @danlu1 One of the acceptance criteria for this ticket was to document what the scripts are doing. Did you apply this in the way of updating the boilerplate, docstrings of the scripts and the table_updates README to contain that documentation or is it meant to be elsewhere?

@rxu17 To achieve this, I extended the docstrings, the in-code comments, and the README. I'm also working on a Confluence page and am about halfway through; I'll tag you when it's ready.

return pandas.DataFrame(temp_df_list)


def _get_latest_table(form) -> str:

This function was added to extract the latest table for a form, i.e. the table to be modified if any new columns need to be added.

tbl_with_least_cols = current_cols_df['table_id'].value_counts()
tbl_with_least_cols_id = tbl_with_least_cols.idxmin()
tbl_with_least_cols_ct = tbl_with_least_cols.min()
# get the table id for the newest table

Modified to get the newest table instead.

syn, table_id=TABLE_INFO["bpc"][0], condition=TABLE_INFO["bpc"][1]
)
bpc_table_view = bpc_table_view[["id", "name"]]
## comment out irr for now in case it is needed for phase 3

Commented out for now because we are not sure whether we will need to update the IRR tables in phase 3.


@rxu17 rxu17 left a comment


Thanks again for doing this! A lot of hefty and complex code that you had to trudge through and modify/test! Left comments.

Get the synapse id of the latest table for each form
Args:
form (tuple): the (form, tables) tuple produced by the pandas.DataFrame.groupby function, e.g. master_table_view.groupby("form").
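For reference, `groupby` yields (key, sub-frame) tuples, which is what this docstring describes; a toy illustration with made-up data:

```python
import pandas as pd

# Toy master_table_view; the real one comes from a Synapse table view
master_table_view = pd.DataFrame({
    "form": ["labs", "pathology", "pathology"],
    "id": ["syn3", "syn1", "syn2"],
    "name": ["Labs Part 1", "Prissmm Pathology Part 1", "Prissmm Pathology Part 2"],
})

for form in master_table_view.groupby("form"):
    # form[0] is the form name; form[1] is the sub-DataFrame of its tables
    print(form[0], form[1]["id"].tolist())
```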

So this form tuple could have a variable number of tables in it, based on this description?


It looks like this:
[screenshot of the groupby output omitted]

@rxu17 rxu17 Sep 3, 2025


Hmm could you put this example in the docstring or confluence doc (and maybe link to it)? It seems like a more complex data struct.


I will put this in the confluence page.

return form[1]["id"].values[0]
else:
# get the table with latest numeric suffix
latest_part_number = max(form[1]["name"].str.extract(r"(\d+)$")[0].astype(int))

I'm not too familiar with this. How exactly does this get the latest table? What is latest_part_number? Is it a Synapse table attribute?


This is the number in the table name. For example, for the Prissmm Pathology tables it gives 6, because Prissmm Pathology Part 6 has the largest part number.
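That extraction can be reproduced standalone; the table names are made up but follow the "Part N" pattern discussed here:

```python
import pandas as pd

# Example table names with a trailing "Part N" suffix
names = pd.Series([
    "Prissmm Pathology Part 1",
    "Prissmm Pathology Part 5",
    "Prissmm Pathology Part 6",
])

# pull the trailing number out of each name, then take the largest
latest_part_number = names.str.extract(r"(\d+)$")[0].astype(int).max()
print(latest_part_number)  # 6
```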


I added the example in the code comment.


Right, so you're saying Prissmm Pathology Part 6 is the latest table over Prissmm Pathology Part 5?

@danlu1 danlu1 Sep 3, 2025


Logically speaking yes. I confirmed with Chelsea here. "Simply tell from the table name, I assume we will only save new columns and rows to a newer table (part 3) if older tables (part 1 and 2) are full (either due to row or column number restrictions)." and Chelsea confirmed "True"


Good to know!


@thomasyu888 thomasyu888 left a comment


🔥 Great work and reviews here. I'm going to leave final review to @rxu17

@danlu1 danlu1 requested a review from rxu17 September 3, 2025 17:41

rxu17 commented Sep 3, 2025

@danlu1 Not sure if you see this on your end, but GitHub will "hide" PR comments (if the thread becomes too long) when you look at the main PR page, versus when you go into the Files Changed tab.

[screenshot of the hidden-comments notice omitted]


danlu1 commented Sep 3, 2025

@rxu17 Thanks for catching this. Most of the comments have been addressed; however, a few are hidden and I'm still working on them.
The majority of the remaining comments concern updating the data element catalog table with the SOR. Since we deprecated this subcommand, I didn't polish it comprehensively but just debugged a data loading issue. Do you still think we should address them?


@rxu17 rxu17 left a comment


LGTM!

@danlu1 danlu1 merged commit 835ac6d into develop Sep 5, 2025
5 checks passed
@rxu17 rxu17 deleted the gen-2025-table-schema-update-staging-mode branch September 5, 2025 22:03