[SYNPY-1700] Updating recordset to include the validation_file_handle_id
#1283
Conversation
## Step 4: Work with metadata and validate (Record-based workflow)

After creating a record-based metadata task, collaborators can enter metadata through the Grid interface. Once metadata entry is complete, you'll want to validate the data against your schema and identify any issues.

### The metadata curation workflow
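As a rough illustration of that validation step (a sketch only; the synchronous method names are assumed from the _async variants discussed later in this PR, the import path is assumed from the client's model layout, and the ID is a placeholder):

from synapseclient import Synapse
from synapseclient.models import Grid, RecordSet  # import path assumed

syn = Synapse()
syn.login()

# Placeholder ID; assumes a JSON schema is already bound to the RecordSet
record_set = RecordSet(id="syn123").get()

# Export the grid contents back to the RecordSet to generate validation results
grid = Grid(record_set_id=record_set.id).create()
grid.export_to_record_set()
grid.delete()

# Re-fetch so the validation results are reflected, then review them
record_set = record_set.get()
results_df = record_set.get_detailed_validation_results(download_location=".")
print(results_df[results_df["is_valid"] == False])  # noqa: E712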
I created https://sagebionetworks.jira.com/browse/SYNPY-1712 so we can revisit this and make doing this work a bit easier for folks
 def create(
     self,
-    attach_to_previous_session=True,
+    attach_to_previous_session=False,
I made this change because if you have a previous session open and are not aware that this defaults to True, it could drop you into a session you did not intend to use and overwrite or delete data in the RecordSet.
I see, so the default behavior will always be to create a new session? Which means data from the previously open working session won't be included?
In this case that is correct, unless they change that boolean to True.
Users may also use the list method to get a list of their active grid sessions and select the correct one from there.
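For example, something along these lines could work (a rough sketch only; the exact name and signature of the list call are assumptions based on the comment above, not a confirmed API):

from synapseclient import Synapse
from synapseclient.models import Grid  # import path assumed

syn = Synapse()
syn.login()

# Hypothetical: enumerate your active grid sessions and pick the one you want.
# The "list" classmethod and the attributes printed here are assumptions.
for session in Grid.list(synapse_client=syn):
    print(session.id, session.record_set_id)

# A specific session could then be retrieved by its ID instead of relying on the default.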
Hmm. I actually think the backend is going to change so that the default is one session, and multiple people can connect and iterate on that one session.
We can figure it out then.
This is fantastic
thomasyu888 left a comment:
🔥 Thanks for the great work here. I tagged @aditigopalan for a review as well, but I'll defer a final review to another person on the team.
…ple modules to ensure pandas is installed where needed
linglp left a comment:
Hi @BryanFauble and @thomasyu888! The code looks good to me, but I am confused by what I saw when I tested out the functionality. I started by creating a record set, a task, and a grid using a test data schema:
record_set, task, grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn71816385",
    folder_id="syn71816388",
    record_set_name="BiospecimenMetadata_RecordSet",
    record_set_description="RecordSet for biospecimen metadata curation",
    curation_task_name="BiospecimenMetadataTemplate",
    upsert_keys=["specimenID"],
    instructions="Please curate this metadata according to the schema requirements",
    schema_uri="sage.schemas.v2571-nf.BiospecimenTemplate.schema-9.14.0",
)
I entered some data in the grid on the UI, but when I tried to download the validation results:
import asyncio

from synapseclient import Synapse
from synapseclient.models import Grid, RecordSet  # imports added for completeness; path assumed from the client's model layout


async def main():
    syn = Synapse()
    syn.login()

    # Assuming you have a RecordSet with a bound schema
    record_set = await RecordSet(id="syn71816405").get_async()

    # Create and export a Grid session to generate validation results
    grid = await Grid(record_set_id=record_set.id).create_async()
    await grid.export_to_record_set_async()
    await grid.delete_async()

    # Re-fetch the RecordSet to get the updated validation_file_handle_id
    record_set = await record_set.get_async()

    # Get the detailed validation results
    results_df = await record_set.get_detailed_validation_results_async(
        download_location="/Users/lpeng/code/synapsePythonClient"
    )
    print("result df", results_df)

    # Analyze the results
    print(f"Total rows: {len(results_df)}")
    print(f"Columns: {results_df.columns.tolist()}")

    # Filter for valid and invalid rows
    # Note: is_valid is boolean (True/False) for validated rows
    valid_rows = results_df[results_df["is_valid"] == True]  # noqa: E712
    invalid_rows = results_df[results_df["is_valid"] == False]  # noqa: E712
    print(f"Valid rows: {len(valid_rows)}")
    print(f"Invalid rows: {len(invalid_rows)}")

    # View invalid rows with their error messages
    if len(invalid_rows) > 0:
        print(invalid_rows[["row_index", "validation_error_message"]])


asyncio.run(main())
Without modifying anything in the grid, I got different validation results:
The first time running the code:
"row_index","is_valid","validation_error_message","all_validation_messages"
"0","true",,
The second time running the code:
"row_index","is_valid","validation_error_message","all_validation_messages"
"0",,,
Any ideas why this is happening?
syn.login()

# Get your RecordSet (must have a schema bound)
record_set = RecordSet(id="syn987654321").get()
Would it be helpful for this ID to link to an actual RecordSet? Right now it leads to a Page Unavailable message.
That is intentional. None of the examples across the project lead to real Synapse entities or use real IDs.
else:
    # At runtime, use object as a placeholder
    DataFrame = object
    Series = object
    np = object  # type: ignore[misc, assignment]
    nx = object  # type: ignore[misc, assignment]
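For context, here is a minimal sketch of the TYPE_CHECKING idiom that the else branch above is part of; the alias names mirror the snippet, though the exact contents of typing_utils may differ:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for static type checking; never executed at runtime,
    # so the optional dependencies are not required just to import the module.
    import networkx as nx
    import numpy as np
    from pandas import DataFrame, Series
else:
    # At runtime, use object as a placeholder
    DataFrame = object
    Series = object
    np = object  # type: ignore[misc, assignment]
    nx = object  # type: ignore[misc, assignment]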
Would we want to display a message in the case of ImportErrors, or when not type checking, to notify users that the specific types weren't available?
In several places we are using this function:
synapsePythonClient/synapseclient/core/utils.py
Lines 1553 to 1569 in 79ad99f
def test_import_pandas() -> None:
    """This function is called within other functions and methods to ensure that pandas is installed."""
    try:
        import pandas as pd  # noqa F401
    # used to catch when pandas isn't installed
    except ModuleNotFoundError:
        raise ModuleNotFoundError(
            """\n\nThe pandas package is required for this function!\n
        Most functions in the synapseclient package don't require the
        installation of pandas, but some do. Please refer to the installation
        instructions at: http://pandas.pydata.org/ or
        https://python-docs.synapse.org/tutorials/installation/#installation-guide-for-pypi-users.
        \n\n\n"""
        )
    # catch other errors (see SYNPY-177)
    except:  # noqa
        raise
That function accomplishes what you are mentioning: telling folks that they need to install the package. I don't think this typing_utils module is the appropriate place to put the message, because I don't want the message to print unless they actually try to call a function/method where the import is required.
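For illustration, this is how such a guard is typically used at the point of call (the load_frame helper below is hypothetical):

from synapseclient.core.utils import test_import_pandas


def load_frame(csv_path: str):
    """Hypothetical helper: fail fast with a helpful message if pandas is missing."""
    test_import_pandas()
    import pandas as pd

    return pd.read_csv(csv_path)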
I found that there are eventually-consistent race conditions that we have to contend with here. In the tests, what I found to consistently work was to place sleep statements in between the steps, like in this test: synapsePythonClient/tests/integration/synapseclient/models/synchronous/test_recordset.py, Lines 489 to 537 in 79ad99f.
So if I were to modify the script that you provided, I would add a sleep statement to it. Try this out and let me know:
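A sketch of what adding sleeps to that script could look like (the placement and durations below are assumptions, not the exact snippet from the comment; import path assumed as above):

import asyncio

from synapseclient import Synapse
from synapseclient.models import Grid, RecordSet  # import path assumed


async def main():
    syn = Synapse()
    syn.login()

    record_set = await RecordSet(id="syn71816405").get_async()

    grid = await Grid(record_set_id=record_set.id).create_async()
    await grid.export_to_record_set_async()
    # Give the backend time to finish the export before tearing down the session
    await asyncio.sleep(10)
    await grid.delete_async()

    # Wait again so the updated validation_file_handle_id is visible on re-fetch
    await asyncio.sleep(10)
    record_set = await record_set.get_async()

    results_df = await record_set.get_detailed_validation_results_async(
        download_location="/Users/lpeng/code/synapsePythonClient"
    )
    print(results_df)


asyncio.run(main())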
@BryanFauble Thanks for the suggestion. I tested it, and now the results are consistent!
linglp left a comment:
LGTM! Thanks for the hard work!
Problem:

- There is a validation_file_handle_id field for the RecordSet that points to a validation CSV file denoting the specific issues with the validation rules; the client model did not yet expose it (see the sketch after this summary).
- The TypeVar strings that we were using before were not actually rendering out to Pandas types.

Solution:

Testing:
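For illustration, a minimal sketch of reading the new field (the field name follows this PR's additions; the entity ID is a placeholder and the import path is assumed):

from synapseclient import Synapse
from synapseclient.models import RecordSet  # import path assumed from this PR's model layout

syn = Synapse()
syn.login()

# Placeholder ID; assumes validation results have already been generated for this RecordSet
record_set = RecordSet(id="syn123").get()

# New field exposed by this PR: the file handle ID of the validation CSV
print(record_set.validation_file_handle_id)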