[SYNPY-1700] Updating recordset to include the validation_file_handle_id
#1283
Conversation
## Step 4: Work with metadata and validate (Record-based workflow)

After creating a record-based metadata task, collaborators can enter metadata through the Grid interface. Once metadata entry is complete, you'll want to validate the data against your schema and identify any issues.

### The metadata curation workflow
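As a rough illustration of that validation step (a sketch only; the synchronous method names are assumed from the _async variants discussed later in this PR, the import path is assumed from the client's model layout, and the ID is a placeholder):

from synapseclient import Synapse
from synapseclient.models import Grid, RecordSet  # import path assumed

syn = Synapse()
syn.login()

# Placeholder ID; assumes a JSON schema is already bound to the RecordSet
record_set = RecordSet(id="syn123").get()

# Export the grid contents back to the RecordSet to generate validation results
grid = Grid(record_set_id=record_set.id).create()
grid.export_to_record_set()
grid.delete()

# Re-fetch so the validation results are reflected, then review them
record_set = record_set.get()
results_df = record_set.get_detailed_validation_results(download_location=".")
print(results_df[results_df["is_valid"] == False])  # noqa: E712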
I created https://sagebionetworks.jira.com/browse/SYNPY-1712 so we can revisit this and make doing this work a bit easier for folks
 def create(
     self,
-    attach_to_previous_session=True,
+    attach_to_previous_session=False,
I made this change because if you have a previous session open and are not aware that this defaults to True, it could drop you into a session you did not intend to use and overwrite or delete data in the RecordSet.
I see, so the default behavior will always be to create a new session? Which means data from the previously open working session won't be included?
In this case that is correct, unless they change that boolean to True.
Users may also use the list method to get a list of their active grid sessions and select the correct one from there.
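For example, something along these lines could work (a rough sketch only; the exact name and signature of the list call are assumptions based on the comment above, not a confirmed API):

from synapseclient import Synapse
from synapseclient.models import Grid  # import path assumed

syn = Synapse()
syn.login()

# Hypothetical: enumerate your active grid sessions and pick the one you want.
# The "list" classmethod and the attributes printed here are assumptions.
for session in Grid.list(synapse_client=syn):
    print(session.id, session.record_set_id)

# A specific session could then be retrieved by its ID instead of relying on the default.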
Hmm. I actually think the backend is going to change so that the default is one session, and multiple people can connect and iterate on that one session.
We can figure it out then.
This is fantastic
thomasyu888 left a comment:
🔥 Thanks for the great work here. I tagged @aditigopalan for a review as well, but I'll defer a final review to another person on the team.
…ple modules to ensure pandas is installed where needed
linglp left a comment:
Hi @BryanFauble and @thomasyu888! The code looks good to me, but I am confused by what I saw when I tested out the functionality. I started by creating a record set, a task, and a grid using a test data schema:
record_set, task, grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn71816385",
    folder_id="syn71816388",
    record_set_name="BiospecimenMetadata_RecordSet",
    record_set_description="RecordSet for biospecimen metadata curation",
    curation_task_name="BiospecimenMetadataTemplate",
    upsert_keys=["specimenID"],
    instructions="Please curate this metadata according to the schema requirements",
    schema_uri="sage.schemas.v2571-nf.BiospecimenTemplate.schema-9.14.0",
)
I entered some data in the grid on the UI, but when I tried to download the validation results:
import asyncio

from synapseclient import Synapse
from synapseclient.models import Grid, RecordSet  # imports added for completeness; path assumed from the client's model layout


async def main():
    syn = Synapse()
    syn.login()

    # Assuming you have a RecordSet with a bound schema
    record_set = await RecordSet(id="syn71816405").get_async()

    # Create and export a Grid session to generate validation results
    grid = await Grid(record_set_id=record_set.id).create_async()
    await grid.export_to_record_set_async()
    await grid.delete_async()

    # Re-fetch the RecordSet to get the updated validation_file_handle_id
    record_set = await record_set.get_async()

    # Get the detailed validation results
    results_df = await record_set.get_detailed_validation_results_async(
        download_location="/Users/lpeng/code/synapsePythonClient"
    )
    print("result df", results_df)

    # Analyze the results
    print(f"Total rows: {len(results_df)}")
    print(f"Columns: {results_df.columns.tolist()}")

    # Filter for valid and invalid rows
    # Note: is_valid is boolean (True/False) for validated rows
    valid_rows = results_df[results_df["is_valid"] == True]  # noqa: E712
    invalid_rows = results_df[results_df["is_valid"] == False]  # noqa: E712
    print(f"Valid rows: {len(valid_rows)}")
    print(f"Invalid rows: {len(invalid_rows)}")

    # View invalid rows with their error messages
    if len(invalid_rows) > 0:
        print(invalid_rows[["row_index", "validation_error_message"]])


asyncio.run(main())
Without modifying anything in the grid, I got different validation results:
The first time running the code:
"row_index","is_valid","validation_error_message","all_validation_messages"
"0","true",,
The second time running the code:
"row_index","is_valid","validation_error_message","all_validation_messages"
"0",,,
Any ideas why this is happening?
syn.login()

# Get your RecordSet (must have a schema bound)
record_set = RecordSet(id="syn987654321").get()
Would it be helpful for this ID to link to an actual RecordSet? Right now it leads to a Page Unavailable message.
That is intentional. None of the examples across the project lead to real Synapse entities or use real IDs.
else:
    # At runtime, use object as a placeholder
    DataFrame = object
    Series = object
    np = object  # type: ignore[misc, assignment]
    nx = object  # type: ignore[misc, assignment]
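For context, here is a minimal sketch of the TYPE_CHECKING idiom that the else branch above is part of; the alias names mirror the snippet, though the exact contents of typing_utils may differ:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for static type checking; never executed at runtime,
    # so the optional dependencies are not required just to import the module.
    import networkx as nx
    import numpy as np
    from pandas import DataFrame, Series
else:
    # At runtime, use object as a placeholder
    DataFrame = object
    Series = object
    np = object  # type: ignore[misc, assignment]
    nx = object  # type: ignore[misc, assignment]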
Would we want to display a message in the case of ImportErrors, or when not type checking, to notify users that the specific types weren't available?
In several places we are using this function:
synapsePythonClient/synapseclient/core/utils.py
Lines 1553 to 1569 in 79ad99f
def test_import_pandas() -> None:
    """This function is called within other functions and methods to ensure that pandas is installed."""
    try:
        import pandas as pd  # noqa F401
    # used to catch when pandas isn't installed
    except ModuleNotFoundError:
        raise ModuleNotFoundError(
            """\n\nThe pandas package is required for this function!\n
        Most functions in the synapseclient package don't require the
        installation of pandas, but some do. Please refer to the installation
        instructions at: http://pandas.pydata.org/ or
        https://python-docs.synapse.org/tutorials/installation/#installation-guide-for-pypi-users.
        \n\n\n"""
        )
    # catch other errors (see SYNPY-177)
    except:  # noqa
        raise
That function accomplishes what you are mentioning: telling folks that they need to install the package. I don't think this typing_utils module is the appropriate place to put the message, because I don't want the message to print unless they actually try to call a function/method where the import is required.
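For illustration, this is how such a guard is typically used at the point of call (the load_frame helper below is hypothetical):

from synapseclient.core.utils import test_import_pandas


def load_frame(csv_path: str):
    """Hypothetical helper: fail fast with a helpful message if pandas is missing."""
    test_import_pandas()
    import pandas as pd

    return pd.read_csv(csv_path)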
I found that there are eventually-consistent race conditions that we have to contend with here. In the tests, what I found to consistently work was to place sleep statements in between the steps, like in this test: synapsePythonClient/tests/integration/synapseclient/models/synchronous/test_recordset.py, Lines 489 to 537 in 79ad99f.
So if I were to modify the script that you provided, I would add a sleep statement to it. Try this out and let me know:
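A sketch of what adding sleeps to that script could look like (the placement and durations below are assumptions, not the exact snippet from the comment; import path assumed as above):

import asyncio

from synapseclient import Synapse
from synapseclient.models import Grid, RecordSet  # import path assumed


async def main():
    syn = Synapse()
    syn.login()

    record_set = await RecordSet(id="syn71816405").get_async()

    grid = await Grid(record_set_id=record_set.id).create_async()
    await grid.export_to_record_set_async()
    # Give the backend time to finish the export before tearing down the session
    await asyncio.sleep(10)
    await grid.delete_async()

    # Wait again so the updated validation_file_handle_id is visible on re-fetch
    await asyncio.sleep(10)
    record_set = await record_set.get_async()

    results_df = await record_set.get_detailed_validation_results_async(
        download_location="/Users/lpeng/code/synapsePythonClient"
    )
    print(results_df)


asyncio.run(main())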
@BryanFauble Thanks for the suggestion. I tested it, and now the results are consistent!
linglp left a comment:
LGTM! Thanks for the hard work!
Problem:

- There is a validation_file_handle_id field for the RecordSet that points to a validation CSV file denoting the specific issues with the validation rules; the client model did not yet expose it (see the sketch after this summary).
- The TypeVar strings that we were using before were not actually rendering out to Pandas types.

Solution:

Testing:
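For illustration, a minimal sketch of reading the new field (the field name follows this PR's additions; the entity ID is a placeholder and the import path is assumed):

from synapseclient import Synapse
from synapseclient.models import RecordSet  # import path assumed from this PR's model layout

syn = Synapse()
syn.login()

# Placeholder ID; assumes validation results have already been generated for this RecordSet
record_set = RecordSet(id="syn123").get()

# New field exposed by this PR: the file handle ID of the validation CSV
print(record_set.validation_file_handle_id)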