254 changes: 254 additions & 0 deletions docs/guides/extensions/curator/metadata_curation.md
@@ -10,6 +10,7 @@ By following this guide, you will:
- Create a metadata curation workflow with automatic validation
- Set up either file-based or record-based metadata collection
- Configure curation tasks that guide collaborators through metadata entry
- Retrieve and analyze detailed validation results to identify data quality issues

## Prerequisites

@@ -178,6 +179,256 @@ print(f" EntityView: {entity_view_id}")
print(f" CurationTask: {task_id}")
```

## Step 4: Work with metadata and validate (Record-based workflow)

After creating a record-based metadata task, collaborators can enter metadata through the Grid interface. Once metadata entry is complete, you'll want to validate the data against your schema and identify any issues.

### The metadata curation workflow
> **Review (Member Author), on lines +182 to +186:** I created https://sagebionetworks.jira.com/browse/SYNPY-1712 so we can revisit this and make doing this work a bit easier for folks

1. **Data Entry**: Collaborators use the Grid interface (via the curation task link in the Synapse web UI) to enter metadata
2. **Grid Export**: Export the Grid session back to the RecordSet to save changes (this can be done via the web UI or programmatically)
3. **Validation**: Retrieve detailed validation results to identify schema violations
4. **Correction**: Fix any validation errors and repeat as needed (the loop is condensed in the sketch below)
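
Condensed, one pass of this loop looks like the following sketch. It uses only the calls covered in the rest of this guide; `syn987654321` is a placeholder ID, and in practice you repeat the pass until no rows fail validation.

```python
from synapseclient import Synapse
from synapseclient.models import RecordSet
from synapseclient.models.curation import Grid

syn = Synapse()
syn.login()

# One export -> validate pass over a RecordSet with a bound schema
grid = Grid(record_set_id="syn987654321").create()
grid.export_to_record_set()  # saves edits and triggers schema validation
grid.delete()

record_set = RecordSet(id="syn987654321").get()
results = record_set.get_detailed_validation_results()
if results is not None and not results["is_valid"].all():
    # These rows still need correction in the next pass
    print(results[results["is_valid"] == False])  # noqa: E712
```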

### Creating and exporting a Grid session

Validation results are only generated when a Grid session is exported back to the RecordSet. This triggers Synapse to validate each row against the bound schema. You have two options:

**Option A: Via the Synapse web UI (most common)**

Users can access the curation task through the Synapse web interface, enter/edit data in the Grid, and click the export button. This automatically generates validation results.

**Option B: Programmatically create and export a Grid session**

```python
from synapseclient import Synapse
from synapseclient.models import RecordSet
from synapseclient.models.curation import Grid

syn = Synapse()
syn.login()

# Get your RecordSet (must have a schema bound)
record_set = RecordSet(id="syn987654321").get()
# Create a Grid session from the RecordSet
grid = Grid(record_set_id=record_set.id).create()

# At this point, users can interact with the Grid (either programmatically or via web UI)
# When ready to save changes and generate validation results, export back to RecordSet
grid.export_to_record_set()

# Clean up the Grid session
grid.delete()

# Re-fetch the RecordSet to get the updated validation_file_handle_id
record_set = RecordSet(id=record_set.id).get()
```

> **Review (Contributor), on the `RecordSet(id="syn987654321")` line:** would it be helpful for this id to link to an actual record set? right now it leads to a Page Unavailable message
>
> **Review reply (Member Author):** That is intentional. None of the examples across all of the project lead to real Synapse entities, or use real IDs.

**Important**: The `validation_file_handle_id` attribute is only populated after a Grid export operation. Until then, `get_detailed_validation_results()` will return `None`.

### Getting detailed validation results

After exporting from a Grid session with a bound schema, Synapse automatically validates each row against the schema and generates a detailed validation report. Here's how to retrieve and analyze those results:

```python
from synapseclient import Synapse
from synapseclient.models import RecordSet

syn = Synapse()
syn.login()

# After Grid export (either via web UI or programmatically)
# retrieve the updated RecordSet
record_set = RecordSet(id="syn987654321").get()

# Get detailed validation results as a pandas DataFrame
validation_results = record_set.get_detailed_validation_results()

if validation_results is not None:
    print(f"Total rows validated: {len(validation_results)}")

    # Filter for valid and invalid rows
    valid_rows = validation_results[validation_results['is_valid'] == True]
    invalid_rows = validation_results[validation_results['is_valid'] == False]

    print(f"Valid rows: {len(valid_rows)}")
    print(f"Invalid rows: {len(invalid_rows)}")

    # Display details of any validation errors
    if len(invalid_rows) > 0:
        print("\nRows with validation errors:")
        for idx, row in invalid_rows.iterrows():
            print(f"\nRow {row['row_index']}:")
            print(f" Error: {row['validation_error_message']}")
            print(f" ValidationError: {row['all_validation_messages']}")
else:
    print("No validation results available. The Grid session must be exported to generate validation results.")
```

### Example: Complete validation workflow for animal study metadata

This example demonstrates the full workflow from creating a curation task through validating the submitted metadata:

```python
from synapseclient import Synapse
from synapseclient.extensions.curator import create_record_based_metadata_task, query_schema_registry
from synapseclient.models import RecordSet
from synapseclient.models.curation import Grid
import pandas as pd
import tempfile
import os
import time

syn = Synapse()
syn.login()

# Step 1: Find the schema
schema_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate",
)

# Step 1.5: Create initial test data with validation examples
# Row 1: VALID - all required fields present and valid
# Row 2: INVALID - missing required field 'genotype'
# Row 3: INVALID - invalid enum value for 'sex' ("other" not in enum)
test_data = pd.DataFrame({
    "individualID": ["ANIMAL001", "ANIMAL002", "ANIMAL003"],
    "species": ["Mouse", "Mouse", "Mouse"],
    "sex": ["female", "male", "other"],  # Row 3: invalid enum
    "genotype": ["5XFAD", None, "APOE4KI"],  # Row 2: missing required field
    "genotypeBackground": ["C57BL/6J", "C57BL/6J", "C57BL/6J"],
    "modelSystemName": ["5XFAD", "5XFAD", "APOE4KI"],
    "dateBirth": ["2024-01-15", "2024-02-20", "2024-03-10"],
    "individualIdSource": ["JAX", "JAX", "JAX"],
})

# Create a temporary CSV file with the test data
temp_fd, temp_csv = tempfile.mkstemp(suffix=".csv")
os.close(temp_fd)
test_data.to_csv(temp_csv, index=False)

# Step 2: Create the curation task (this creates an empty template RecordSet)
record_set, curation_task, data_grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn123456789",
    folder_id="syn987654321",
    record_set_name="AnimalMetadata_Records",
    record_set_description="Animal study metadata with validation",
    curation_task_name="AnimalMetadata_Validation_Example",
    upsert_keys=["individualID"],
    instructions="Enter metadata for each animal. All required fields must be completed.",
    schema_uri=schema_uri,
    bind_schema_to_record_set=True,
)

# Brief pause to allow the newly created task and RecordSet to settle
time.sleep(10)

print(f"Curation task created with ID: {curation_task.task_id}")
print(f"RecordSet created with ID: {record_set.id}")

# Step 2.5: Upload the test data to the RecordSet
record_set = RecordSet(id=record_set.id).get(synapse_client=syn)
print("\nUploading test data to RecordSet...")
record_set.path = temp_csv
record_set = record_set.store(synapse_client=syn)
print(f"Test data uploaded to RecordSet {record_set.id}")

# Step 3: Collaborators enter data via the web UI, OR you can create/export a Grid programmatically
# For demonstration, here's the programmatic approach:
print("\nCreating Grid session for data entry...")
grid = Grid(record_set_id=record_set.id).create()
print("Grid session created. Users can now enter data.")

# After data entry is complete (either via web UI or programmatically),
# export the Grid to generate validation results
print("\nExporting Grid to RecordSet to generate validation results...")
grid.export_to_record_set()

# Clean up the Grid session
grid.delete()
print("Grid session exported and deleted.")

# Step 4: Refresh the RecordSet to get the latest validation results
print("\nRefreshing RecordSet to retrieve validation results...")
record_set = RecordSet(id=record_set.id).get()

# Step 5: Analyze validation results
validation_df = record_set.get_detailed_validation_results()

if validation_df is not None:
    # Summary statistics
    total_rows = len(validation_df)
    valid_count = (validation_df['is_valid'] == True).sum()  # noqa: E712
    invalid_count = (validation_df['is_valid'] == False).sum()  # noqa: E712

    print("\n=== Validation Summary ===")
    print(f"Total records: {total_rows}")
    print(f"Valid records: {valid_count} ({valid_count}/{total_rows})")
    print(f"Invalid records: {invalid_count} ({invalid_count}/{total_rows})")

    # Group errors by type for better understanding
    if invalid_count > 0:
        invalid_rows = validation_df[validation_df['is_valid'] == False]  # noqa: E712

        # Export detailed error report for review
        error_report = invalid_rows[['row_index', 'validation_error_message', 'all_validation_messages']]
        error_report_path = "validation_errors_report.csv"
        error_report.to_csv(error_report_path, index=False)
        print(f"\nDetailed error report saved to: {error_report_path}")

        # Show first few errors as examples
        print("\n=== Sample Validation Errors ===")
        for idx, row in error_report.head(3).iterrows():
            print(f"\nRow {row['row_index']}:")
            print(f" Error: {row['validation_error_message']}")
            print(f" ValidationError: {row['all_validation_messages']}")

# Clean up temporary file
if os.path.exists(temp_csv):
    os.unlink(temp_csv)
```

In this example you would expect results like the following. Note that every row also reports a `dateBirth` type violation (`expected type: String, found: Long`), not just the two rows with intentional errors:

```
=== Sample Validation Errors ===

Row 0:
Error: expected type: String, found: Long
ValidationError: ["#/dateBirth: expected type: String, found: Long"]

Row 1:
Error: 2 schema violations found
ValidationError: ["#/genotype: expected type: String, found: Null","#/dateBirth: expected type: String, found: Long"]

Row 2:
Error: 2 schema violations found
ValidationError: ["#/dateBirth: expected type: String, found: Long","#/sex: other is not a valid enum value"]
```

**Key points about validation results:**

- **Automatic generation**: Validation results are created automatically when you export data from a Grid session with a bound schema
- **Row-level detail**: Each row in your RecordSet gets its own validation status and error messages
- **Multiple violations**: The `all_validation_messages` column contains all schema violations for a row, not just the first one; the sketch after this list shows one way to tally them
- **Iterative correction**: Use the validation results to identify issues, make corrections in the Grid, export again, and re-validate
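
For instance, a quick way to see which fields fail most often is to tally the per-row messages. This is a sketch that assumes `all_validation_messages` holds JSON-style arrays formatted like the sample output above (`"#/field: message"`); adjust the parsing if your report differs.

```python
import json
from collections import Counter

def violation_counts(invalid_rows) -> Counter:
    """Count schema violations per field across all invalid rows."""
    counts = Counter()
    for messages in invalid_rows["all_validation_messages"]:
        for message in json.loads(messages):
            # Messages look like "#/dateBirth: expected type: String, found: Long"
            field = message.split(":", 1)[0].lstrip("#/")
            counts[field] += 1
    return counts

# For the example above this prints roughly:
# Counter({'dateBirth': 3, 'genotype': 1, 'sex': 1})
print(violation_counts(invalid_rows))
```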

### When validation results are available

Validation results are only available after:
1. A JSON schema has been bound to the RecordSet (set `bind_schema_to_record_set=True` when creating the task)
2. Data has been entered through a Grid session
3. **The Grid session has been exported back to the RecordSet** - This is the critical step that triggers validation and populates the `validation_file_handle_id` attribute

The export can happen in two ways:
- **Via the Synapse web UI**: Users click the export/save button in the Grid interface
- **Programmatically**: Call `grid.export_to_record_set()` after creating a Grid session

If `get_detailed_validation_results()` returns `None`, the most common reason is that the Grid session hasn't been exported yet. Check that `record_set.validation_file_handle_id` is not `None` after exporting.
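
As a quick sanity check, you can inspect the attribute directly before asking for results (a minimal sketch using the same placeholder ID as above):

```python
from synapseclient import Synapse
from synapseclient.models import RecordSet

syn = Synapse()
syn.login()

record_set = RecordSet(id="syn987654321").get()
if record_set.validation_file_handle_id is None:
    print("No validation results yet - export a Grid session first.")
else:
    validation_results = record_set.get_detailed_validation_results()
```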

## Additional utilities

### Validate schema binding on folders
@@ -227,6 +478,9 @@ for curation_task in CurationTask.list(
- [query_schema_registry][synapseclient.extensions.curator.query_schema_registry] - Search for schemas in the registry
- [create_record_based_metadata_task][synapseclient.extensions.curator.create_record_based_metadata_task] - Create RecordSet-based curation workflows
- [create_file_based_metadata_task][synapseclient.extensions.curator.create_file_based_metadata_task] - Create EntityView-based curation workflows
- [RecordSet.get_detailed_validation_results][synapseclient.models.RecordSet.get_detailed_validation_results] - Get detailed validation results for RecordSet data
- [Grid.create][synapseclient.models.curation.Grid.create] - Create a Grid session from a RecordSet
- [Grid.export_to_record_set][synapseclient.models.curation.Grid.export_to_record_set] - Export Grid data back to RecordSet and generate validation results
- [Folder.bind_schema][synapseclient.models.Folder.bind_schema] - Bind schemas to folders
- [Folder.validate_schema][synapseclient.models.Folder.validate_schema] - Validate folder schema compliance
- [CurationTask.list][synapseclient.models.CurationTask.list] - List curation tasks in a project
1 change: 1 addition & 0 deletions docs/reference/experimental/async/curator.md
@@ -25,6 +25,7 @@ at your own risk.
- get_async
- store_async
- delete_async
- get_detailed_validation_results_async
- get_acl_async
- get_permissions_async
- set_permissions_async
1 change: 1 addition & 0 deletions docs/reference/experimental/sync/curator.md
@@ -25,6 +25,7 @@ at your own risk.
- get
- store
- delete
- get_detailed_validation_results
- get_acl
- get_permissions
- set_permissions
32 changes: 32 additions & 0 deletions synapseclient/core/typing_utils.py
> **Review (Member):** This is fantastic

@@ -0,0 +1,32 @@
"""Typing utilities for optional dependencies.
This module provides type aliases for optional dependencies like pandas and numpy,
allowing proper type checking without requiring these packages to be installed.
"""

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    try:
        from pandas import DataFrame, Series
    except ImportError:
        DataFrame = Any  # type: ignore[misc, assignment]
        Series = Any  # type: ignore[misc, assignment]

    try:
        import numpy as np
    except ImportError:
        np = Any  # type: ignore[misc, assignment]

    try:
        import networkx as nx
    except ImportError:
        nx = Any  # type: ignore[misc, assignment]
else:
    # At runtime, use object as a placeholder
    DataFrame = object
    Series = object
    np = object  # type: ignore[misc, assignment]
    nx = object  # type: ignore[misc, assignment]
> **Review (Contributor), on lines +25 to +30:** Would we want to display a message in the cases of ImportErrors or not type checking to notify users that the specific types weren't available?
>
> **Review reply (Member Author):** In several places we are using this function:
>
> ```python
> def test_import_pandas() -> None:
>     """This function is called within other functions and methods to ensure that pandas is installed."""
>     try:
>         import pandas as pd  # noqa F401
>     # used to catch when pandas isn't installed
>     except ModuleNotFoundError:
>         raise ModuleNotFoundError(
>             """\n\nThe pandas package is required for this function!\n
>             Most functions in the synapseclient package don't require the
>             installation of pandas, but some do. Please refer to the installation
>             instructions at: http://pandas.pydata.org/ or
>             https://python-docs.synapse.org/tutorials/installation/#installation-guide-for-pypi-users.
>             \n\n\n"""
>         )
>     # catch other errors (see SYNPY-177)
>     except:  # noqa
>         raise
> ```
>
> Which accomplishes what you are mentioning: telling folks that they need to install the package. I don't think this typing_utils module is the appropriate place to put the message, because I don't want the message to print unless they actually try to call a function/method where the import is required.


__all__ = ["DataFrame", "Series", "np", "nx"]
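
To illustrate the intended pattern, a hypothetical call site could annotate against the alias like this (not part of this diff):

```python
from synapseclient.core.typing_utils import DataFrame


def row_count(frame: "DataFrame") -> int:
    """A type checker resolves ``DataFrame`` to pandas.DataFrame when pandas
    is available, while the module still imports when pandas is absent."""
    return len(frame)
```
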
4 changes: 1 addition & 3 deletions synapseclient/core/upload/multipart_upload_async.py
@@ -79,7 +79,6 @@
    Mapping,
    Optional,
    Tuple,
    TypeVar,
    Union,
)

@@ -107,6 +106,7 @@
)
from synapseclient.core.otel_config import get_tracer
from synapseclient.core.retry import with_retry_time_based
from synapseclient.core.typing_utils import DataFrame as DATA_FRAME_TYPE
from synapseclient.core.upload.upload_utils import (
    copy_md5_fn,
    copy_part_request_body_provider_fn,
@@ -123,8 +123,6 @@
if TYPE_CHECKING:
    from synapseclient import Synapse

DATA_FRAME_TYPE = TypeVar("pd.DataFrame")

# AWS limits
MAX_NUMBER_OF_PARTS = 10000
MIN_PART_SIZE = 5 * MB
4 changes: 2 additions & 2 deletions synapseclient/core/upload/upload_utils.py
@@ -3,9 +3,9 @@
import math
import re
from io import BytesIO, StringIO
from typing import Any, Dict, Optional, TypeVar, Union
from typing import Any, Dict, Optional, Union

DATA_FRAME_TYPE = TypeVar("pd.DataFrame")
from synapseclient.core.typing_utils import DataFrame as DATA_FRAME_TYPE


def get_partial_dataframe_chunk(
19 changes: 19 additions & 0 deletions synapseclient/core/utils.py
@@ -1548,3 +1548,22 @@ def log_dataclass_diff(
        value2 = getattr(obj2, field.name)
        if value1 != value2:
            logger.info(f"{prefix}{field.name}: {value1} -> {value2}")


def test_import_pandas() -> None:
    """This function is called within other functions and methods to ensure that pandas is installed."""
    try:
        import pandas as pd  # noqa F401
    # used to catch when pandas isn't installed
    except ModuleNotFoundError:
        raise ModuleNotFoundError(
            """\n\nThe pandas package is required for this function!\n
            Most functions in the synapseclient package don't require the
            installation of pandas, but some do. Please refer to the installation
            instructions at: http://pandas.pydata.org/ or
            https://python-docs.synapse.org/tutorials/installation/#installation-guide-for-pypi-users.
            \n\n\n"""
        )
    # catch other errors (see SYNPY-177)
    except:  # noqa
        raise
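
As a usage illustration, a pandas-dependent helper would call this guard before importing (hypothetical function, not part of this diff):

```python
from synapseclient.core.utils import test_import_pandas


def rows_to_dataframe(rows):
    """Hypothetical pandas-dependent helper."""
    test_import_pandas()  # raises a descriptive ModuleNotFoundError if pandas is missing
    import pandas as pd

    return pd.DataFrame(rows)
```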