
Commit 00a7dfe

Author: Lingling Peng (committed)
Merge branch 'develop' into synpy-1674-address-feedbacks
2 parents: db94d96 + 4361257

File tree: 27 files changed (+1904, -127 lines)

.github/workflows/build.yml

Lines changed: 2 additions & 2 deletions

@@ -51,7 +51,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-22.04, macos-13, windows-2022]
+        os: [ubuntu-22.04, macos-15-intel, windows-2022]

         # if changing the below change the run-integration-tests versions and the check-deploy versions
         # Make sure that we are running the integration tests on the first and last versions of the matrix
@@ -486,7 +486,7 @@ jobs:

     strategy:
       matrix:
-        os: [ubuntu-24.04, macos-13, windows-2022]
+        os: [ubuntu-24.04, macos-15-intel, windows-2022]

         # python versions should be consistent with the strategy matrix and the runs-integration-tests versions
         python: ["3.10", "3.11", "3.12", "3.13", "3.14"]

docs/guides/extensions/curator/metadata_curation.md

Lines changed: 254 additions & 0 deletions

@@ -10,6 +10,7 @@ By following this guide, you will:
 - Create a metadata curation workflow with automatic validation
 - Set up either file-based or record-based metadata collection
 - Configure curation tasks that guide collaborators through metadata entry
+- Retrieve and analyze detailed validation results to identify data quality issues

 ## Prerequisites

@@ -178,6 +179,256 @@ print(f" EntityView: {entity_view_id}")
 print(f" CurationTask: {task_id}")
 ```

+## Step 4: Work with metadata and validate (Record-based workflow)
+
+After creating a record-based metadata task, collaborators can enter metadata through the Grid interface. Once metadata entry is complete, you'll want to validate the data against your schema and identify any issues.
+
+### The metadata curation workflow
+
+1. **Data Entry**: Collaborators use the Grid interface (via the curation task link in the Synapse web UI) to enter metadata
+2. **Grid Export**: Export the Grid session back to the RecordSet to save changes (this can be done via the web UI or programmatically)
+3. **Validation**: Retrieve detailed validation results to identify schema violations
+4. **Correction**: Fix any validation errors and repeat as needed
+
+### Creating and exporting a Grid session
+
+Validation results are only generated when a Grid session is exported back to the RecordSet. This triggers Synapse to validate each row against the bound schema. You have two options:
+
+**Option A: Via the Synapse web UI (most common)**
+
+Users can access the curation task through the Synapse web interface, enter/edit data in the Grid, and click the export button. This automatically generates validation results.
+
+**Option B: Programmatically create and export a Grid session**
+
+```python
+from synapseclient import Synapse
+from synapseclient.models import RecordSet
+from synapseclient.models.curation import Grid
+
+syn = Synapse()
+syn.login()
+
+# Get your RecordSet (must have a schema bound)
+record_set = RecordSet(id="syn987654321").get()
+
+# Create a Grid session from the RecordSet
+grid = Grid(record_set_id=record_set.id).create()
+
+# At this point, users can interact with the Grid (either programmatically or via web UI).
+# When ready to save changes and generate validation results, export back to the RecordSet.
+grid.export_to_record_set()
+
+# Clean up the Grid session
+grid.delete()
+
+# Re-fetch the RecordSet to get the updated validation_file_handle_id
+record_set = RecordSet(id=record_set.id).get()
+```
+
+**Important**: The `validation_file_handle_id` attribute is only populated after a Grid export operation. Until then, `get_detailed_validation_results()` will return `None`.
+
+### Getting detailed validation results
+
+After exporting from a Grid session with a bound schema, Synapse automatically validates each row against the schema and generates a detailed validation report. Here's how to retrieve and analyze those results:
+
+```python
+from synapseclient import Synapse
+from synapseclient.models import RecordSet
+
+syn = Synapse()
+syn.login()
+
+# After Grid export (either via web UI or programmatically),
+# retrieve the updated RecordSet
+record_set = RecordSet(id="syn987654321").get()
+
+# Get detailed validation results as a pandas DataFrame
+validation_results = record_set.get_detailed_validation_results()
+
+if validation_results is not None:
+    print(f"Total rows validated: {len(validation_results)}")
+
+    # Filter for valid and invalid rows
+    valid_rows = validation_results[validation_results['is_valid'] == True]
+    invalid_rows = validation_results[validation_results['is_valid'] == False]
+
+    print(f"Valid rows: {len(valid_rows)}")
+    print(f"Invalid rows: {len(invalid_rows)}")
+
+    # Display details of any validation errors
+    if len(invalid_rows) > 0:
+        print("\nRows with validation errors:")
+        for idx, row in invalid_rows.iterrows():
+            print(f"\nRow {row['row_index']}:")
+            print(f"  Error: {row['validation_error_message']}")
+            print(f"  ValidationError: {row['all_validation_messages']}")
+else:
+    print("No validation results available. The Grid session must be exported to generate validation results.")
+```
+
+### Example: Complete validation workflow for animal study metadata
+
+This example demonstrates the full workflow from creating a curation task through validating the submitted metadata:
+
+```python
+from synapseclient import Synapse
+from synapseclient.extensions.curator import create_record_based_metadata_task, query_schema_registry
+from synapseclient.models import RecordSet
+from synapseclient.models.curation import Grid
+import pandas as pd
+import tempfile
+import os
+import time
+
+syn = Synapse()
+syn.login()
+
+# Step 1: Find the schema
+schema_uri = query_schema_registry(
+    synapse_client=syn,
+    dcc="ad",
+    datatype="IndividualAnimalMetadataTemplate"
+)
+
+# Step 1.5: Create initial test data with validation examples
+# Row 1: VALID - all required fields present and valid
+# Row 2: INVALID - missing required field 'genotype'
+# Row 3: INVALID - invalid enum value for 'sex' ("other" not in enum)
+test_data = pd.DataFrame({
+    "individualID": ["ANIMAL001", "ANIMAL002", "ANIMAL003"],
+    "species": ["Mouse", "Mouse", "Mouse"],
+    "sex": ["female", "male", "other"],  # Row 3: invalid enum
+    "genotype": ["5XFAD", None, "APOE4KI"],  # Row 2: missing required field
+    "genotypeBackground": ["C57BL/6J", "C57BL/6J", "C57BL/6J"],
+    "modelSystemName": ["5XFAD", "5XFAD", "APOE4KI"],
+    "dateBirth": ["2024-01-15", "2024-02-20", "2024-03-10"],
+    "individualIdSource": ["JAX", "JAX", "JAX"],
+})
+
+# Create a temporary CSV file with the test data
+temp_fd, temp_csv = tempfile.mkstemp(suffix=".csv")
+os.close(temp_fd)
+test_data.to_csv(temp_csv, index=False)
+
+# Step 2: Create the curation task (this creates an empty template RecordSet)
+record_set, curation_task, data_grid = create_record_based_metadata_task(
+    synapse_client=syn,
+    project_id="syn123456789",
+    folder_id="syn987654321",
+    record_set_name="AnimalMetadata_Records",
+    record_set_description="Animal study metadata with validation",
+    curation_task_name="AnimalMetadata_Validation_Example",
+    upsert_keys=["individualID"],
+    instructions="Enter metadata for each animal. All required fields must be completed.",
+    schema_uri=schema_uri,
+    bind_schema_to_record_set=True,
+)
+
+# Give the newly created task resources a moment to become available
+time.sleep(10)
+
+print(f"Curation task created with ID: {curation_task.task_id}")
+print(f"RecordSet created with ID: {record_set.id}")
+
+# Step 2.5: Upload the test data to the RecordSet
+record_set = RecordSet(id=record_set.id).get(synapse_client=syn)
+print("\nUploading test data to RecordSet...")
+record_set.path = temp_csv
+record_set = record_set.store(synapse_client=syn)
+print(f"Test data uploaded to RecordSet {record_set.id}")
+
+# Step 3: Collaborators enter data via the web UI, OR you can create/export a Grid programmatically.
+# For demonstration, here's the programmatic approach:
+print("\nCreating Grid session for data entry...")
+grid = Grid(record_set_id=record_set.id).create()
+print("Grid session created. Users can now enter data.")
+
+# After data entry is complete (either via web UI or programmatically),
+# export the Grid to generate validation results
+print("\nExporting Grid to RecordSet to generate validation results...")
+grid.export_to_record_set()
+
+# Clean up the Grid session
+grid.delete()
+print("Grid session exported and deleted.")
+
+# Step 4: Refresh the RecordSet to get the latest validation results
+print("\nRefreshing RecordSet to retrieve validation results...")
+record_set = RecordSet(id=record_set.id).get()
+
+# Step 5: Analyze validation results
+validation_df = record_set.get_detailed_validation_results()
+
+if validation_df is not None:
+    # Summary statistics
+    total_rows = len(validation_df)
+    valid_count = (validation_df['is_valid'] == True).sum()  # noqa: E712
+    invalid_count = (validation_df['is_valid'] == False).sum()  # noqa: E712
+
+    print("\n=== Validation Summary ===")
+    print(f"Total records: {total_rows}")
+    print(f"Valid records: {valid_count} ({valid_count}/{total_rows})")
+    print(f"Invalid records: {invalid_count} ({invalid_count}/{total_rows})")
+
+    # Group errors by type for better understanding
+    if invalid_count > 0:
+        invalid_rows = validation_df[validation_df['is_valid'] == False]  # noqa: E712
+
+        # Export detailed error report for review
+        error_report = invalid_rows[['row_index', 'validation_error_message', 'all_validation_messages']]
+        error_report_path = "validation_errors_report.csv"
+        error_report.to_csv(error_report_path, index=False)
+        print(f"\nDetailed error report saved to: {error_report_path}")
+
+        # Show first few errors as examples
+        print("\n=== Sample Validation Errors ===")
+        for idx, row in error_report.head(3).iterrows():
+            print(f"\nRow {row['row_index']}:")
+            print(f"  Error: {row['validation_error_message']}")
+            print(f"  ValidationError: {row['all_validation_messages']}")
+
+# Clean up temporary file
+if os.path.exists(temp_csv):
+    os.unlink(temp_csv)
+```
+
+Running this example, you would expect output like:
+
+```
+=== Sample Validation Errors ===
+
+Row 0:
+  Error: expected type: String, found: Long
+  ValidationError: ["#/dateBirth: expected type: String, found: Long"]
+
+Row 1:
+  Error: 2 schema violations found
+  ValidationError: ["#/genotype: expected type: String, found: Null","#/dateBirth: expected type: String, found: Long"]
+
+Row 2:
+  Error: 2 schema violations found
+  ValidationError: ["#/dateBirth: expected type: String, found: Long","#/sex: other is not a valid enum value"]
+```
+
+Note that even the row seeded as valid fails here: the `dateBirth` values appear to be ingested as numeric (`Long`) rather than `String`, so every row carries that violation in addition to the deliberately seeded errors.
+
+**Key points about validation results:**
+
+- **Automatic generation**: Validation results are created automatically when you export data from a Grid session with a bound schema
+- **Row-level detail**: Each row in your RecordSet gets its own validation status and error messages
+- **Multiple violations**: The `all_validation_messages` column contains all schema violations for a row, not just the first one
+- **Iterative correction**: Use the validation results to identify issues, make corrections in the Grid, export again, and re-validate (a sketch of this loop follows this file's diff)
+
+### When validation results are available
+
+Validation results are only available after:
+1. A JSON schema has been bound to the RecordSet (set `bind_schema_to_record_set=True` when creating the task)
+2. Data has been entered through a Grid session
+3. **The Grid session has been exported back to the RecordSet** - This is the critical step that triggers validation and populates the `validation_file_handle_id` attribute
+
+The export can happen in two ways:
+- **Via the Synapse web UI**: Users click the export/save button in the Grid interface
+- **Programmatically**: Call `grid.export_to_record_set()` after creating a Grid session
+
+If `get_detailed_validation_results()` returns `None`, the most common reason is that the Grid session hasn't been exported yet. Check that `record_set.validation_file_handle_id` is not `None` after exporting.
+
 ## Additional utilities

 ### Validate schema binding on folders

@@ -227,6 +478,9 @@ for curation_task in CurationTask.list(
 - [query_schema_registry][synapseclient.extensions.curator.query_schema_registry] - Search for schemas in the registry
 - [create_record_based_metadata_task][synapseclient.extensions.curator.create_record_based_metadata_task] - Create RecordSet-based curation workflows
 - [create_file_based_metadata_task][synapseclient.extensions.curator.create_file_based_metadata_task] - Create EntityView-based curation workflows
+- [RecordSet.get_detailed_validation_results][synapseclient.models.RecordSet.get_detailed_validation_results] - Get detailed validation results for RecordSet data
+- [Grid.create][synapseclient.models.curation.Grid.create] - Create a Grid session from a RecordSet
+- [Grid.export_to_record_set][synapseclient.models.curation.Grid.export_to_record_set] - Export Grid data back to RecordSet and generate validation results
 - [Folder.bind_schema][synapseclient.models.Folder.bind_schema] - Bind schemas to folders
 - [Folder.validate_schema][synapseclient.models.Folder.validate_schema] - Validate folder schema compliance
 - [CurationTask.list][synapseclient.models.CurationTask.list] - List curation tasks in a project
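
The guide's closing points describe an iterative loop (validate, correct in the Grid, re-export) and a `None` check on `validation_file_handle_id`. A minimal sketch combining both, using only the `RecordSet` and `Grid` calls shown in the guide above (the three-pass cap and the Synapse ID are illustrative placeholders):

```python
from synapseclient import Synapse
from synapseclient.models import RecordSet
from synapseclient.models.curation import Grid

syn = Synapse()
syn.login()

record_set = RecordSet(id="syn987654321").get()

# Validation results exist only after at least one Grid export.
if record_set.validation_file_handle_id is None:
    print("No validation results yet - export a Grid session first.")
else:
    for attempt in range(3):  # illustrative cap on correction passes
        results = record_set.get_detailed_validation_results()
        invalid = results[results["is_valid"] == False]  # noqa: E712
        if invalid.empty:
            print("All rows pass schema validation.")
            break
        print(f"Pass {attempt + 1}: {len(invalid)} invalid rows remain.")

        # Open a Grid session so collaborators can correct the flagged rows
        # (in practice the corrections happen in the Grid web UI here), then
        # export to re-run validation and clean up the session.
        grid = Grid(record_set_id=record_set.id).create()
        grid.export_to_record_set()
        grid.delete()

        # Re-fetch to pick up the refreshed validation results.
        record_set = RecordSet(id=record_set.id).get()
```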

docs/reference/experimental/async/curator.md

Lines changed: 1 addition & 0 deletions

@@ -25,6 +25,7 @@ at your own risk.
 - get_async
 - store_async
 - delete_async
+- get_detailed_validation_results_async
 - get_acl_async
 - get_permissions_async
 - set_permissions_async

docs/reference/experimental/sync/curator.md

Lines changed: 1 addition & 0 deletions

@@ -25,6 +25,7 @@ at your own risk.
 - get
 - store
 - delete
+- get_detailed_validation_results
 - get_acl
 - get_permissions
 - set_permissions
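
The reference lists above include async counterparts such as `get_detailed_validation_results_async`. A minimal sketch of the async variant, assuming these methods mirror the sync calls shown in the guide (the Synapse ID is a placeholder):

```python
import asyncio

from synapseclient import Synapse
from synapseclient.models import RecordSet


async def main() -> None:
    syn = Synapse()
    syn.login()

    # Assumed to mirror RecordSet(...).get() and get_detailed_validation_results()
    record_set = await RecordSet(id="syn987654321").get_async()
    results = await record_set.get_detailed_validation_results_async()
    if results is not None:
        print(f"Total rows validated: {len(results)}")


asyncio.run(main())
```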

synapseclient/core/typing_utils.py

Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
+"""Typing utilities for optional dependencies.
+
+This module provides type aliases for optional dependencies like pandas and numpy,
+allowing proper type checking without requiring these packages to be installed.
+"""
+
+from typing import TYPE_CHECKING, Any
+
+if TYPE_CHECKING:
+    try:
+        from pandas import DataFrame, Series
+    except ImportError:
+        DataFrame = Any  # type: ignore[misc, assignment]
+        Series = Any  # type: ignore[misc, assignment]
+
+    try:
+        import numpy as np
+    except ImportError:
+        np = Any  # type: ignore[misc, assignment]
+
+    try:
+        import networkx as nx
+    except ImportError:
+        nx = Any  # type: ignore[misc, assignment]
+else:
+    # At runtime, use object as a placeholder
+    DataFrame = object
+    Series = object
+    np = object  # type: ignore[misc, assignment]
+    nx = object  # type: ignore[misc, assignment]
+
+__all__ = ["DataFrame", "Series", "np", "nx"]
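
A minimal sketch of how these aliases are meant to be used (annotations only, since at runtime they are plain `object` placeholders); the `row_count` function is hypothetical, not part of this commit:

```python
from typing import Optional

from synapseclient.core.typing_utils import DataFrame


def row_count(frame: Optional[DataFrame] = None) -> int:
    # Type checkers resolve DataFrame to pandas.DataFrame (or Any when
    # pandas is absent); at runtime it is just `object`, so never use it
    # with isinstance() or try to instantiate it.
    return 0 if frame is None else len(frame)
```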

synapseclient/core/upload/multipart_upload_async.py

Lines changed: 1 addition & 3 deletions

@@ -79,7 +79,6 @@
     Mapping,
     Optional,
     Tuple,
-    TypeVar,
     Union,
 )

@@ -107,6 +106,7 @@
 )
 from synapseclient.core.otel_config import get_tracer
 from synapseclient.core.retry import with_retry_time_based
+from synapseclient.core.typing_utils import DataFrame as DATA_FRAME_TYPE
 from synapseclient.core.upload.upload_utils import (
     copy_md5_fn,
     copy_part_request_body_provider_fn,
@@ -123,8 +123,6 @@
 if TYPE_CHECKING:
     from synapseclient import Synapse

-DATA_FRAME_TYPE = TypeVar("pd.DataFrame")
-
 # AWS limits
 MAX_NUMBER_OF_PARTS = 10000
 MIN_PART_SIZE = 5 * MB

synapseclient/core/upload/upload_utils.py

Lines changed: 2 additions & 2 deletions

@@ -3,9 +3,9 @@
 import math
 import re
 from io import BytesIO, StringIO
-from typing import Any, Dict, Optional, TypeVar, Union
+from typing import Any, Dict, Optional, Union

-DATA_FRAME_TYPE = TypeVar("pd.DataFrame")
+from synapseclient.core.typing_utils import DataFrame as DATA_FRAME_TYPE


 def get_partial_dataframe_chunk(
