Merged
Changes from 2 commits
90 changes: 53 additions & 37 deletions docs/explanations/curator_data_model.md
@@ -15,8 +15,8 @@ The CSV data model described in this tutorial formalizes this structure:

Here is the Patient described above represented as a CSV data model:

-| Attribute | DependsOn |
-|---|---|
+| Attribute | DependsOn           |
+|-----------|---------------------|
| Patient | "Age, Gender, Name" |
| Age | |
| Gender | |
@@ -48,9 +48,20 @@ The end goal is to create a JSON Schema that can be used in Curator. A JSON Sche

Note: Individual columns are covered later on this page.

+These columns must be present in your CSV data model:
+
+- `Attribute`
+- `DependsOn`
+- `Description`
+- `Valid Values`
+- `Required`
+- `Parent`
+- `Validation Rules`
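
Editorial sketch (not part of the PR): the required-column rule above can be checked before handing a file to Curator. The helper name `missing_required_headers` is hypothetical, and the check is an assumption about how such validation could look, not the library's own implementation.

```python
import csv
import io

# Required headers, as listed in the documentation change above.
REQUIRED_HEADERS = [
    "Attribute",
    "DependsOn",
    "Description",
    "Valid Values",
    "Required",
    "Parent",
    "Validation Rules",
]


def missing_required_headers(csv_text: str) -> list[str]:
    """Return the required headers that are absent from the CSV's first row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file has no header row
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_HEADERS if name not in present]


# A model that forgot the Parent column.
example = (
    "Attribute,DependsOn,Description,Valid Values,Required,Validation Rules\n"
    'Patient,"Age, Gender",,,,\n'
)
print(missing_required_headers(example))  # ['Parent']
```

Extra columns such as `columnType` or `Format` are ignored by this check; it only reports required columns that are missing.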

Defining data types:

 - Put a unique data type name in the `Attribute` column.
+- Put the value `DataType` in the `Parent` column.
Contributor

Could you confirm whether this step is required? If I remember correctly, the Parent column is only used for data visualization, so it shouldn’t be required.

Contributor

Oh, never mind. I tried without the Parent column and got an error:

 File "/Users/lpeng/code/synapsePythonClient/synapseclient/extensions/curator/schema_generation.py", line 606, in check_schema_definition
    raise ValueError(
    ...<2 lines>...
    )
ValueError: Schema extension headers: {'DependsOn', 'Format', 'Description', 'Source', 'DependsOn Component', 'Pattern', 'Valid Values', 'Maximum', 'Required', 'Minimum', 'Attribute', 'Properties', 'columnType', 'Validation Rules'} do not match required schema headers: ['Attribute', 'Description', 'Valid Values', 'DependsOn', 'Required', 'Parent', 'Validation Rules']

Contributor Author

It's required in the sense that it's how we differentiate between actual data types and other rows that have attributes in the DependsOn column but exist only for conditional dependencies. In the Conditional Dependencies section example, Patient is an actual data type, but Cancer isn't. They both have DependsOn values, so the way to tell them apart is the Parent value. This is used by generate_json_schema when the user doesn't submit a data type, which means a JSON Schema is created for every data type.

It's also required in the sense that the DataModelRelationships class throws an error if the column isn't present in the CSV.

Contributor Author

> I think for Valid Values, the Parent column also needs to be filled

It doesn't for the purpose of creating JSON Schema (I just tried).

Contributor

Okay, got it. Thank you for explaining. Maybe we should just rename the Parent column to DataType and fill it with True or False in the future.

Contributor Author

That would probably be the best way to do it if we were starting from scratch :)

 - List at least one attribute in the `DependsOn` column (comma-separated).
 - Optionally add a description to the `Description` column.

@@ -79,8 +90,8 @@ Set of possible values for the current attribute. This attribute will be an enum
Data Model:

| Attribute | DependsOn | Valid Values |
-|---|---|---|
-| Patient | "Gender" | |
+|-----------|-----------|-----------------------|
+| Patient   | "Gender"  |                       |
| Gender | | "Female, Male, Other" |

JSON Schema output:
@@ -107,8 +118,8 @@ Note: Leaving this empty is the equivalent of `False`.
Data Model:

| Attribute | DependsOn | Required |
-|---|---|---|
-| Patient | "Gender, Age" | |
+|-----------|----------------|----------|
+| Patient   | "Gender, Age"  |          |
| Gender | | True |
| Age | | False |

@@ -131,6 +142,10 @@ JSON Schema output:
}
```

+### Parent
+
+This is mostly a remnant of the Schematic data model. It is currently used to find all the data types in the data model. Put the value `DataType` in this column if this row is a data type. Other values are currently ignored.
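
Editorial sketch (not part of the PR): the rule above, that a row is a data type when its `Parent` value is `DataType`, can be illustrated with the standard `csv` module. The function name `find_data_types` is hypothetical, and the exact-match comparison is an assumption; the library may normalize values differently.

```python
import csv
import io


def find_data_types(csv_text: str) -> list[str]:
    """Return the Attribute names of rows whose Parent value is 'DataType'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["Attribute"]
        for row in reader
        # Assumption: an exact match on "DataType"; any other value is ignored.
        if (row.get("Parent") or "").strip() == "DataType"
    ]


model = (
    "Attribute,DependsOn,Parent\n"
    'Patient,"Diagnosis",DataType\n'
    "Diagnosis,,\n"
    'Cancer,"Cancer Type, Family History",\n'
)
print(find_data_types(model))  # ['Patient']
```

Here `Cancer` has `DependsOn` values but no `Parent` value, so it is not reported as a data type.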

### columnType

The data type of this attribute. See [type](https://json-schema.org/understanding-json-schema/reference/type).
@@ -147,11 +162,11 @@ Must be one of:

Data Model:

-| Attribute | DependsOn | columnType |
-|---|---|---|
-| Patient | "Gender, Hobbies" | |
-| Gender | | string |
-| Hobbies | | string_list |
+| Attribute | DependsOn         | columnType  | Parent   |
+|-----------|-------------------|-------------|----------|
+| Patient   | "Gender, Hobbies" |             | DataType |
+| Gender    |                   | string      |          |
+| Hobbies   |                   | string_list |          |

JSON Schema output:

@@ -196,11 +211,11 @@ The format of this attribute. See [format](https://json-schema.org/understanding

Data Model:

-| Attribute | DependsOn | columnType | Format |
-|---|---|---|---|
-| Patient | "Gender, Birth Date" | | |
-| Gender | | string | |
-| Birth Date | | string | date |
+| Attribute       | DependsOn            | columnType  | Format | Parent   |
+|-----------------|----------------------|-------------|--------|----------|
+| Patient         | "Gender, Birth Date" |             |        | DataType |
+| Gender          |                      | string      |        |          |
+| Birth Date      |                      | string      | date   |          |

JSON Schema output:

@@ -229,11 +244,11 @@ The regex pattern this attribute must match. The type of this attribute must be

Data Model:

-| Attribute | DependsOn | columnType | Pattern |
-|---|---|---|---|
-| Patient | "Gender, ID" | | |
-| Gender | | string | |
-| ID | | string | [a-f] |
+| Attribute | DependsOn     | columnType  | Pattern | Parent   |
+|-----------|---------------|-------------|---------|----------|
+| Patient   | "Gender, ID"  |             |         | DataType |
+| Gender    |               | string      |         |          |
+| ID        |               | string      | [a-f]   |          |

JSON Schema output:

@@ -262,12 +277,12 @@ The range that this attribute's numeric values must fall within. The type of thi

Data Model:

-| Attribute | DependsOn | columnType | Minimum | Maximum |
-|---|---|---|---|---|
-| Patient | "Age, Weight, Health Score" | | | |
-| Age | | integer | 0 | 120 |
-| Weight | | number | 0.0 | |
-| Health Score | | number | 0.0 | 1.0 |
+| Attribute    | DependsOn                   | columnType  | Minimum | Maximum | Parent   |
+|--------------|-----------------------------|-------------|---------|---------|----------|
+| Patient      | "Age, Weight, Health Score" |             |         |         | DataType |
+| Age          |                             | integer     | 0       | 120     |          |
+| Weight       |                             | number      | 0.0     |         |          |
+| Health Score |                             | number      | 0.0     | 1.0     |          |

JSON Schema output:

@@ -301,9 +316,9 @@ JSON Schema output:

### Validation Rules (deprecated)

-This is a remnant from Schematic. It is still used (for now) to translate certain validation rules to other JSON Schema keywords.
+This is a remnant from Schematic. It is still required and in use (for now) to translate certain validation rules to other JSON Schema keywords.

-If you are starting a new data model, DO NOT use this column.
+If you are starting a new data model, DO NOT fill out this column; just leave it blank.

If you have an existing data model using any of the following validation rules, follow these instructions to update it:

@@ -315,26 +330,27 @@ If you have an existing data model using any of the following validation rules,

## Conditional dependencies

-The `DependsOn` and `Valid Values` columns can be used together to flexibly define conditional logic for determining the relevant attributes for a data type.
+The `DependsOn`, `Valid Values` and `Parent` columns can be used together to flexibly define conditional logic for determining the relevant attributes for a data type.

In this example we have the `Patient` data type. The `Patient` can be diagnosed as healthy or with cancer. For Patients with cancer we also want to collect info about their cancer type, and any cancers in their family history.

Data Model:

-| Attribute | DependsOn | Valid Values | Required | columnType |
-|---|---|---|---|---|
-| Patient | "Diagnosis" | | | |
-| Diagnosis | | "Healthy, Cancer" | True | string |
-| Cancer | "Cancer Type, Family History" | | | |
-| Cancer Type | | "Brain, Lung, Skin" | True | string |
-| Family History | | "Brain, Lung, Skin" | True | string_list |
+| Attribute      | DependsOn                     | Valid Values        | Required | columnType  | Parent   |
+|----------------|-------------------------------|---------------------|----------|-------------|----------|
+| Patient        | "Diagnosis"                   |                     |          |             | DataType |
+| Diagnosis      |                               | "Healthy, Cancer"   | True     | string      |          |
+| Cancer         | "Cancer Type, Family History" |                     |          |             |          |
+| Cancer Type    |                               | "Brain, Lung, Skin" | True     | string      |          |
+| Family History |                               | "Brain, Lung, Skin" | True     | string_list |          |

To demonstrate this, see the above example with the `Patient` and `Cancer` data types:

 - `Diagnosis` is an attribute of `Patient`.
 - `Diagnosis` has `Valid Values` of `Healthy` and `Cancer`.
 - `Cancer` is also a data type.
 - `Cancer Type` and `Family History` are attributes of `Cancer` and are both required.
+- `Patient` is a data type, but `Cancer` is not, as defined by the `Parent` column.
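
Editorial sketch (not part of the PR): one way to read the conditional logic in the table above is to follow each valid value that is also an `Attribute` row with its own `DependsOn` list. The helper names below are hypothetical, and this walk is an interpretation of the data model, not the algorithm the library uses to build the JSON Schema.

```python
import csv
import io

# The example data model from the table above, as CSV text.
MODEL = """Attribute,DependsOn,Valid Values,Required,columnType,Parent
Patient,"Diagnosis",,,,DataType
Diagnosis,,"Healthy, Cancer",True,string,
Cancer,"Cancer Type, Family History",,,,
Cancer Type,,"Brain, Lung, Skin",True,string,
Family History,,"Brain, Lung, Skin",True,string_list,
"""


def split_list(cell):
    """Split a comma-separated cell into trimmed, non-empty items."""
    return [item.strip() for item in (cell or "").split(",") if item.strip()]


def conditional_attributes(csv_text, data_type):
    """Map (attribute, value) pairs to the attributes that value brings in."""
    rows = {row["Attribute"]: row for row in csv.DictReader(io.StringIO(csv_text))}
    result = {}
    for attribute in split_list(rows[data_type]["DependsOn"]):
        for value in split_list(rows[attribute]["Valid Values"]):
            value_row = rows.get(value)
            # A valid value triggers a condition only if it has its own
            # row with a non-empty DependsOn list.
            if value_row is not None and split_list(value_row["DependsOn"]):
                result[(attribute, value)] = split_list(value_row["DependsOn"])
    return result


print(conditional_attributes(MODEL, "Patient"))
# {('Diagnosis', 'Cancer'): ['Cancer Type', 'Family History']}
```

`Healthy` has no row of its own, so it adds no conditional attributes; `Cancer` does, so choosing it for `Diagnosis` brings in `Cancer Type` and `Family History`.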

As a result of the above data model, in the JSON Schema:

6 changes: 3 additions & 3 deletions synapseclient/extensions/curator/schema_generation.py
@@ -2955,7 +2955,7 @@ def define_data_model_relationships(self) -> dict:
                 "edge_dir": "out",
                 "type": list,
                 "edge_rel": True,
-                "required_header": True,
+                "required_header": False,
             },
             "required": {
                 "jsonld_key": "sms:required",
@@ -3004,7 +3004,7 @@ def define_data_model_relationships(self) -> dict:
                 "edge_dir": "in",
                 "type": list,
                 "edge_rel": True,
-                "required_header": True,
+                "required_header": False,
             },
             "isPartOf": {
                 "jsonld_key": "schema:isPartOf",
@@ -3023,7 +3023,7 @@ def define_data_model_relationships(self) -> dict:
                 "node_label": "uri",
                 "type": str,
                 "edge_rel": False,
-                "required_header": True,
+                "required_header": False,
                 "node_attr_dict": {
                     "default": get_label_from_display_name,
                     "standard": get_label_from_display_name,
@@ -0,0 +1,3 @@
+Attribute,Description,Valid Values,DependsOn,Required,Parent,Validation Rules
+datatype,,,attribute,,DataType,
+attribute,,,,TRUE,DataProperty,
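
Editorial sketch (not part of the PR): a minimal model like the fixture above can also be produced programmatically. The function name `minimal_model_csv` is hypothetical; the column order and the `DataProperty` value are copied from the fixture, and per the documentation change only the `DataType` value in `Parent` is currently meaningful.

```python
import csv
import io

# Column order copied from the fixture's header row.
HEADERS = [
    "Attribute",
    "Description",
    "Valid Values",
    "DependsOn",
    "Required",
    "Parent",
    "Validation Rules",
]


def minimal_model_csv(data_type: str, attributes: list[str]) -> str:
    """Build a minimal CSV data model: one data type plus its attributes."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=HEADERS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    # The data type row lists its attributes in DependsOn (comma-separated).
    writer.writerow(
        {
            "Attribute": data_type,
            "DependsOn": ", ".join(attributes),
            "Parent": "DataType",
        }
    )
    for name in attributes:
        writer.writerow(
            {"Attribute": name, "Required": "TRUE", "Parent": "DataProperty"}
        )
    return buffer.getvalue()


print(minimal_model_csv("datatype", ["attribute"]))
```

With more than one attribute, the `csv` module quotes the `DependsOn` cell automatically because it contains commas.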
37 changes: 37 additions & 0 deletions tests/unit/synapseclient/extensions/unit_test_curator.py
@@ -1947,6 +1947,11 @@ def setUp(self):
             "schema_files",
             "data_models/example.model.csv",
         )
+        self.minimal_test_schema_path = os.path.join(
+            os.path.dirname(__file__),
+            "schema_files",
+            "data_models/minimal_model.csv",
+        )

     def test_generate_jsonschema_from_csv(self):
         """Test generate_jsonschema from CSV file."""
@@ -1980,6 +1985,38 @@ def test_generate_jsonschema_from_csv(self):
         finally:
             shutil.rmtree(temp_dir)

+    def test_generate_jsonschema_from_minimal_csv(self):
+        """Test generate_jsonschema from a minimal CSV file."""
+        # GIVEN a CSV schema file
+        temp_dir = tempfile.mkdtemp()
+        try:
+            # WHEN I generate JSON schemas
+            schemas, file_paths = generate_jsonschema(
+                data_model_source=self.minimal_test_schema_path,
+                output=temp_dir,
+                data_types=None,
+                data_model_labels="class_label",
+                synapse_client=self.syn,
+            )
+
+            # THEN schemas should be generated
+            assert isinstance(schemas, list)
+            assert len(schemas) > 0
+            assert isinstance(file_paths, list)
+            assert len(file_paths) == len(schemas)
+
+            # AND files should exist
+            for file_path in file_paths:
+                assert os.path.exists(file_path), f"Expected file at {file_path}"
+
+            # AND each schema should be valid JSON Schema
+            for schema in schemas:
+                assert isinstance(schema, dict)
+                assert "$schema" in schema
+                assert "properties" in schema
+        finally:
+            shutil.rmtree(temp_dir)
+
     def test_generate_jsonschema_from_jsonld(self):
         """Test generate_jsonschema from JSONLD file."""
         # GIVEN a JSONLD file (first generate it from CSV)
@@ -39,12 +39,9 @@ def test_define_required_csv_headers(self, dmr: DataModelRelationships):
             "Description",
             "Valid Values",
             "DependsOn",
-            "DependsOn Component",
             "Required",
             "Parent",
             "Validation Rules",
-            "Properties",
-            "Source",
         ]

     @pytest.mark.parametrize("edge", [True, False], ids=["True", "False"])