68 changes: 34 additions & 34 deletions docs/explanations/curator_data_model.md
@@ -55,13 +55,13 @@ These columns must be present in your CSV data model:
 - [Description](#description)
 - [Valid Values](#valid-values)
 - [Required](#required)
-- [Parent](#parent)
 - [Validation Rules](#validation-rules)
+- [IsTemplate](#istemplate)

 Defining data types:

 - Put a unique data type name in the `Attribute` column.
-- Put the value `DataType` in the `Parent` column.
@thomasyu888 (Member), Jan 23, 2026:
From a Product perspective, I just wanted to confirm that having the "Parent" column will not cause this code to fail. Is this correct?

@andrewelamb (Contributor Author), Jan 23, 2026:
@thomasyu888 I realized I hadn't updated the documentation, this is fixed now :)

Member:
Thanks @andrewelamb, the question remains, will having the Parent column be an issue at all?

@andrewelamb (Contributor Author), Jan 23, 2026:
Parent is now an optional column. In fact, technically, anything is an optional column. We have it that way so that data modelers can add whatever optional columns they wish, and the Curator Extension will just ignore them.

+- Put the value `True` in the `IsTemplate` column.
 - List at least one attribute in the `DependsOn` column (comma-separated).
 - Optionally add a description to the `Description` column.
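
The steps in this list can be sketched as a small in-memory CSV. This is only an illustration using Python's standard `csv` module: the sample rows and the helper `find_data_types` are assumptions made for this example and are not part of the Curator Extension. The case-insensitive match on `true` mirrors the string handling in `parse_is_template` further down in this PR.

```python
import csv
import io

# Hypothetical data model: one data type (Patient) and two plain attributes.
CSV_MODEL = '''Attribute,DependsOn,Description,IsTemplate
Patient,"Gender, Hobbies",A person in the study,True
Gender,,,
Hobbies,,,
'''


def find_data_types(csv_text: str) -> list[str]:
    """Return the names of rows whose IsTemplate cell reads 'true' (any case)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["Attribute"]
        for row in reader
        if (row.get("IsTemplate") or "").strip().lower() == "true"
    ]


print(find_data_types(CSV_MODEL))
```

Only the `Patient` row is reported, because it is the only row with `True` in the `IsTemplate` column.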

@@ -142,9 +142,9 @@ JSON Schema output:
}
```

-### Parent
+### IsTemplate

-Put the value `DataType` in this column if this row is a data type. Other values are currently ignored. It is currently used to find all the data types in the data model.
+Put the value `True` in this column if this row is a data type (template). It is currently used to find all the data types in the data model.
Contributor:
@andrewelamb what's the behavior if both Parent and IsTemplate columns are present?

Contributor Author:
For creating JSON Schema, the Parent column will now be ignored. (It can still be used for other purposes, like the visualization tool, which was its original intended purpose from what I understand).

Contributor:

Got it, thanks!


### columnType

@@ -164,11 +164,11 @@ Must be one of:

 Data Model:

-| Attribute | DependsOn         | columnType  | Parent   |
-|-----------|-------------------|-------------|----------|
-| Patient   | "Gender, Hobbies" |             | DataType |
-| Gender    |                   | string      |          |
-| Hobbies   |                   | string_list |          |
+| Attribute | DependsOn         | columnType  | IsTemplate |
+|-----------|-------------------|-------------|------------|
+| Patient   | "Gender, Hobbies" |             | True       |
+| Gender    |                   | string      |            |
+| Hobbies   |                   | string_list |            |

 JSON Schema output:

@@ -213,11 +213,11 @@ The format of this attribute. See [format](https://json-schema.org/understanding

 Data Model:

-| Attribute       | DependsOn            | columnType  | Format | Parent   |
-|-----------------|----------------------|-------------|--------|----------|
-| Patient         | "Gender, Birth Date" |             |        | DataType |
-| Gender          |                      | string      |        |          |
-| Birth Date      |                      | string      | date   |          |
+| Attribute       | DependsOn            | columnType  | Format | IsTemplate |
+|-----------------|----------------------|-------------|--------|------------|
+| Patient         | "Gender, Birth Date" |             |        | True       |
+| Gender          |                      | string      |        |            |
+| Birth Date      |                      | string      | date   |            |

 JSON Schema output:

@@ -246,11 +246,11 @@ The regex pattern this attribute must match. The type of this attribute must be

 Data Model:

-| Attribute | DependsOn     | columnType  | Pattern | Parent   |
-|-----------|---------------|-------------|---------|----------|
-| Patient   | "Gender, ID"  |             |         | DataType |
-| Gender    |               | string      |         |          |
-| ID        |               | string      | [a-f]   |          |
+| Attribute | DependsOn     | columnType  | Pattern | IsTemplate |
+|-----------|---------------|-------------|---------|------------|
+| Patient   | "Gender, ID"  |             |         | True       |
+| Gender    |               | string      |         |            |
+| ID        |               | string      | [a-f]   |            |

 JSON Schema output:

@@ -279,12 +279,12 @@ The range that this attribute's numeric values must fall within. The type of thi

 Data Model:

-| Attribute    | DependsOn                   | columnType  | Minimum | Maximum | Parent   |
-|--------------|-----------------------------|-------------|---------|---------|----------|
-| Patient      | "Age, Weight, Health Score" |             |         |         | DataType |
-| Age          |                             | integer     | 0       | 120     |          |
-| Weight       |                             | number      | 0.0     |         |          |
-| Health Score |                             | number      | 0.0     | 1.0     |          |
+| Attribute    | DependsOn                   | columnType  | Minimum | Maximum | IsTemplate |
+|--------------|-----------------------------|-------------|---------|---------|------------|
+| Patient      | "Age, Weight, Health Score" |             |         |         | True       |
+| Age          |                             | integer     | 0       | 120     |            |
+| Weight       |                             | number      | 0.0     |         |            |
+| Health Score |                             | number      | 0.0     | 1.0     |            |

 JSON Schema output:

@@ -334,23 +334,23 @@ If you have an existing data model using any of the following validation rules,

 ## Conditional dependencies

-The `DependsOn`, `Valid Values` and `Parent` columns can be used together to flexibly define conditional logic for determining the relevant attributes for a data type.
+The `DependsOn`, `Valid Values` and `IsTemplate` columns can be used together to flexibly define conditional logic for determining the relevant attributes for a data type.

 In this example we have the `Patient` data type. The `Patient` can be diagnosed as healthy or with cancer. For Patients with cancer we also want to collect info about their cancer type, and any cancers in their family history.

 Data Model:

-| Attribute      | DependsOn                     | Valid Values        | Required | columnType  | Parent   |
-|----------------|-------------------------------|---------------------|----------|-------------|----------|
-| Patient        | "Diagnosis"                   |                     |          |             | DataType |
-| Diagnosis      |                               | "Healthy, Cancer"   | True     | string      |          |
-| Cancer         | "Cancer Type, Family History" |                     |          |             |          |
-| Cancer Type    |                               | "Brain, Lung, Skin" | True     | string      |          |
-| Family History |                               | "Brain, Lung, Skin" | True     | string_list |          |
+| Attribute      | DependsOn                     | Valid Values        | Required | columnType  | IsTemplate |
+|----------------|-------------------------------|---------------------|----------|-------------|------------|
+| Patient        | "Diagnosis"                   |                     |          |             | True       |
+| Diagnosis      |                               | "Healthy, Cancer"   | True     | string      |            |
+| Cancer         | "Cancer Type, Family History" |                     |          |             |            |
+| Cancer Type    |                               | "Brain, Lung, Skin" |          | string      |            |
+| Family History |                               | "Brain, Lung, Skin" |          | string_list |            |

 To demonstrate this, see the above example with the `Patient` and `Cancer` data types:

-- `Patient` is a data type, but `Cancer` is not, as defined by the `Parent` column.
+- `Patient` is a data type, but `Cancer` is not, as defined by the `IsTemplate` column.
 - `Diagnosis` is an attribute of `Patient`.
 - `Diagnosis` has `Valid Values` of `Healthy` and `Cancer`.
 - `Cancer` is also a data type.
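
One way to read the table in this hunk is that choosing a valid value which is itself a row with a `DependsOn` entry pulls in that row's attributes. The sketch below walks that chain for a single record. The `MODEL` dictionary and the function `relevant_attributes` are hypothetical and hand-written from the table; the Curator Extension generates JSON Schema from the model rather than calling a helper like this.

```python
# Hypothetical in-memory form of the "Data Model" table (new version).
MODEL = {
    "Patient": {"DependsOn": ["Diagnosis"], "IsTemplate": True},
    "Diagnosis": {"ValidValues": ["Healthy", "Cancer"]},
    "Cancer": {"DependsOn": ["Cancer Type", "Family History"]},
    "Cancer Type": {"ValidValues": ["Brain", "Lung", "Skin"]},
    "Family History": {"ValidValues": ["Brain", "Lung", "Skin"]},
}


def relevant_attributes(model: dict, data_type: str, record: dict) -> list[str]:
    """Collect the attributes that apply to `record`, following DependsOn links.

    A chosen value adds more attributes when that value is itself a row in
    the model with its own DependsOn list.
    """
    result: list[str] = []
    pending = list(model[data_type].get("DependsOn", []))
    while pending:
        attribute = pending.pop(0)
        if attribute in result:
            continue
        result.append(attribute)
        chosen = record.get(attribute)
        if isinstance(chosen, str) and chosen in model:
            pending.extend(model[chosen].get("DependsOn", []))
    return result


print(relevant_attributes(MODEL, "Patient", {"Diagnosis": "Healthy"}))
print(relevant_attributes(MODEL, "Patient", {"Diagnosis": "Cancer"}))
```

With `Diagnosis` set to `Healthy` only `Diagnosis` applies; with `Cancer` the list grows to include `Cancer Type` and `Family History`.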
85 changes: 73 additions & 12 deletions synapseclient/extensions/curator/schema_generation.py
@@ -684,6 +684,10 @@ def gather_csv_attributes_relationships(
                     attr_rel_dictionary[attribute_name]["Relationships"].update(
                         {relationship: parsed_rel_entry}
                     )
+            is_template_dict = self.parse_is_template(attr)
+            attr_rel_dictionary[attribute_name]["Relationships"].update(
+                is_template_dict
+            )
             if model_includes_column_type:
                 column_type_dict = self.parse_column_type(attr)
                 attr_rel_dictionary[attribute_name]["Relationships"].update(
@@ -710,6 +714,7 @@ def gather_csv_attributes_relationships(
                 attr_rel_dictionary[attribute_name]["Relationships"].update(
                     pattern_dict
                 )
+
         return attr_rel_dictionary

     def parse_column_type(self, attr: dict) -> dict:
@@ -851,6 +856,40 @@ def parse_csv_model(

         return model_dict

+    def parse_is_template(self, attribute_dict: dict) -> dict[str, bool]:
+        """Parse the IsTemplate value for a given attribute.
+
+        Args:
+            attribute_dict: The attribute dictionary.
+
+        Returns:
+            dict: A dictionary containing the parsed IsTemplate value.
+
+        Raises:
+            ValueError: If the IsTemplate value is not a boolean.
+        """
+        from pandas import isna
+
+        is_template_value = attribute_dict.get("IsTemplate")
+
+        if isna(is_template_value):
+            template_value = False
+        elif isinstance(is_template_value, str):
+            if is_template_value.lower() == "true":
+                template_value = True
+            else:
+                template_value = False
+        else:
+            try:
+                template_value = bool(is_template_value)
+            except ValueError as exception:
+                raise ValueError(
+                    f"The IsTemplate value: {is_template_value} is not boolean, "
+                    "please correct this value in the data model."
+                ) from exception
+
+        return {"IsTemplate": template_value}
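
To make the branching in `parse_is_template` easier to check by hand, here is a standalone sketch of the same decision logic. It is not the method itself: `is_missing` is a hypothetical stand-in for `pandas.isna` that only covers `None` and float NaN, so the example runs without pandas, and the `ValueError` wrapping is left out.

```python
import math
from typing import Any


def is_missing(value: Any) -> bool:
    """Assumed stand-in for pandas.isna on scalars: None and float NaN only."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def parse_is_template(attribute_dict: dict) -> dict[str, bool]:
    """Standalone rendering of the three branches in the method above."""
    value = attribute_dict.get("IsTemplate")
    if is_missing(value):
        return {"IsTemplate": False}
    if isinstance(value, str):
        # Only the string "true" (any case) counts; "yes" or " true " do not.
        return {"IsTemplate": value.lower() == "true"}
    return {"IsTemplate": bool(value)}


for cell in [True, "TRUE", "yes", " true ", float("nan"), None, 1]:
    print(repr(cell), "->", parse_is_template({"IsTemplate": cell})["IsTemplate"])
```

Two consequences are worth noticing. String cells are not stripped, so `" true "` parses as `False`. For plain Python scalars `bool()` does not raise, so the `ValueError` branch in the method would only be reached for objects whose truth value is ambiguous, such as a multi-element NumPy array.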
+

 class DataModelJSONLDParser:
     """DataModelJSONLDParser"""
@@ -1118,7 +1157,6 @@ def gather_jsonld_attributes_relationships(self, model_jsonld: list[dict]) -> di
                         attr_rel_dictionary[attr_key]["Relationships"].update(
                             {rel_csv_header: parsed_rel_entry}
                         )
-
                     elif (
                         rel_vals["jsonld_key"] in entry.keys()
                         and not rel_vals["csv_header"]
@@ -1935,6 +1973,22 @@ def _get_node_label(
             return self.get_node_label(node_display_name)
         raise ValueError("Either 'node_label' or 'node_display_name' must be provided.")

+    def get_node_is_template(
+        self, node_label: Optional[str] = None, node_display_name: Optional[str] = None
+    ) -> bool:
+        """Check if a given node is a template or not
+
+        Args:
+            node_label: Label of the node for which you need to look up.
+            node_display_name: Display name of the node for which you want to look up.
+        Returns:
+            True: If the given node is a template
+        """
+        node_label = self._get_node_label(node_label, node_display_name)
+        rel_node_label = self.dmr.get_relationship_value("IsTemplate", "node_label")
+        node_is_template = self.graph.nodes[node_label][rel_node_label]
+        return node_is_template
+

 @dataclass_json
 @dataclass
@@ -2048,7 +2102,6 @@ def __init__(self, graph: MULTI_GRAPH_TYPE, logger: Logger, output_path: str = "

         class_template = ClassTemplate()
         self.class_template = json.loads(class_template.to_json())
-        self.logger = logger

     def get_edges_associated_with_node(
         self, node: str
@@ -2279,15 +2332,15 @@ def add_contexts_to_entries(self, template: dict) -> dict:
             if rel_key:
                 rel_key = rel_key[0]
                 # If the current relationship can be defined with a 'node_attr_dict'
-                if "node_attr_dict" in self.rel_dict[rel_key].keys():
+                if "node_attr_dict" in self.rel_dict[rel_key]:
                     try:
                         # if possible pull standard function to get node information
                         rel_func = self.rel_dict[rel_key]["node_attr_dict"]["standard"]
                     except Exception:  # pylint:disable=bare-except
                         # if not pull default function to get node information
                         rel_func = self.rel_dict[rel_key]["node_attr_dict"]["default"]

-                    # Add appropritae contexts that have been removed in previous steps
+                    # Add appropriate contexts that have been removed in previous steps
                     # (for JSONLD) or did not exist to begin with (csv)
                     if (
                         rel_key == "id"
@@ -2296,7 +2349,7 @@ def add_contexts_to_entries(self, template: dict) -> dict:
                     ):
                         template[jsonld_key] = "bts:" + template[jsonld_key]
                     elif (
-                        rel_key == "required"
+                        self.rel_dict[rel_key].get("type") == bool
                         and rel_func == convert_bool_to_str
                         and "sms" not in str(template[jsonld_key]).lower()
                     ):
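
The effect of swapping `rel_key == "required"` for a type check can be seen by evaluating both conditions side by side. The sketch below is an assumption-laden reduction: `REL_DICT` keeps only the `type` key of three relationships, `convert_bool_to_str` is a hypothetical stand-in for the real helper, and the body of the branch (which is not visible in this hunk) is not modelled at all.

```python
def convert_bool_to_str(value: bool) -> str:
    """Hypothetical stand-in for the helper named in the diff."""
    return str(value)


# Trimmed-down relationship table; only the key the condition reads is kept.
REL_DICT = {
    "required": {"type": bool},
    "IsTemplate": {"type": bool},
    "subClassOf": {"type": list},
}


def old_condition(rel_key: str, rel_func, value) -> bool:
    """Condition before this PR: only the 'required' relationship matches."""
    return (
        rel_key == "required"
        and rel_func == convert_bool_to_str
        and "sms" not in str(value).lower()
    )


def new_condition(rel_key: str, rel_func, value) -> bool:
    """Condition after this PR: any bool-typed relationship matches."""
    return (
        REL_DICT[rel_key].get("type") == bool
        and rel_func == convert_bool_to_str
        and "sms" not in str(value).lower()
    )


for key in ("required", "IsTemplate"):
    print(
        key,
        old_condition(key, convert_bool_to_str, "True"),
        new_condition(key, convert_bool_to_str, "True"),
    )
```

Under these assumptions `IsTemplate` is skipped by the old condition and matched by the new one, while `required` is matched by both, which is why the new relationship needs no special case of its own.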
@@ -2971,6 +3024,19 @@ def define_data_model_relationships(self) -> dict:
                     "standard": convert_bool_to_str,
                 },
             },
+            "IsTemplate": {
+                "jsonld_key": "sms:IsTemplate",
+                "csv_header": "IsTemplate",
+                "node_label": "IsTemplate",
+                "type": bool,
+                "jsonld_default": "sms:false",
+                "required_header": False,
+                "edge_rel": False,
+                "node_attr_dict": {
+                    "default": False,
+                    "standard": convert_bool_to_str,
+                },
+            },
             "subClassOf": {
                 "jsonld_key": "rdfs:subClassOf",
                 "csv_header": "Parent",
@@ -2980,7 +3046,7 @@ def define_data_model_relationships(self) -> dict:
                 "jsonld_default": [{"@id": "bts:Thing"}],
                 "type": list,
                 "edge_rel": True,
-                "required_header": True,
+                "required_header": False,
             },
             "validationRules": {
                 "jsonld_key": "sms:validationRules",
@@ -5630,12 +5696,7 @@ def generate_jsonschema(
     # Gets all data types if none are specified
     if data_types is None or len(data_types) == 0:
         data_types = [
-            dmge.get_node_label(node[0])
-            for node in [
-                (k, v)
-                for k, v in parsed_data_model.items()
-                if v["Relationships"].get("Parent") == ["DataType"]
-            ]
+            node for node in dmge.find_classes() if dmge.get_node_is_template(node)
         ]

     if len(data_types) != 1 and output is not None and output.endswith(".json"):
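
The change in how default data types are chosen can be illustrated on a plain dictionary. This is a simplified sketch, not the code path above: the real implementation queries the graph through `dmge.find_classes()` and `dmge.get_node_is_template()`, whereas here `PARSED_MODEL`, the `Biospecimen` row, and both selector functions are hypothetical, and display names are used directly instead of node labels.

```python
# Hypothetical parsed model: Patient carries both columns, Biospecimen only
# the new IsTemplate column, and Gender is a plain attribute.
PARSED_MODEL = {
    "Patient": {"Relationships": {"Parent": ["DataType"], "IsTemplate": True}},
    "Biospecimen": {"Relationships": {"IsTemplate": True}},
    "Gender": {"Relationships": {"IsTemplate": False}},
}


def select_by_parent(parsed: dict) -> list[str]:
    """Old rule: a row is a data type when its Parent is exactly ['DataType']."""
    return [
        name
        for name, entry in parsed.items()
        if entry["Relationships"].get("Parent") == ["DataType"]
    ]


def select_by_is_template(parsed: dict) -> list[str]:
    """New rule: a row is a data type when its IsTemplate value is truthy."""
    return [
        name
        for name, entry in parsed.items()
        if entry["Relationships"].get("IsTemplate", False)
    ]


print(select_by_parent(PARSED_MODEL))
print(select_by_is_template(PARSED_MODEL))
```

The old rule finds only `Patient`; the new rule finds `Patient` and `Biospecimen` and no longer reads `Parent`, which matches the review answer that the `Parent` column is ignored when generating JSON Schema.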