diff --git a/LICENSE b/LICENSE
index 32686c9a9..a7e3d5b23 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
-Copyright (c) 2021 Sage Bionetworks
+Copyright (c) 2025 Sage Bionetworks
 Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index 91de04f5a..850164a57 100644
--- a/README.md
+++ b/README.md
@@ -570,12 +570,23 @@ For internal developers with access to SigNoz Cloud, you can obtain an ingestion
 # Contributors
-Main contributors and developers:
+Sage main contributors and developers:
+
+- [Gianna Jordan](https://github.com/giajordan)
+- [Lingling Peng](https://github.com/linglp)
+- [Bryan Fauble](https://github.com/BryanFauble)
+- [Andrew Lamb](https://github.com/andrewelamb)
+- [Brad Macdonald](https://github.com/BWMac)
 - [Milen Nikolov](https://github.com/milen-sage)
+
+## Alumni
 - [Mialy DeFelice](https://github.com/mialy-defelice)
 - [Sujay Patil](https://github.com/sujaypatil96)
 - [Bruno Grande](https://github.com/BrunoGrandePhD)
-- [Robert Allaway](https://github.com/allaway)
-- [Gianna Jordan](https://github.com/giajordan)
-- [Lingling Peng](https://github.com/linglp)
+- [Jason Hwee](https://github.com/hweej)
+- [Xengie Doan](https://github.com/xdoan)
+- [James Eddy](https://github.com/jaeddy)
+- [Yooree Chae](https://github.com/ychae)
+
+See all [contributors](https://github.com/Sage-Bionetworks/schematic/graphs/contributors)
diff --git a/docs/source/asset_store.rst b/docs/source/asset_store.rst
new file mode 100644
index 000000000..80cbd008b
--- /dev/null
+++ b/docs/source/asset_store.rst
@@ -0,0 +1,138 @@
Setting up your asset store
===========================

.. note::

   You can ignore this section if you are just trying to contribute manifests.

This document covers the minimal recommended elements needed in Synapse to interface with the Data Curator App (DCA) and provides options for Synapse project layout.

There are two options for setting up a DCC Synapse project:

1. **Distributed Projects**: Each team of DCC contributors has its own Synapse project that stores the team's datasets.
2. **Single Project**: All DCC datasets are stored in the same Synapse project.

In each of these project setups, there are two ways you can lay out your data:

1. **Flat Data Layout**: All top level folders are structured directly under the project

   .. code-block:: shell

       my_flat_project
       ├── biospecimen
       └── clinical

2. **Hierarchical Data Layout**: Top level folders are stored within nested folders annotated with ``contentType: dataset``

   .. note::

      This requires you to add the column ``contentType`` to your fileview schema.

   .. code-block:: shell

       my_hierarchical_project
       ├── biospecimen
       │   ├── experiment_1 <- annotated
       │   └── experiment_2 <- annotated
       └── clinical
           ├── batch_1 <- annotated
           └── batch_2 <- annotated


Option 1: Distributed Synapse Projects
--------------------------------------

Pick **option 1** if you answer "yes" to one or more of the following questions:

- Does the DCC have multiple contributing institutions/labs, each with different data governance and access controls?
- Does the DCC have multiple institutions with limited cross-institutional sharing?
- Will contributors submit more than 100 datasets per release or per month?
- Would you prefer not to annotate each DCC dataset folder with the annotation ``contentType: dataset``? (See the example after this list for what that annotation step involves.)
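For context on that last question, annotating a top level folder is a one-line operation per folder. A minimal sketch using the Synapse command line client, where ``syn123`` is a placeholder for a folder's Synapse ID:

.. code-block:: shell

   synapse set-annotations --id syn123 --annotations '{"contentType": "dataset"}'
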
Access & Project Setup - Multiple Contributing Projects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Create a DCC Admin Team with admin permissions.
2. Create a Team for each data contributing institution. Begin with a "Test Team" if all teams are not yet identified.
3. Create a Synapse Project for each institution and grant the respective team **Edit** level access.

   - E.g., for institutions A, B, and C, create Projects A, B, and C with Teams A, B, and C. Team A has **Edit** access to Project A, etc.

4. Within each project, create "top level folders" in the **Files** tab for each dataset type.
5. Create another Synapse Project (e.g., MyDCC) containing the main **Fileview** that includes all the DCC projects in its scope.

   - Ensure all teams have **Download** level access to this file view.
   - Include both file and folder entities and add **ALL default columns**.

.. note::

   If you want to upload data according to the hierarchical data layout, you can still use
   distributed projects; just add the ``contentType`` column to your fileview, and you will have
   to annotate your top level folders with ``contentType: dataset``.


Option 2: Single Synapse Project
--------------------------------

Pick **option 2** if you don't select option 1 and you answer "yes" to any of these questions:

- Does the DCC have a project with pre-existing datasets in a complex folder hierarchy?
- Does the DCC envision collaboration on the same dataset collection across multiple teams with shared access controls?
- Are you willing to set up local access control for each dataset folder and annotate each with ``contentType: dataset``?

If neither option fits, select option 1.


Access & Project Setup - Single Contributing Project
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Create a Team for each data contributing institution.
2. Create a single Synapse Project (e.g., MyDCC).
3. Within this project, create dataset folders for each contributor. Organize them as needed.

   - Annotate each top level folder with ``contentType: dataset``. Top level folders must not nest inside other dataset folders and must have unique names.
     Taking the above example, you cannot have something like this:

     .. code-block:: shell

         my_hierarchical_project
         ├── biospecimen
         │   ├── experiment_1 <- annotated
         │   └── experiment_2 <- annotated
         └── clinical
             ├── experiment_1 <- this is not allowed, because experiment_1 is duplicated
             └── batch_2 <- annotated

4. In MyDCC, create the main **DCC Fileview** with `MyDCC` as the scope. Add the column ``contentType`` to the schema and grant teams **Download** level access.

   - Ensure all teams have **Download** level access to this file view.
   - Add both file and folder entities and add **ALL default columns**.

.. note::

   You can technically use the flat data layout with a single project setup, but it is not recommended:
   if different data contributors contribute similar data types, it leads to a proliferation of folders
   per contributor and data type.

Synapse External Cloud Buckets Setup
------------------------------------

If DCC contributors require external cloud buckets, select one of the following configurations. For more information on how to
set this up on Synapse, view this documentation: https://help.synapse.org/docs/Custom-Storage-Locations.2048327803.html

1. **Basic External Storage Bucket (Default)**:

   - Create an S3 bucket for Synapse uploads via web or CLI.
     Contributors will upload data without needing AWS credentials.
   - Provision an S3 bucket, attach it to the Synapse project, and create folders for specific assay types.

2. **Custom Storage Location**:

   This is an advanced setup for users who do not want to upload files directly via the Synapse API, but rather
   create pointers to the data.

   - For large datasets or if contributors prefer cloud storage, enable uploads via AWS CLI or GCP CLI.
   - Configure the custom storage location with an AWS Lambda or Google Cloud function for syncing.
   - If using AWS, provision a bucket, set up Lambda sync, and assign IAM write access.
   - For GCP, use Google Cloud function sync and obtain contributor emails for access.

Finally, set up a `synapse-service-lambda` account for syncing external cloud buckets with Synapse, granting "Edit & Delete" permissions on the contributor's project.
diff --git a/docs/source/cli_reference.rst b/docs/source/cli_reference.rst
index a2bd78cd2..83cda3e2c 100644
--- a/docs/source/cli_reference.rst
+++ b/docs/source/cli_reference.rst
@@ -2,6 +2,45 @@
 CLI Reference
 =============

When using this tool, the ``-d`` flag refers to the Synapse ID of a folder found under the **Files** tab
that contains a manifest and data; this is a "Top Level Folder". It is not required to provide a ``dataset_id``,
but if you are trying to pull existing annotations by using the ``-a`` flag and the manifest is file-based, then you
need to provide a ``dataset_id``.


Generate a new manifest as a Google Sheet
-----------------------------------------


.. code-block:: shell

   schematic manifest -c /path/to/config.yml get -dt <data_type> -s

Generate an existing manifest from Synapse
------------------------------------------

.. code-block:: shell

   schematic manifest -c /path/to/config.yml get -dt <data_type> -d <dataset_id> -s

Validate a manifest
-------------------

.. code-block:: shell

   schematic model -c /path/to/config.yml validate -dt <data_type> -mp <manifest_path>

Submit a manifest as a file
---------------------------

.. code-block:: shell

   schematic model -c /path/to/config.yml submit -mp <manifest_path> -d <dataset_id> -vc <data_type> -mrt file_only


In depth guide
--------------

.. click:: schematic.__main__:main
   :prog: schematic
   :nested: full
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 677de60a5..5749c5f45 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -13,6 +13,8 @@
 import os
 import sys
+
+import sphinx_rtd_theme

 file_dir = os.path.dirname(__file__)
 sys.path.append(file_dir)
 import pathlib
@@ -27,7 +29,7 @@
 toml_metadata = _parse_toml(toml_file_path)
 project = toml_metadata["name"]
-copyright = "2022, Sage Bionetworks"
+copyright = "2024, Sage Bionetworks"
 author = toml_metadata["authors"]
@@ -40,7 +42,7 @@
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ["sphinx_click"]
+extensions = ["sphinx_click", "sphinx_rtd_theme"]

 # Add any paths that contain templates here, relative to this directory.
 templates_path = ["_templates"]
@@ -57,15 +59,21 @@
 # This pattern also affects html_static_path and html_extra_path.
 exclude_patterns = []

+# The master toctree document.
+master_doc = "index"

 # -- Options for HTML output -------------------------------------------------

 # The theme to use for HTML and HTML Help pages. See the documentation for
 # a list of builtin themes.
 #
-html_theme = "alabaster"
+html_theme = "sphinx_rtd_theme"

 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
 # so a file named "default.css" will overwrite the builtin "default.css".
 html_static_path = ["_static"]
+
+html_theme_options = {
+    "collapse_navigation": False,
+}
diff --git a/docs/source/configuration.rst b/docs/source/configuration.rst
new file mode 100644
index 000000000..f8d458dbf
--- /dev/null
+++ b/docs/source/configuration.rst
@@ -0,0 +1,86 @@
.. _configuration:

Configure Schematic
===================

This is an example config for Schematic. All listed values are the defaults used when a config is not provided. Remove any fields in the config you don't want to change;
if you remove all fields from a section, remove the entire section, including its header.
Change the values of any fields you do want to change. Please view the installation section for details on how to set some of this up.

.. code-block:: yaml

   # This describes where assets such as manifests are stored
   asset_store:
     # This is when assets are stored in a synapse project
     synapse:
       # Synapse ID of the file view listing all project data assets.
       master_fileview_id: "syn23643253"
       # Path to the synapse config file, either absolute or relative to this file
       config: ".synapseConfig"
       # Base name that manifest files will be saved as
       manifest_basename: "synapse_storage_manifest"

   # This describes information about manifests as it relates to generation and validation
   manifest:
     # Location where manifests will be saved to
     manifest_folder: "manifests"
     # Title or title prefix given to generated manifest(s)
     title: "example"
     # Data types of manifests to be generated or data type (singular) to validate manifest against
     data_type:
       - "Biospecimen"
       - "Patient"

   # Describes the location of your schema
   model:
     # Location of your schema jsonld, it must be a path relative to this file or absolute
     location: "tests/data/example.model.jsonld"

   # This section is for using google sheets with Schematic
   google_sheets:
     # Path to the google service account creds, either absolute or relative to this file
     service_acct_creds: "schematic_service_account_creds.json"
     # When doing google sheet validation (regex match) with the validation rules.
     # true is alerting the user and not allowing entry of bad values.
     # false is warning but allowing the entry on to the sheet.
     strict_validation: true


This document goes into detail about what each of these configurations means.

Asset Store
-----------

Synapse
~~~~~~~
This describes where assets such as manifests are stored. The configuration of the asset store is described
under the asset store section.

* master_fileview_id: Synapse ID of the file view listing all project data assets.
* config: Path to the synapse config file, either absolute or relative to this file. Note: if you use the `synapse config` command, you will have to provide the full path to the configuration file.
* manifest_basename: Base name that manifest files will be saved as on Synapse. The component name will be appended to it, for example: `synapse_storage_manifest_biospecimen.csv`

Manifest
--------
This describes information about manifests as it relates to generation and validation. Note: some of these configurations can be overridden by CLI commands.

* manifest_folder: Location where manifests will be saved to.
  This can be a relative or absolute path on your local machine.
* title: Title or title prefix given to generated manifest(s). This is used to name the manifest file saved locally.
* data_type: Data types of manifests to be generated or data type (singular) to validate a manifest against. If you want all the available manifests, you can input "all manifests".


Model
-----
Describes the location of your schema.

* location: This is the location of your schema jsonld; it must be a path relative to this file or an absolute path. Currently URLs are NOT supported, so you will have to download the jsonld data model. Here is an example: https://raw.githubusercontent.com/ncihtan/data-models/v24.9.1/HTAN.model.jsonld

Google Sheets
-------------
Schematic leverages the Google API to generate manifests. This section is for using google sheets with Schematic.

* service_acct_creds: Path to the google service account creds, either absolute or relative to this file. This is the path to the service account credentials file that you download from Google Cloud Platform.
* strict_validation: When doing google sheet validation (regex match) with the validation rules:

  * True alerts the user and does not allow entry of bad values.
  * False warns the user but allows the entry onto the sheet.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 2d235a77a..ceee712fb 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -3,10 +3,155 @@
 You can adapt this file completely to your liking, but it should at least
 contain the root `toctree` directive.

.. _index:

Welcome to Schematic's documentation!
=====================================

.. warning::
   This documentation site is a work in progress, and the sublinks may change. Apologies for the inconvenience.


**SCHEMATIC** is an acronym for *Schema Engine for Manifest Ingress and Curation*.
The Python-based infrastructure provides a *novel* schema-based, metadata ingress ecosystem,
which is meant to streamline the process of biomedical dataset annotation, metadata validation,
and submission to a data repository for various data contributors. This tool is recommended to be
used as a command line tool (CLI) or as an API.

Schematic tackles these goals:

- Ensure the highest quality structured data or metadata is contributed to Synapse.
- Provide Excel templates that correspond to a data model and can be filled out by data contributors.
- Visualize and manage data models and their relationships with each other.

.. contents::
   :depth: 2
   :local:

Important Concepts
------------------

.. important::

   Before reading more about Schematic, review this section; it covers essential concepts for using the Schematic tool effectively.

Synapse FileViews
~~~~~~~~~~~~~~~~~
Users are responsible for setting up a **FileView** that integrates with Schematic. Note that FileViews appear under the "Tables" tab in Synapse and can be named according to the project's needs. For instance, a FileView for **Project A** could have a different name than a FileView for **Project B**.

For more information on Synapse projects, visit:

- `Synapse projects `_
- `Synapse annotations `_

Synapse Folders
~~~~~~~~~~~~~~~

Folders in Synapse allow users to organize data within projects.
More details on uploading and organizing data can be found at `Synapse folders `_

Synapse Datasets
~~~~~~~~~~~~~~~~

This is an object in Synapse that appears under the "Dataset" tab and represents a user-defined collection of Synapse files and versions. https://help.synapse.org/docs/Datasets.2611281979.html

JSON-LD
~~~~~~~
JSON-LD is a lightweight Linked Data format. The usage of JSON-LD to capture our data models
extends beyond the creation, validation, and submission of annotations/manifests into Synapse.
It can create relationships between different data models and, in the future, drive
transformation of data from one data model to another. Visualization of these data models
and their relationships is also possible, which allows the community to see the depth of
connections between all the data uploaded into Synapse.

Manifest
~~~~~~~~

A manifest is a structured file that contains metadata about files under a "top level folder".
The metadata includes information about the files, such as their data type.
The manifest can also be used to annotate the data on Synapse and create a file view
that enables the FAIR principles on each of the files in the "top level folder".

Component/Data type
~~~~~~~~~~~~~~~~~~~
The terms "component" and "data type" are used interchangeably. The component/data type is determined from the specified JSON-LD data model.
If the string "component" exists in the ``DependsOn`` column, the "Attribute" value in that row is a data type.
Examples of data types are "Biospecimen" and "Patient": https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv#L3.
Each data type/component has a manifest template with its own set of columns.

Project Data Layout
~~~~~~~~~~~~~~~~~~~

Regardless of data layout, the data in your Synapse Project(s) is uploaded into Synapse Folders to be curated and annotated by schematic.
In both layouts listed below, the project administrators along with the data contributors may have preferences on how the
data is organized. The organization of your data is specified with the "Component / data type" attribute of your data model, and these
act as logical groupings for your data. Schematic has a concept of a ``dataset`` (a parameter for the API/library/CLI), but this means
different things under these two layouts.

* **Hierarchical**: The "dataset" parameter under this data layout is associated with any folder annotated with ``contentType: dataset``.
* **Flat**: The "dataset" parameter under this data layout refers to what are often called "top level folders".

In both of these layouts, these are really just groupings of resources.


Schematic services
------------------

The following are the four main endpoints that assist with the high-level goals outlined above, with additional goals to come.

Manifest Generation
~~~~~~~~~~~~~~~~~~~

Provides a manifest template for users for a particular project or data type. If a project with annotations already exists, a semi-filled-out template can be provided to the user. This ensures they do not start from scratch. If there are no existing annotations and manifests, an empty manifest template is provided.

Manifest Validation
~~~~~~~~~~~~~~~~~~~

Given a filled-out manifest:

- The manifest is validated against the JSON-LD schema as it maps to GX rules.
- A ``jsonschema`` is generated from the data model.
  The data model can be in CSV or JSON-LD format, as input formats are decoupled from the internal data model representation within Schematic.
- A set of validation rules is defined in the data model. Some validation rules are implemented via GX; others are custom Python code. All validation rules have the same interface.
- Certain GX rules require looping through all projects a user has access to, or a specified scope of projects, to find other projects with manifests.
- Validation results are provided before the manifest file is uploaded into Synapse.

Manifest Submission
~~~~~~~~~~~~~~~~~~~

Given a filled-out manifest, this will allow you to submit the manifest to the "top level folder".
This validates the manifest and...

- If the manifest is invalid, error messages will be returned.
- If the manifest is valid:

  - Stores the manifest in Synapse.
  - Uploads the manifest as a Synapse File, Annotations on Files, and/or a Synapse Table.

More validation documentation can be found here: https://sagebionetworks.jira.com/wiki/spaces/SCHEM/pages/3302785036/Schematic+Validation

Data Model Visualization
~~~~~~~~~~~~~~~~~~~~~~~~

These endpoints allow you to visualize your data models and their relationships with each other.


API reference
-------------

For the entire Python API reference documentation, you can visit the docs here: https://sage-bionetworks.github.io/schematic/

.. toctree::
   :maxdepth: 1
   :hidden:

   installation
   asset_store
   configuration
   validation_rules
   manifest_generation
   manifest_validation
   manifest_submission
   tutorials
   troubleshooting
   cli_reference
   linkml
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
new file mode 100644
index 000000000..37254877b
--- /dev/null
+++ b/docs/source/installation.rst
@@ -0,0 +1,310 @@
.. _installation:

Installation
============

Installation Requirements
-------------------------

- Your installed python version must be 3.9.0 ≤ version < 3.11.0
- You need to be a registered and certified user on `synapse.org `_

.. note::
   To create Google Sheets files from Schematic, please follow our credential policy for Google credentials. You can find a detailed tutorial in the `Google Credentials Guide `_.
   If you're using ``config.yml``, make sure to specify the path to ``schematic_service_account_creds.json`` (see the ``google_sheets > service_acct_creds`` section for more information).

Installation Guide For: Users
-----------------------------

The instructions below assume you have already installed `python `_, with the release version meeting the constraints set in the `Installation Requirements`_ section, and do not have a Python environment already active.

1. Verify your python version
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ensure your python version meets the requirements from the `Installation Requirements`_ section using the following command:

.. code-block:: shell

   python3 --version

If your current Python version is not supported by Schematic, you can switch to a supported version using a tool like `pyenv `_. Follow the instructions in the pyenv documentation to install and switch between Python versions easily.

.. note::
   You can double-check the currently supported python versions by opening up the `pyproject.toml `_ file in this repository and finding the supported versions of python listed there.
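For example, with ``pyenv`` installed, pinning a supported interpreter for your working directory might look like the following sketch (the exact patch version is up to you):

.. code-block:: shell

   pyenv install 3.10.14
   pyenv local 3.10.14
   python3 --version  # should now report 3.10.x
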
2. Set up your virtual environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you are working with a python version supported by `schematic`, you will need to activate a virtual environment within which you can install the package. Below we will show how to create your virtual environment either with ``venv`` or with ``conda``.

2a. Set up your virtual environment with ``venv``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python 3 has built-in support for virtual environments with the ``venv`` module, so you no longer need to install ``virtualenv``:

.. code-block:: shell

   python3 -m venv .venv
   source .venv/bin/activate

2b. Set up your virtual environment with ``conda``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``conda`` is a powerful package and environment management tool that allows users to create isolated environments, used particularly in data science and machine learning workflows. If you would like to manage your environments with ``conda``, continue reading:

1. **Download your preferred ``conda`` installer**: Begin by `installing conda `_. We personally recommend working with Miniconda, which is a lightweight installer for ``conda`` that includes only ``conda`` and its dependencies.
2. **Execute the ``conda`` installer**: Once you have downloaded your preferred installer, execute it using ``bash`` or ``zsh``, depending on the shell configured for your terminal environment. For example:

   .. code-block:: shell

      bash Miniconda3-latest-MacOSX-arm64.sh

3. **Verify your ``conda`` setup**: Follow the prompts to complete your setup. Then verify your setup by running the ``conda`` command.
4. **Create your ``schematic`` environment**: Begin by creating a fresh ``conda`` environment for ``schematic`` like so:

   .. code-block:: shell

      conda create --name 'schematicpy' python=3.10

5. **Activate the environment**: Once your environment is set up, you can now activate your new environment with ``conda``:

   .. code-block:: shell

      conda activate schematicpy

3. Install ``schematic`` dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Install the package using `pip `_:

.. code-block:: shell

   python3 -m pip install schematicpy

If you run into ``ERROR: Failed building wheel for numpy``, you may be able to resolve it by upgrading pip:

.. code-block:: shell

   pip3 install --upgrade pip

4. Get your data model as a ``JSON-LD`` schema file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now you need a schema file, e.g. ``model.jsonld``, so that schematic has a data model to work with. While you can download a super basic `example data model `_, you'll probably be working with a DCC-specific data model. For non-Sage employees/contributors using the CLI, you might care only about the minimum needed artifact, which is the ``.jsonld``; locate and download only that from the right repo.

Here are some example repos with schema files:

- https://github.com/ncihtan/data-models/
- https://github.com/nf-osi/nf-metadata-dictionary/

5. Obtain Google credential files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any function that interacts with a Google sheet (such as ``schematic manifest get``) requires Google Cloud credentials.

1. **Option 1**: `Step-by-step `_ guide on how to create these credentials in Google Cloud.

   - Depending on your institution's policies, your institutional Google account may or may not have the required permissions to complete this.
     A possible workaround is to use a personal or temporary Google account.

.. warning::
   At the time of writing, Sage Bionetworks employees do not have the appropriate permissions to create projects with their Sage Bionetworks Google accounts. You would need to follow the instructions using a personal Google account.

2. **Option 2**: Ask your DCC/development team if they have credentials previously set up with a service account.

Once you have obtained credentials, be sure that the json file generated is named in the same way as the ``service_acct_creds`` parameter in your ``config.yml`` file. You will find more context on the ``config.yml`` in the `Set up configuration files`_ section.

.. note::
   Running ``schematic init`` is no longer supported due to security concerns. To obtain ``schematic_service_account_creds.json``, please follow the `instructions `_. Schematic uses Google's API to generate Google sheet templates that users fill in to provide (meta)data. Most Google sheet functionality can be authenticated with a service account. However, more complex Google sheet functionality requires token-based authentication. As browser support for token-based authentication diminishes, we hope to deprecate token-based authentication and keep only service account authentication in the future.

.. note::
   Use the ``schematic_service_account_creds.json`` file for the service account mode of authentication (*for Google services/APIs*). Service accounts are special Google accounts that can be used by applications to access Google APIs programmatically via OAuth2.0, with the advantage being that they do not require human authorization.

.. _Set up configuration files:

6. Set up configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following section will walk through setting up your configuration files with your credentials to allow for communication between ``schematic`` and the Synapse API.

There are two main configuration files that need to be created and modified:

- ``.synapseConfig``
- ``config.yml``

**Create and modify the ``.synapseConfig``**

The ``.synapseConfig`` file is what enables communication between ``schematic`` and the Synapse API using your credentials. You can automatically generate a ``.synapseConfig`` file by running the following in your command line and following the prompts.

.. tip::
   You can generate a new authentication token on the Synapse website by going to ``Account Settings`` > ``Personal Access Tokens``.

.. code-block:: shell

   synapse config

After following the prompts, a new ``.synapseConfig`` file and ``.synapseCache`` folder will be created in your home directory. You can view these hidden assets in your home directory with the following command:

.. code-block:: shell

   ls -a ~

The ``.synapseConfig`` is used to log into Synapse if you are not using an environment variable (i.e. ``SYNAPSE_ACCESS_TOKEN``) for authentication, and the ``.synapseCache`` is where your assets are stored if you are not working with the CLI and/or you have specified ``.synapseCache`` as the location in which to store your manifests in your ``config.yml``.
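For reference, a minimal ``.synapseConfig`` produced this way contains an ``[authentication]`` section holding your token (the token value below is a placeholder):

.. code-block:: text

   [authentication]
   authtoken = <your-personal-access-token>
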
**Create and modify the ``config.yml``**

In this repository there is a ``config_example.yml`` file with default configurations for various components that are required before running ``schematic``, such as the Synapse ID of the main file view containing all your project assets, the base name of your manifest files, etc.

Installation Guide For: Developers
----------------------------------

.. note::
   This section is for people developing on Schematic only.

The instructions below assume you have already installed `python `_, with the release version meeting the constraints set in the `Installation Requirements`_ section, and do not have an environment already active (e.g., with ``pyenv``). For development, we recommend working with versions > python 3.9 to avoid issues with ``pre-commit``'s default hook configuration.

When contributing to this repository, please first discuss the change you wish to make via the `service desk `_ so that we may track these changes.

Once you have finished setting up your development environment using the instructions below, please follow the guidelines in `CONTRIBUTION.md `_ during your development.

Please note we have a `code of conduct `_; please follow it in all your interactions with the project.

1. Clone the ``schematic`` package repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For development, you will be working with the latest version of ``schematic`` on the repository to ensure compatibility between its latest state and your changes. Ensure your current working directory is where you would like to store your local fork before running the following command:

.. code-block:: shell

   git clone https://github.com/Sage-Bionetworks/schematic.git

2. Install ``poetry``
~~~~~~~~~~~~~~~~~~~~~

Install ``poetry`` (version 1.3.0 or later) using either the `official installer `_ or ``pip``. If you have an older installation of Poetry, we recommend uninstalling it first.

.. code-block:: shell

   pip install poetry

Check to make sure your version of poetry is v1.3.0 or later:

.. code-block:: shell

   poetry --version

3. Start the virtual environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Change directory (``cd``) into your cloned ``schematic`` repository, and initialize the virtual environment using the following command with ``poetry``:

.. code-block:: shell

   poetry shell

To make sure your poetry version and python version are consistent with the versions you expect, you can run the following command:

.. code-block:: shell

   poetry debug info

4. Install ``schematic`` dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before you begin, make sure you are in the latest ``develop`` branch of the repository.

The following command will install the dependencies based on what we specify in the ``poetry.lock`` file of this repository (which is generated from the libraries listed in the ``pyproject.toml`` file). If this step is taking a long time, try to go back to Step 2 and check your version of ``poetry``. Alternatively, you can try deleting the lock file and regenerating it by running ``poetry lock`` (Note: this method should be used as a last resort because it may force other developers to change their development environment).

.. code-block:: shell

   poetry install --with dev,doc

This command will install:

- The main dependencies required for running the package.
- Development dependencies for testing, linting, and code formatting.
- Documentation dependencies such as ``sphinx`` for building and maintaining documentation.
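As a quick sanity check that the environment is wired up (not an official setup step), you can invoke the CLI through ``poetry``:

.. code-block:: shell

   poetry run schematic --help
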
5. Set up configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following section will walk through setting up your configuration files with your credentials to allow for communication between ``schematic`` and the Synapse API.

There are two main configuration files that need to be created and modified:

- ``.synapseConfig``
- ``config.yml``

**Create and modify the ``.synapseConfig``**

The ``.synapseConfig`` file is what enables communication between ``schematic`` and the Synapse API using your credentials. You can automatically generate a ``.synapseConfig`` file by running the following in your command line and following the prompts.

.. tip::
   You can generate a new authentication token on the Synapse website by going to ``Account Settings`` > ``Personal Access Tokens``.

.. code-block:: shell

   synapse config

After following the prompts, a new ``.synapseConfig`` file and ``.synapseCache`` folder will be created in your home directory. You can view these hidden assets in your home directory with the following command:

.. code-block:: shell

   ls -a ~

The ``.synapseConfig`` is used to log into Synapse if you are not using an environment variable (i.e., ``SYNAPSE_ACCESS_TOKEN``) for authentication, and the ``.synapseCache`` is where your assets are stored if you are not working with the CLI and/or you have specified ``.synapseCache`` as the location to store your manifests in your ``config.yml``.

.. important::
   When developing on ``schematic``, keep your ``.synapseConfig`` in your current working directory to avoid authentication errors.

**Create and modify the ``config.yml``**

In this repository, there is a ``config_example.yml`` file with default configurations for various components required before running ``schematic``, such as the Synapse ID of the main file view containing all your project assets, the base name of your manifest files, etc.

Copy the contents of the ``config_example.yml`` (located in the base directory of the cloned ``schematic`` repo) into a new file called ``config.yml``:

.. code-block:: shell

   cp config_example.yml config.yml

Once you've copied the file, modify its contents according to your use case. For example, if you wanted to change the folder where manifests are downloaded, your config should look like:

.. code-block:: text

   manifest:
     manifest_folder: "my_manifest_folder_path"

.. important::
   Be sure to update your ``config.yml`` with the location of your ``.synapseConfig`` created in the step above to avoid authentication errors. Paths can be specified relative to the ``config.yml`` file or as absolute paths.
   By default, the ``.synapseConfig`` file is created in your home directory, so as an example, the configuration file will have to contain `/full/path/to/.synapseConfig` as the path to the ``.synapseConfig`` file, or the ``.synapseConfig`` will have to be in the same directory as the ``config.yml`` file.

.. note::
   ``config.yml`` is ignored by git.

6. Obtain Google credential files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any function that interacts with a Google Sheet (such as ``schematic manifest get``) requires Google Cloud credentials.

1. **Option 1**: Follow the step-by-step `guide `_ on how to create these credentials in Google Cloud.

   - Depending on your institution's policies, your institutional Google account may or may not have the required permissions to complete this. A possible workaround is to use a personal or temporary Google account.
.. warning::
   At the time of writing, Sage Bionetworks employees do not have the appropriate permissions to create projects with their Sage Bionetworks Google accounts. You would need to follow the instructions using a personal Google account.

2. **Option 2**: Ask your DCC/development team if they have credentials previously set up with a service account.

Once you have obtained credentials, ensure that the JSON file generated is named in the same way as the ``service_acct_creds`` parameter in your ``config.yml`` file.

.. important::
   For testing, ensure there is no environment variable ``SCHEMATIC_SERVICE_ACCOUNT_CREDS``. Check the file ``.env`` to ensure this is not set. Also, verify that config files used for testing, such as ``config_example.yml``, do not contain ``service_acct_creds_synapse_id``.

.. note::
   Running ``schematic init`` is no longer supported due to security concerns. To obtain ``schematic_service_account_creds.json``, please follow the `instructions `_. Schematic uses Google's API to generate Google Sheet templates that users fill in to provide (meta)data.
   Most Google Sheet functionality can be authenticated with a service account. However, more complex Google Sheet functionality requires token-based authentication. As browser support for token-based authentication diminishes, we hope to deprecate token-based authentication and keep only service account authentication in the future.

.. note::
   Use the ``schematic_service_account_creds.json`` file for the service account mode of authentication (*for Google services/APIs*). Service accounts are special Google accounts that can be used by applications to access Google APIs programmatically via OAuth2.0, with the advantage being that they do not require human authorization.


7. Verify your setup
~~~~~~~~~~~~~~~~~~~~

After running the steps above, your setup is complete, and you can test it in a ``python`` instance or by running a command based on the examples above.
diff --git a/docs/source/linkml.rst b/docs/source/linkml.rst
new file mode 100644
index 000000000..ef44d7314
--- /dev/null
+++ b/docs/source/linkml.rst
@@ -0,0 +1,18 @@
======
LinkML
======

Background
==========

DPE is currently looking into what the future of Schematic might look like. This includes the possibility of completely reworking how we handle data models. Currently, Schematic supports data models in CSV or JSON-LD format. Several DCCs are either using or planning on using LinkML to create their data models and then port them to JSON-LD for use in schematic. One possibility in Schematic 2.0 (placeholder name) is to make LinkML the format for data models and to use native LinkML functionality where possible to reduce the work that Schematic does.

Links
=====
LinkML `documentation `_

DPE performed a `comparison `_ between Schematic's current functionality and what LinkML could provide via the CLI. This is currently restricted to Sage Bionetworks staff.
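As a taste of that CLI surface, the ``linkml`` package ships generators that convert a LinkML YAML model into other representations. A sketch, assuming the package is installed and ``model.yaml`` is a placeholder for your LinkML model:

.. code-block:: shell

   pip install linkml
   gen-json-schema model.yaml > model.schema.json
   gen-jsonld-context model.yaml > model.context.jsonld
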
The `HTAN2 model `_ in LinkML

The `NF Repo `_ that creates the LinkML model.
diff --git a/docs/source/manifest_generation.rst b/docs/source/manifest_generation.rst
new file mode 100644
index 000000000..560776d5e
--- /dev/null
+++ b/docs/source/manifest_generation.rst
@@ -0,0 +1,248 @@
.. _manifest_generation:

Generate a manifest
===================
A **manifest** is a structured file containing metadata that adheres to a specific data model. This page covers different ways to generate a manifest.

Prerequisites
-------------

**Before Using the Schematic CLI**

- **Install and Configure Schematic**:
  Ensure you have installed `schematic` and set up its dependencies.
  See the :ref:`installation` section for more details.

- **Understand Important Concepts**:
  Familiarize yourself with the key concepts outlined on the :ref:`index` page of the documentation.

- **Configuration File**:
  Learn more about each attribute in the configuration file by referring to the relevant documentation.


**Using the Schematic API in Production**

Visit the **Schematic API (Production Environment)**:
``_

This will open the **Swagger UI**, where you can explore all available API endpoints.

Run help command
----------------

You can run the following command to learn about the subcommands for manifest generation:

.. code-block:: bash

   schematic manifest -h

You can also run the following command to learn about all the options for manifest generation:

.. code-block:: bash

   schematic manifest --config path/to/config.yml get -h


Generate an empty manifest
---------------------------

Option 1: Use the CLI
~~~~~~~~~~~~~~~~~~~~~

You can generate a manifest by running the following command:

.. code-block:: bash

   schematic manifest -c /path/to/config.yml get -dt <data_type> -s

- **-c /path/to/config.yml**: Specifies the configuration file containing your data model location.
- **-dt <data_type>**: Defines the data type for the manifest (e.g., `"Patient"`, `"Biospecimen"`).
- **-s**: Generates a manifest as a Google Sheet.

If you want to generate a manifest as an Excel spreadsheet, you can run:

.. code-block:: bash

   schematic manifest -c /path/to/config.yml get -dt <data_type> --output-xlsx <output_path>

And if you want to generate a manifest as a CSV file, you can run:

.. code-block:: bash

   schematic manifest -c /path/to/config.yml get -dt <data_type> --output-csv <output_path>

Option 2: Use the API
~~~~~~~~~~~~~~~~~~~~~

1. Visit the `manifest/generate endpoint `_.
2. Click "Try it out" to enable input fields.
3. Enter the following parameters and execute the request:

   - **schema_url**: The URL of your data model.

     - If your data model is hosted on **GitHub**, the URL should follow this format:

       - JSON-LD: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.jsonld`
       - CSV: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.csv`

   - **data_type**: The data type or schema model for your manifest (e.g., `"Patient"`, `"Biospecimen"`).

     - You can specify multiple data types or enter `"all manifests"` to generate manifests for all available data types.

   - **output_format**: The desired format for the generated manifest. Options include `"excel"` or `"google_sheet"`.

This will generate a manifest directly from the API.


Generate a manifest using a dataset on Synapse
----------------------------------------------

Option 1: Use the CLI
~~~~~~~~~~~~~~~~~~~~~~
.. note::

   See the :ref:`installation` section for details on obtaining Synapse credentials and setting up the Synapse configuration file.


The **top-level dataset** can be either an empty folder or a folder containing files.

See below an example of a top-level dataset:

.. code-block:: text

   syn12345678/
   ├── sample1.fastq
   ├── sample2.fastq
   └── sample3.fastq

Here you should use syn12345678 to generate a manifest.

See another example of a top-level dataset with subfolders:

.. code-block:: text

   syn12345678/
   ├── subfolder1/
   │   ├── sample1.fastq
   │   └── sample2.fastq
   └── subfolder2/
       ├── sample3.fastq
       └── sample4.fastq

Here you should use syn12345678 to generate a manifest.


.. code-block:: bash

   schematic manifest -c /path/to/config.yml get -dt <data_type> -s -d <dataset_id>

- **-c /path/to/config.yml**: Specifies the configuration file containing the data model location and asset view (`master_fileview_id`).
- **-dt <data_type>**: Defines the data type/schema model for the manifest (e.g., `"Patient"`, `"Biospecimen"`).
- **-d <dataset_id>**: Retrieves the existing manifest associated with a specific dataset on Synapse.


Option 2: Use the API
~~~~~~~~~~~~~~~~~~~~~~

To generate a manifest using the **Schematic API**, follow these steps:

1. Visit the `manifest/generate endpoint `_.
2. Click **"Try it out"** to enable input fields.
3. Enter the required parameters and execute the request:

   - **schema_url**: The URL of your data model.

     - If your data model is hosted on **GitHub**, the URL should follow this format:

       - JSON-LD: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.jsonld`
       - CSV: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.csv`

   - **output_format**: The desired format for the generated manifest.

     - Options include `"excel"` or `"google_sheet"`.

   - **data_type**: The data type or schema model for your manifest (e.g., `"Patient"`, `"Biospecimen"`).

     - You can specify multiple data types or enter `"all manifests"` to generate manifests for all available data types.

   - **dataset_id**: The **top-level Synapse dataset ID**.

     - This can be a **Synapse Project ID** or a **Folder ID**.

   - **asset_view**: The **Synapse ID of the fileview** containing the top-level dataset for which you want to generate a manifest.

Generate a manifest using a dataset on Synapse and pull annotations
-------------------------------------------------------------------

.. note::
   When you pull annotations from Synapse, the existing metadata (annotations) associated with files or folders in a Synapse dataset is automatically retrieved and pre-filled into the generated manifest.
   This saves time and ensures consistency between the Synapse dataset and the manifest.

   See below an example:

   .. code-block:: text

      syn12345678/
      ├── file1.txt
      ├── file2.txt
      └── file3.txt

   The corresponding annotations might look like this:

   - **file1.txt**

     - Annotation Key: `species`
     - Annotation Value: `test1`

   - **file2.txt**

     - Annotation Key: `species`
     - Annotation Value: `test2`

   - **file3.txt**

     - Annotation Key: `species`
     - Annotation Value: `test3`

   The generated manifest will include the above annotations pulled from Synapse when enabled.


Option 1: Use the CLI
~~~~~~~~~~~~~~~~~~~~~~

.. note::

   Ensure your **Synapse credentials** are configured before running the command.
   You can obtain a **personal access token** from Synapse by following the instructions here:
   ``_


The **top-level dataset** can be either an empty folder or a folder containing files.

.. code-block:: bash

   schematic manifest -c /path/to/config.yml get -dt <data_type> -s -d <dataset_id> -a

- **-c /path/to/config.yml**: Specifies the configuration file containing the data model location and asset view (`master_fileview_id`).
- **-a**: Pulls annotations from Synapse and fills out the manifest with the annotations.
- **-dt <data_type>**: Defines the data type/schema model for the manifest (e.g., `"Patient"`, `"Biospecimen"`).
- **-d <dataset_id>**: Retrieves the existing manifest associated with a specific dataset on Synapse.


Option 2: Use the API
~~~~~~~~~~~~~~~~~~~~~~

To generate a manifest using the **Schematic API**, follow these steps:

1. Visit the `manifest/generate endpoint `_.
2. Click **"Try it out"** to enable input fields.
3. Enter the required parameters and execute the request:

   - **schema_url**: The URL of your data model.

     - If your data model is hosted on **GitHub**, the URL should follow this format:

       - JSON-LD: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.jsonld`
       - CSV: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.csv`

   - **output_format**: The desired format for the generated manifest.

     - Options include `"excel"` or `"google_sheet"`.

   - **data_type**: The data type or schema model for your manifest (e.g., `"Patient"`, `"Biospecimen"`).

     - You can specify multiple data types or enter `"all manifests"` to generate manifests for all available data types.

   - **dataset_id**: The **top-level Synapse dataset ID**.

     - This can be a **Synapse Project ID** or a **Folder ID**.

   - **asset_view**: The **Synapse ID of the fileview** containing the top-level dataset for which you want to generate a manifest.

   - **use_annotations**: A boolean value that determines whether to pull annotations from Synapse and fill out the manifest with them.

     - Set this value to `true` to pull annotations.
diff --git a/docs/source/manifest_submission.rst b/docs/source/manifest_submission.rst
new file mode 100644
index 000000000..441f28119
--- /dev/null
+++ b/docs/source/manifest_submission.rst
@@ -0,0 +1,314 @@
Submit a manifest to Synapse
============================

Prerequisites
-------------

**Obtain Synapse Credentials**:
Ensure you have a Synapse account and have set up your Synapse configuration file correctly. See the :ref:`installation` section for more details.

**Before Using the Schematic CLI**

- **Install and Configure Schematic**:
  Ensure you have installed `schematic` and set up its dependencies.
  See the :ref:`installation` section for more details.

- **Understand Important Concepts**:
  Familiarize yourself with key concepts outlined on the :ref:`index` page of the documentation.

- **Configuration File**:
  For more details on configuring Schematic, refer to the :ref:`configuration` section.

- **Obtain a manifest**:
  Please obtain a manifest by following the documentation on :ref:`generating a manifest <manifest_generation>`.


**Using the Schematic API in Production**

Visit the **Schematic API (Production Environment)**:
``_

This will open the **Swagger UI**, where you can explore all available API endpoints.


Run help command
----------------

You can run the following command to learn about the subcommands for manifest submission:
.. code-block:: bash

   schematic model -h

You can also run the following command to learn about all the options for manifest submission:

.. code-block:: bash

   schematic model --config path/to/config.yml submit -h


Submit a Manifest File to Synapse
---------------------------------

.. note::

   You can configure the format of the manifest being submitted by using the `-mrt` flag in the CLI or the `manifest_record_type` parameter in the API.

   For table column names, here's a brief explanation of all the options:

   - display_name: use the raw display name defined in the data model as the column name; no modifications to the name will be made.
   - display_label: use the display name formatting as the column name. Will strip blacklisted characters (including spaces) when present.
     The blacklisted characters are: "(", ")", ".", " ", "-"
   - class_label: the default; use the standard class label and strip any blacklisted characters (including spaces) when present. A schematic class label is UpperCamelCase.

.. note::

   Manifests should be submitted to the top-level dataset folder. Below are some examples demonstrating where the manifest file should go:

   .. code-block:: text

      syn12345678/
      ├── file1.csv
      ├── file2.csv
      ├── manifest.csv

   Here is the top-level folder ID: syn12345678

   Here's an example using subfolders:

   .. code-block:: text

      syn12345678/
      ├── subfolder1/
      │   └── file1
      ├── subfolder2/
      │   └── file2
      ├── file3
      ├── manifest.csv

   Here is the top-level folder ID: syn12345678


Option 1: Use the CLI
~~~~~~~~~~~~~~~~~~~~~~

.. note::

   During submission, validation is optional. If you finished validation in a previous step, you can skip it by removing `-vc <data_type>`.


.. code-block:: bash

   schematic model -c /path/to/config.yml submit -mp <manifest_path> -d <dataset_id> -vc <data_type> -mrt table_and_file -no-fa -tcn "class_label"

- **-c /path/to/config.yml**: Specifies the configuration file containing the data model location and asset view (`master_fileview_id`).
- **-mp <manifest_path>**: Your manifest file path.
- **-mrt**: The format of manifest submission. The options are: "table_and_file", "file_only", "file_and_entities", "table_file_and_entities". The "file_only" option submits the manifest as a file.
- **-vc <data_type>**: Defines the data type/schema model for the manifest (e.g., `"Patient"`, `"Biospecimen"`). To skip validation, remove this flag.
- **-d <dataset_id>**: The top level dataset ID that you want to submit the manifest to.
- **-no-fa**: Skips the file annotations upload.
- **-tcn**: Table Column Names: This is optional, and the available options are "class_label", "display_label", and "display_name". The default is "class_label", but you can change it based on your requirements.


Option 2: Use the API
~~~~~~~~~~~~~~~~~~~~~~

.. note::

   During submission, validation is optional. If you finished validation in a previous step, you can skip it by excluding the `data_type` and `dataset_scope` parameter values.


1. Visit the `**model/submit** endpoint `_
2. Click **"Try it out"** to enable input fields.
3. Enter the required parameters and execute the request:

   - **schema_url**: The raw URL of your data model. If your data model is hosted on **GitHub**, use the following formats:

     - JSON-LD: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.jsonld`
     - CSV: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.csv`

   - **data_type**: Specify the data type or schema model for your manifest (e.g., `"Patient"`, `"Biospecimen"`).
     To skip validation, exclude this parameter by removing the default inputs.

   - **dataset_id**: Provide the **top-level Synapse dataset ID**.

     - This can be either a **Synapse Project ID** or a **Folder ID**.

   - **asset_view**: Enter the **Synapse ID of the fileview** containing the top-level dataset for which you want to generate a manifest.

   - **dataset_scope** and **project_scope**: Remove the default inputs.

   - **file_annotations_upload**: Set this to `False`.

   - **table_manipulation**: The default is "replace". You can keep it as is.

   - **manifest_record_type**: Set this to "table_and_file" or adjust it based on your project requirements.

   - **table_column_names**: This is optional. Available options are "class_label", "display_label", and "display_name". The default is "class_label".



Submit a Manifest File and Add Annotations
-------------------------------------------

.. note::

   Since annotations are enabled in the submission, if you are submitting a file-based manifest, you should see annotations attached to the entity IDs listed in the manifest.



Option 1: Use the CLI
~~~~~~~~~~~~~~~~~~~~~~


.. note::

   During submission, validation is optional. If you finished validation in a previous step, you can skip it by removing `-vc <data_type>`.


.. code-block:: bash

   schematic model -c /path/to/config.yml submit -mp <manifest_path> -d <dataset_id> -vc <data_type> -mrt table_and_file -fa -tcn "class_label"

- **-c /path/to/config.yml**: Specifies the configuration file containing the data model location and asset view (`master_fileview_id`).
- **-mp <manifest_path>**: Your manifest file path.
- **-mrt**: The format of manifest submission. The options are: "table_and_file", "file_only", "file_and_entities", "table_file_and_entities". The "file_only" option submits the manifest as a file.
- **-vc <data_type>**: Defines the data type/schema model for the manifest (e.g., `"Patient"`, `"Biospecimen"`). To skip validation, remove this flag.
- **-d <dataset_id>**: The top level dataset ID that you want to submit the manifest to.
- **-fa**: Enables the file annotations upload.
- **-tcn**: Table Column Names: This is optional, and the available options are "class_label", "display_label", and "display_name". The default is "class_label", but you can change it based on your requirements.


Option 2: Use the API
~~~~~~~~~~~~~~~~~~~~~~

.. note::

   During submission, validation is optional. If you finished validation in a previous step, you can skip it by excluding the `data_type` and `dataset_scope` parameter values.


1. Visit the `**model/submit** endpoint `_
2. Click **"Try it out"** to enable input fields.
3. Enter the required parameters and execute the request:

   - **schema_url**: The raw URL of your data model. If your data model is hosted on **GitHub**, the URL should follow this format:

     - JSON-LD: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.jsonld`
     - CSV: `https://raw.githubusercontent.com/<owner>/<repo>/<branch>/data-model.csv`

   - **data_type**: Specify the data type or schema model for your manifest (e.g., `"Patient"`, `"Biospecimen"`). To skip validation, exclude this parameter by removing the default inputs.

   - **dataset_id**: The **top-level Synapse dataset ID**.

     - This can be a **Synapse Project ID** or a **Folder ID**.

   - **asset_view**: The **Synapse ID of the fileview** containing the top-level dataset for which you want to generate a manifest.

   - **dataset_scope** and **project_scope**: Remove any default inputs provided in these fields.
+ + - **file_annotations_upload**: Set this to `True`. + + - **table_manipulation**: The default is "replace". You can keep it as is or modify it if needed. + + - **manifest_record_type**: Set this to "table_and_file" or adjust it based on your project requirements. + + - **table_column_names**: This is optional. Available options are "class_label", "display_label", and "display_name". The default is "class_label". + + + +Expedite submission process (Optional) +--------------------------------------- + +If your asset view contains multiple projects, it might take some time for the submission to finish. + +You can expedite the submission process by specifying the `project_scope` parameter. This parameter allows you to specify the project(s) that you want to submit the manifest to. + +To use this parameter, make sure the projects you list are part of the asset view. + + +Option 1: Use the CLI +~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + schematic model -c /path/to/config.yml submit -mp -d -vc -no-fa -ps "project_id1, project_id2" + +- **-ps**: Specifies the project scope as a comma-separated list of project IDs. + + +Option 2: Use the API +~~~~~~~~~~~~~~~~~~~~~~ + +1. Visit the `**model/submit** endpoint `_ +2. Click **"Try it out"** to enable input fields. +3. Enter the required parameters and execute the request: + + - **schema_url**: The raw URL of your data model. If your data model is hosted on **GitHub**, use the following formats: + - JSON-LD: `https://raw.githubusercontent.com//data-model.jsonld` + - CSV: `https://raw.githubusercontent.com//data-model.csv` + + - **data_type**: Specify the data type or schema model for your manifest (e.g., `"Patient"`, `"Biospecimen"`). To skip validation, exclude this parameter by removing the default inputs. + + - **dataset_id**: Provide the **top-level Synapse dataset ID**. + - This can be either a **Synapse Project ID** or a **Folder ID**. + + - **asset_view**: Enter the **Synapse ID of the fileview** containing the top-level dataset for which you want to generate a manifest. + + - **project_scope**: Remove the default inputs. Add project IDs as string items. + + - **dataset_scope**: Remove default inputs. + + - **file_annotations_upload**: Set this to `False`. + + - **table_manipulation**: The default is "replace". You can keep it as is. + + - **manifest_record_type**: Set this to "file_only" or adjust it based on your project requirements. + + - **table_column_names**: This parameter is not applicable when uploading a manifest as a file. You can keep it as is and it will be ignored. + + +Enable upsert for manifest submission +------------------------------------- + +By default, the CLI/API will replace the existing manifest and table with the new one. If you want to update the existing manifest and table, you can use the upsert option. + + +Pre-requisite +~~~~~~~~~~~~~~ + +1. Ensure that all your manifests, including both the initial manifests and those containing rows to be upserted, include a primary key column: `<component>_id`. For example, if your component name is "Patient", the primary key should be "Patient_id". +2. If you plan to use upsert in the future, select the upsert option during the initial table uploads. +3. Currently, it is required to use `-tcn "display_label"` with table upserts. + + +Option 1: Use the CLI +~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: bash + + schematic model -c /path/to/config.yml submit -mp -d -mrt table_and_file -no-fa -tcn "display_label" -tm "upsert" + +- **-tm**: The default option is "replace".
Change it to "upsert" to enable upsert. +- **-tcn**: Use "display_label" for upsert. + +Option 2: Use the API +~~~~~~~~~~~~~~~~~~~~~~ + +1. Visit the `**model/submit** endpoint `_ +2. Click **"Try it out"** to enable input fields. +3. Enter the required parameters and execute the request: + + - **schema_url**: The raw URL of your data model. If your data model is hosted on **GitHub**, use the following formats: + - JSON-LD: `https://raw.githubusercontent.com//data-model.jsonld` + - CSV: `https://raw.githubusercontent.com//data-model.csv` + + - **data_type**: Specify the data type or schema model for your manifest (e.g., `"Patient"`, `"Biospecimen"`). To skip validation, exclude this parameter by removing the default inputs. + + - **dataset_id**: Provide the **top-level Synapse dataset ID**. + - This can be either a **Synapse Project ID** or a **Folder ID**. + + - **asset_view**: Enter the **Synapse ID of the fileview** containing the top-level dataset for which you want to generate a manifest. + + - **dataset_scope** and **project_scope**: Remove the default inputs. + + - **file_annotations_upload**: Set this to `False` if you do not want annotations to be uploaded. + + - **table_manipulation**: Update it to "upsert". + + - **manifest_record_type**: Set this to **"table_and_file"**. + + - **table_column_names**: Choose **"display_label"** for upsert. diff --git a/docs/source/manifest_validation.rst b/docs/source/manifest_validation.rst new file mode 100644 index 000000000..7c1abc16f --- /dev/null +++ b/docs/source/manifest_validation.rst @@ -0,0 +1,220 @@ +.. _Validating a Metadata Manifest: + +Validating a Metadata Manifest +================================================= + +Prerequisites +------------- + +**Obtain Synapse Credentials**: +Ensure you have a Synapse account and have set up your Synapse configuration file correctly. See the :ref:`installation` section for more details. + +**Before Using the Schematic CLI** + +- **Install and Configure Schematic**: + Ensure you have installed `schematic` and set up its dependencies. + See the :ref:`installation` section for more details. + +- **Understand Important Concepts**: + Familiarize yourself with key concepts outlined on the :ref:`index` of the documentation. + +- **Configuration File**: + For more details on configuring Schematic, refer to the documentation on :ref:`creating a configuration file for schematic `. + +- **Obtain a manifest**: + Please obtain a manifest by following the documentation on :ref:`generating a manifest `. + + +**Using the Schematic API in Production** + +Visit the **Schematic API (Production Environment)**: +``_ + +This will open the **Swagger UI**, where you can explore all available API endpoints. + + +Requirements +------------------------------------------------- + +Authentication +~~~~~~~~~~~~~~~~~~~~ +Authentication with Synapse is required for metadata validation that includes Cross Manifest Validation rules or the ``filenameExists`` rule. + +File Format +~~~~~~~~~~~~~~ +In general, metadata manifests must be stored as ``.CSV`` files. When validating through the API, manifests may alternatively be sent as a JSON string. + +Required Column Headers +~~~~~~~~~~~~~~~~~~~~~~~~~ +A ``Component`` column that specifies the data type of the metadata must be present in the manifest. Additionally, columns must be present for each attribute in the component that you wish to validate.
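 + +For example, a minimal ``Patient`` manifest might look like the sketch below; the ``PatientID`` and ``Sex`` columns are hypothetical and depend on your data model, but the ``Component`` column is always required: + +.. code-block:: text + + Component,PatientID,Sex + Patient,P-001,Female + Patient,P-002,Male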
+ + Restricted Column Headers +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The columns ``Filename``, ``entityId``, and ``Component`` are reserved for use by schematic and should not be used as other attributes in a data model. + + +Manifest Validation +------------------------------------------------- +Overview +~~~~~~~~~ +Invalidities within a manifest's metadata are classified as either errors or warnings depending on the rule itself, whether the attribute is required, and what the data modeler has specified. +Errors are considered serious invalidities that must be corrected before submission. Warnings are considered less serious invalidities that are acceptable. +A manifest with errors should not be submitted, and any errors found during submission will block the submission. The presence of warnings will not block submission. + +.. note:: + Validation can be performed as its own separate step, or during submission by including the ``-vc`` parameter and the data type of the metadata to validate. + + +Separately: + +.. code-block:: bash + + schematic model -c /path/to/config.yml validate -dt -mp + + +or with the `/model/validate `_ endpoint. + +During submission: + +.. code-block:: bash + + schematic model -c /path/to/config.yml submit -mp -d -vc -mrt file_only + + +or by specifying a value for the ``data_type`` parameter in the `/model/submit `_ endpoint. + +If you need further assistance, help is available by running the following command: + +.. code-block:: bash + + schematic model -c /path/to/config.yml validate -h + +or by viewing the parameter descriptions under the endpoints linked above. + + +With the CLI +~~~~~~~~~~~~~~~ + +Authentication +^^^^^^^^^^^^^^^^ +To authenticate for use with the CLI, follow the installation guide instructions on how to :ref:`set up configuration files `. + +Parameters +^^^^^^^^^^^^^^^ +--manifest_path/-mp + string + + Specify the path to the metadata manifest file that you want to submit to a dataset on Synapse. This is a required argument. + +--data_type/-dt + optional string + + Data type of the metadata to be validated. + + Specify the component (data type) from the data model that is to be used for validating the metadata manifest file. You can either explicitly pass the data type here or provide it in the ``config.yml`` file as a value for the ``(manifest > data_type)`` key. + +--json_schema/-js + optional string + + Specify the path to the JSON Validation Schema for this argument. You can either explicitly pass the ``.json`` file here or provide it in the ``config.yml`` file as a value for the ``(model > input > validation_schema)`` key. + +--restrict_rules/-rr + boolean flag + + If the flag is provided when the command line utility is executed, the validation suite will only run with in-house validation rules, and Great Expectations rules and suite will not be utilized. If not, the Great Expectations suite will be utilized and all rules will be available. + +--project_scope/-ps + optional string + + Specify a comma-separated list of projects to search through for cross manifest validation. Used to speed up some interactions with Synapse. + +--dataset_scope/-ds + string + + Specify a dataset to validate against for filename validation. + +--data_model_labels/-dml + string + + one of: + + * class_label - use standard class or property label + * display_label - use display names (values given in the CSV data model, or the names designated as the display name field of the JSONLD data model) as label.
Requires there to be no blacklisted characters in the label. + + default: class_label + + .. warning:: + Do not change from the default unless there is a real need; using 'display_label' can have consequences if not used properly. + +The SynId of the fileview containing all relevant project assets should also be specified in the ``config.yml`` file under ``(asset_store > synapse > master_fileview_id)``. + + +With the API +~~~~~~~~~~~~~~~ + +Authentication +^^^^^^^^^^^^^^^^ +Your Synapse token should be included in the request headers under the ``access_token`` key. In the SwaggerUI this can be added by clicking the padlock icon at the top right or next to the endpoints that accept it. + +Parameters +^^^^^^^^^^^^^^^ +schema_url + string + URL to the raw version of the data model in either ``.CSV`` or ``.JSONLD`` formats + +data_type + string + Data type of the metadata to be validated. + +data_model_labels + string + one of: + + * class_label - use standard class or property label + * display_label - use display names (values given in the CSV data model, or the names designated as the display name field of the JSONLD data model) as label. Requires there to be no blacklisted characters in the label. + + default: class_label + + .. warning:: + Do not change from the default unless there is a real need; using 'display_label' can have consequences if not used properly. + +restrict_rules + boolean + If True, the validation suite will only run with in-house validation rules. If False, the Great Expectations suite will be utilized and all rules will be available. + +json_str + string + optional + The metadata manifest in the form of a JSON string. + +asset_view + string + SynId of the fileview containing all project assets + +project_scope + optional array[string] + list of SynIds of projects that are relevant for the current operation. Used to speed up some interactions with Synapse. + +dataset_scope + string + Specify a dataset to validate against for filename validation. + +Request Body +^^^^^^^^^^^^^ +file_name + string($binary) + + ``.CSV`` or ``.JSON`` file of the metadata manifest + + +Response +^^^^^^^^^^^ +If validation completes successfully, regardless of the presence of validation errors or warnings, you'll receive a ``200`` response code. +The body will be a JSON string containing a list of validation errors and warnings in the format of ``{"errors": [list of errors], "warnings": [warnings]}`` + +Validating through the CLI will display all the errors and warnings found during validation, or a message that no errors or warnings were found and the manifest is considered valid. + + +With the Library +~~~~~~~~~~~~~~~~~ +TODO diff --git a/docs/source/troubleshooting.rst b/docs/source/troubleshooting.rst new file mode 100644 index 000000000..ec8f33a4f --- /dev/null +++ b/docs/source/troubleshooting.rst @@ -0,0 +1,195 @@ +Troubleshooting +=============== + +These are some common issues you may encounter when using schematic. + +Debugging +--------- +Whether you are using DCA, the schematic API, or the schematic library/CLI, the following steps will walk you through debugging your issue. + +1. What was the command that caused the error? +2. Is the error listed below? +3. Did you follow the workflow outlined in the tutorials section under: "Contributing your manifest with the CLI"? + + 1. If you are validating or submitting the manifest, how was the manifest initially generated? If manually and NOT using schematic, there may be errors. + 2.
If the manifest was generated by schematic, when was it generated? Did you download the previously submitted manifest from Synapse and modify it? Did you download it and resubmit it? Please run the manifest generate command again to have a fresh manifest. +4. Does your data model have any "reserved words" as attribute names? + + The following are reserved words that should not be used as attribute names in your data model: + + - **id** (any case variation): + - When submitting a manifest, schematic automatically adds the "Id" column to the manifest. If you have "id" (any case variation) in your model, it could potentially cause a problem or confusion. + - Avoid mapping the "display name" of "Id" to "id" (lowercase), as "id" is a reserved internal key for Synapse and cannot be used as an annotation key. + + - **entityId** (any case variation): + - The `entityId` column in a manifest refers to the Synapse ID of the file that a particular row of metadata describes. + - It ensures your metadata is attached to the correct file in Synapse. + - When generating a manifest, schematic automatically adds the `entityId` column to the manifest to ensure that the metadata is attached to the correct file in Synapse during the submission step. + - When submitting or updating metadata, schematic uses `entityId` to know where the annotations should go. + + - **eTag** (any case variation): + - The `eTag` is a version identifier for a file in Synapse. It helps ensure that metadata is being applied to the correct version of an entity. + - When submitting or updating metadata, schematic automatically adds the `eTag` column to the manifest. + + Please also note that the following are reserved words for **Synapse table columns**. Any variations of the following would cause a conflict with Synapse table columns: + + - **ROW_ID** (any case variation): + - `row_id` + - `RowID` + - `ROW ID` (contains a space) + - ` row_id ` (contains leading/trailing spaces) + + - **ROW_VERSION** (any case variation): + - `row_version` + - `RowVersion` + - `ROW VERSION` (contains a space) + - ` row_version ` (contains leading/trailing spaces) + + - **ROW_ETAG** (any case variation): + - `row_etag` + - `RowETag` + - `ROW ETAG` (contains a space) + - ` row_etag ` (contains leading/trailing spaces) + + - **ROW_BENEFACTOR** (any case variation): + - `row_benefactor` + - `RowBenefactor` + - `ROW BENEFACTOR` (contains a space) + - ` row_benefactor ` (contains leading/trailing spaces) + + - **ROW_SEARCH_CONTENT** (any case variation): + - `row_search_content` + - `RowSearchContent` + - `ROW SEARCH CONTENT` (contains spaces) + - ` row_search_content ` (contains leading/trailing spaces) + + - **ROW_HASH_CODE** (any case variation): + - `row_hash_code` + - `RowHashCode` + - `ROW HASH CODE` (contains spaces) + - ` row_hash_code ` (contains leading/trailing spaces) + + The following are reserved words for **Synapse annotations**. Variations of these words could potentially work (except id and etag which are already reserved words for schematic), but it is recommended to avoid them altogether. For more details, refer to the Synapse REST API documentation: `EntityView `__. + + - **name**: The name of this entity. Must be 256 characters or less. Names may only contain: letters, numbers, spaces, underscores, hyphens, periods, plus signs, apostrophes, and parentheses. + - **description**: The description of this entity. Must be 1000 characters or less. + - **id**: The unique immutable ID for this entity. A new ID will be generated for new Entities. 
Once issued, this ID is guaranteed to never change or be re-issued. + - **etag**: Synapse employs an Optimistic Concurrency Control (OCC) scheme to handle concurrent updates. Since the E-Tag changes every time an entity is updated, it is used to detect when a client's current representation of an entity is out-of-date. + - **createdOn**: The timestamp when the entity was created. + - **modifiedOn**: The timestamp when the entity was last modified. + - **createdBy**: The ID of the user who created this entity. + - **modifiedBy**: The ID of the user who last modified this entity. + - **parentId**: The ID of the Entity that is the parent of this Entity. + - **concreteType**: Indicates which implementation of Entity this object represents. The value is the fully qualified class name, e.g., `org.sagebionetworks.repo.model.FileEntity`. + - **versionNumber**: The version number issued to this version of the object. + - **versionLabel**: The version label for this entity. + - **versionComment**: The version comment for this entity. + - **isLatestVersion**: A boolean indicating if this is the latest version of the object. + - **columnIds**: An array of ColumnModel IDs that define the schema of the object. + - **isSearchEnabled**: A boolean specifying if full-text search is enabled. Note that enabling full-text search might slow down the indexing of the table or view. + - **viewTypeMask**: A bitmask representing the types to include in the view. + - **type**: Deprecated. Use `viewTypeMask` instead. + - **scopeIds**: The list of IDs defining the scope of the view. + + The following also have special meaning to schematic. Misusing these terms in your data model could lead to errors or unexpected behavior. Please read carefully before using them in your data model: + + - **Filename**: + For data types that are stored in data files, the attribute `Filename` is used to denote the file name of each file in a dataset. + If `Filename` is not included in the data type schema attributes, schematic interprets the data type as "tabular" (e.g., clinical, biospecimen data). + + - **Component**: + The `Component` field in schematic is used to define higher-level groupings of attributes. + - For example, a Patient might be described by components such as Demographics, Family History, Diagnosis, and Therapy, each with its own set of attributes and corresponding manifest. + - Schematic allows declaration of "components" and relationships between components. + - Schematic also enables validation and tracking of components across related entities (e.g., ensuring that all parts of a Patient record are present). + +5. Create a GitHub issue or reach out to your respective DCC service desks. What is the schematic or DCA configuration used? Specifically, it's most important to capture the following: + + 1. `data_type`: This is the same as Component in the data model. + 2. `master_fileview_id`: This is the Synapse ID of the file view listing all project data. + 3. `data model url`: This is the link to your data model. + 4. `dataset_id`: This is the "top level folder" (folder annotated with ``contentType: dataset``). + 5. What is the command or API call that you made? If you are using DCA, please provide the step at which you encountered the error (manifest generate, validate, submit, etc.) + + ..
code-block:: bash + + schematic manifest -c /path/to/config.yml get -dt -s + # OR (PLEASE REDACT YOUR BEARER TOKEN) + curl -X 'GET' \ + 'https://schematic.api.sagebionetworks.org/v1/manifest/generate?schema_url=https%3A%2F%2Fraw.githubusercontent.com%2Fnf-osi%2Fnf-metadata-dictionary%2Fv9.8.0%2FNF.jsonld&title=Example&data_type=EpigeneticsAssayTemplate&use_annotations=true&dataset_id=syn63305821&asset_view=syn16858331&output_format=google_sheet&strict_validation=true&data_model_labels=class_label' \ + -H 'accept: application/json' ... + + +Manifest Submit: `RuntimeError: failed with SynapseHTTPError('400 Client Error: nan is not a valid Synapse ID.')` +----------------------------------------------------------------------------------------------------------------- + +As of version 24.10.2 of Schematic, we require the `Filename` column to have the full paths to the files on Synapse, including the project name. +You will encounter this issue if you try to submit a manifest with wrong Filename values. For example, if a file in your project has the full path +`my_project/my_folder/my_file.txt`, you will get this error if the Filename value is: + +* missing the full path (e.g. `my_file.txt`) +* the wrong filename (e.g. `my_project/my_folder/wrong_file_name.txt`) +* the wrong filepath (e.g. `my_project/wrong_folder/my_file.txt`) + +This is because we join the `Filename` column together with what's in Synapse to append the `entityId` column if it's missing. + +To fix: You will want to first check if your "Top Level Folder" has a manifest with invalid Filename values in the column. +If so, please generate a manifest with schematic, which should fix the Filenames, OR (the less preferred solution) manually update the Filenames to include the full path to each file and manually upload. + + +Manifest Submit: `TypeError: boolean value of NA is ambiguous` +-------------------------------------------------------------- + +You may encounter this error if your manifest has a Component column but it is empty. This may occur if the manifest in your "Top Level Folder" +does not contain this column. During manifest generation, schematic will create an empty column for you. + +To fix: Check if your manifest has an empty Component column. Please fill out this column with the correct Component values and submit the manifest again. + + +Manifest Submit: `AssertionError: input_df lacks Id column.` +-------------------------------------------------------------- + +You may encounter this error if your manifest has an "id" (lower case) column during submission. + +To fix: Delete the `id` (any case variation) and `eTag` (any case variation) columns from your manifest and submit the manifest again. + + +Manifest validation: `The submitted metadata does not contain all required column(s)` +------------------------------------------------------------------------------------- + +The required columns are determined by the data model, but `Component` should be a required column even if it's not set that way in the data model. +This is the validation error you may get if you don't have the `Component` column. + +To fix: Check whether your manifest is missing the `Component` column or other required columns. Please add the `Component` column (and fill it out) or any other required columns.
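 + +For instance, a manifest that triggers this error, and a corrected version, might look like the following sketch (the ``PatientID`` and ``Sex`` columns are hypothetical and depend on your data model): + +.. code-block:: text + + # Missing the Component column -- fails validation + PatientID,Sex + P-001,Female + + # Component column added and filled out -- passes validation + Component,PatientID,Sex + Patient,P-001,Female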
+ + Manifest validation: `The submitted metadata contains << 'string' >> in the Component column, but requested validation for << expected string >>` +------------------------------------------------------------------------------------------------------------------------------------------------- + +You might get the validation error above if the manifest has incorrect Component values. Note that the validation rule uses the "display" value of what's expected in the Component column. For example, the display name could be "Imaging Assay" +but the actual Component name is "ImagingAssayTemplate". + +To fix: Check if your manifest has invalid Component values and fill the column out correctly. Using the above example, fill out your Component column with "ImagingAssayTemplate". + + +Manifest Generate: `KeyError: entityId` +--------------------------------------- + +Fixed: v24.12.1 + +If the manifest currently in your "Top Level Folder" on Synapse has incorrect Filename values but does have an entityId column, you will be able to run manifest generate to create a new manifest with the correct Filenames. However, if the manifest on Synapse does +NOT have the entityId column, you will encounter this error. + +To fix: You will want to first check if your "Top Level Folder" has a manifest without the entityId column. +If so, you can either submit your manifest using schematic OR (the less preferred solution) manually add the entityId column to the manifest on Synapse. + +Manifest Generate: `ValueError: cannot insert eTag, already exists` +------------------------------------------------------------------- + +Fixed: v24.11.2 + +If you do NOT have a manifest in your "Top Level Folder" on Synapse, your File entities in this folder are annotated with an 'eTag' key, and you try to generate a manifest, it will fail. + +To fix: This is fixed in schematic as of v24.11.2; on older versions, remove the 'eTag' annotation from your files. diff --git a/docs/source/tutorials.rst b/docs/source/tutorials.rst new file mode 100644 index 000000000..6ecc1f6ff --- /dev/null +++ b/docs/source/tutorials.rst @@ -0,0 +1,95 @@ +Tutorials +========= + + +Contributing your manifest with the CLI +--------------------------------------- + +In this tutorial, you'll learn how to contribute your metadata manifests to Synapse using the `CLI`. Following best practices, +we will cover generating, validating, and submitting your manifest in a structured workflow. + +.. note:: + + Whether you have submitted manifests before to your "Top Level Folder" (see important terminology) OR are submitting a new manifest, we **strongly recommend** that you follow this workflow. + If you deviate from this workflow or upload files to Synapse directly without using schematic, you risk the errors outlined in the + troubleshooting section of the documentation. + + Question: If I've already gone through this workflow, can I download the manifest, modify it, and upload it to Synapse without Schematic? + + Answer: Yes, but you risk running into errors when others use these commands. + Updates may have been made to the data model by the DCC, and these changes won't be reflected unless you regenerate your manifest. + We strongly recommend not doing this. + + +Prerequisites +~~~~~~~~~~~~~ + +1. **Install and configure Schematic**: Ensure that you have installed `schematic` and set up its dependencies. See "Installation Guide For: Users" for more information. +2.
**Important Concepts**: Make sure you know the important concepts outlined on the home page of the doc site. +3. **Configuration**: Read more about each of the attributes in the configuration file. + +Steps to Contribute a Manifest +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The contribution process includes three main commands. +For information about the parameters of each of these commands, please refer to the CLI Reference section. + +1. **Generate** a manifest to fill out +2. **Validate** the manifest (optional, since it's included in submission) +3. **Submit** the manifest to Synapse + + +Step 1: Generate a Manifest +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The `schematic manifest get` command creates a manifest template based on a data model and existing manifests. + +.. note:: + + This step is crucial for ensuring that your manifest includes all the necessary columns and headers. As of v24.10.2, you will + want to generate the manifest to ensure the right Filenames are populated in your manifest. If you just uploaded data or files + to a folder, you may see that these files are missing, or you may get a `LookUp` error. This is an artifact of Synapse + fileviews; please run this command again. + +.. code-block:: bash + + schematic manifest -c /path/to/config.yml get -dt -s + +- **Data Type**: The data type or schema model for your manifest (e.g., "Patient", "Biospecimen"). + +This command will create a CSV file with the necessary columns and headers, which you can then fill with your metadata. + +Step 2: Validate the Manifest (Optional) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Though optional, `schematic model validate` is a useful step to ensure that your manifest meets the required standards before submission. +It checks for any errors, such as missing or incorrectly formatted values. + +.. note:: + + If your manifest has an empty Component column, you will need to fill it out before validation. + +.. code-block:: bash + + schematic model -c /path/to/config.yml validate -dt -mp + +If validation passes, you'll see a success message; if there are errors, `schematic` will list them. Correct any issues before proceeding to submission. + +Step 3: Submit the Manifest to Synapse +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The `schematic model submit` command uploads your manifest to Synapse. This command will automatically validate +the manifest as part of the submission process, so if you prefer, you can skip the standalone validation step. + +.. note:: + + During manifest submission, schematic will fill out the entityId column if it is missing. + +.. code-block:: bash + + schematic model -c /path/to/config.yml submit -mp -d -vc -mrt file_only + +This command will: + +- Validate your manifest +- If validation is successful, submit it to the specified "Top Level Folder" (see important terminology) in Synapse. diff --git a/docs/source/validation_rules.rst b/docs/source/validation_rules.rst new file mode 100644 index 000000000..b77f2cddd --- /dev/null +++ b/docs/source/validation_rules.rst @@ -0,0 +1,720 @@ +================ +Validation Rules +================ + +.. contents:: + :depth: 2 + :local: + :backlinks: entry + +Overview +======== + +When Schematic validates a manifest, it uses a data model. The data model contains a list of validation rules for each component (data type). This document describes all allowed validation rules currently implemented by Schematic.
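 + +As a quick orientation, validation rules are attached to attributes in the data model. In a CSV data model this is typically done in the ``Validation Rules`` column; the row below is a hypothetical sketch rather than an excerpt from a real model: + +.. code-block:: text + + Attribute,Description,Valid Values,Required,Validation Rules + Patient ID,Unique patient identifier,,TRUE,regex search [0-9]{4}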
+ +Rules change the validation behavior; they must be taken from the following pre-specified list, formatted in the indicated ways, and added to the data model in order to apply. An `example data model `_ using each rule is available for reference. + +The column the rule refers to must be in the manifest for validation to happen. + +Rules can optionally be configured to raise errors (which prevent manifest submission) or warnings (which allow submission) when invalid entries are found within the attribute the rule is set for. Validators will be notified of the invalid values in both cases. Default message levels for each rule can be found below under each rule. + +Attributes that are not required will raise warnings when invalid entries are identified. If a user does not submit a response for an attribute that is not required, no warning or error will be logged. If you want an error raised for non-entries, set Required to True in the data model. + +Metadata validation is an explicit step in the submission pipeline and must be run either before or during manifest submission. + +Validation Types +================ + +Validation rules are just one type of validation run by Schematic to ensure that submitted manifests conform to the expectations set by the data model. + +This page details how to use Validation Rules, but please refer to `this documentation `_ to learn about the other types of validation. + +Rule Implementation +=================== + +Some validation rules are handled by Schematic itself, while others are handled using the Great Expectations library. + +================ ======== ======================= ====================== +Rule In-House Great Expectations (GX) JSON Schema Validation +================ ======== ======================= ====================== +list ✓ +regex module ✓ +float ✓ ✓ +int ✓ ✓ +num ✓ ✓ +string ✓ ✓ +url ✓ +matchAtLeastOne ✓ +matchExactlyOne ✓ +matchNone ✓ +recommended ✓ +protectAges ✓ +unique ✓ +inRange ✓ +date ✓ +required ✓ +valid values ✓ +================ ======== ======================= ====================== + +Rule Types and Details +====================== + +List Validation Type +-------------------- + +list +~~~~ + +- Use to parse the imported value to a list of values and (optionally) to verify that the user-provided value was a comma-separated list, depending on how strictly entries must conform to the list structure. Values can come from Valid Values. + +- Format: + + - ``list `` + + - ``list strict`` + + - Validates that entries are comma-separated lists, and parses them into a list + + - Requires all attribute entries to be comma-delimited, even lists with only one element (lists with a trailing comma) + + - ``list like`` + + - Assumes entries are either lists or list-like; does not verify that entries are comma-separated lists, and attempts to parse them into a list + + - Single values, or lists of length one, can be entered without a comma delimiter + +- The ``list`` rule can be used in conjunction with the ``regex`` rule to validate that the items in a list follow a specific pattern. + + - See the ``list::regex`` rule below in rule combinations. + + - All the values in the list need to follow the same pattern. This is ideal for when users need to provide a list of IDs. + +- Default behavior: raises ``error`` + +Regex Validation Type +--------------------- + +regex +~~~~~ + +- Use the ``regex`` validation rule when you want to require that a user input values in a specific format, e.g. an ID that follows a particular format.
+ +- Format: + + - ``regex `` + + - Module: the Python ``re`` module method that you want to use. A common one is ``search``. Refer to Python ``re`` source material to find the most appropriate module to use. + + - Single spaces separate the three strings. + +- Example: + + - ``regex search [0-9]{4}\/[0-9]*`` + + - The regular expression defined above allows comparison to an expected format of a histological morphology code. + +- Default behavior: raises ``error`` + +.. note:: + `regex101.com `_ is a tool that can be used to build and validate the behavior of your regular expression. + If the module specified is ``match`` for a given attribute's validation rule, regex match validation will be performed in Google Sheets (but not Excel) in real time during metadata entry. + The ``strict_validation`` parameter (in the `config.yml `_ file for the CLI, or in manifest generation REST API calls) sets whether to stop the user from entering incorrect information in a Google Sheets cell (``strict_validation = true``) or simply to throw a warning (``strict_validation = false``). Default: ``true``. + ``regex`` validation in Google Sheets is different from standard regex validation (for example, it does not support validation of digits). See `this documentation `_ for details on Google regex syntax. It is up to the user/modeler to validate that ``regex match`` is working in their manifests as intended. This is especially important if the ``strict_validation`` parameter is set to ``True``, as users will be blocked from entering incorrect data. If you are using Google Sheets and do not want to use real-time validation, use ``regex search`` instead of ``regex match``. + + +Type Validation Type +-------------------- + +- Format: + + - `` `` + + - The first parameter is the type and must be one of [ ``float``, ``int``, ``num``, ``str``] + + - The second optional parameter is the msg level and must be one of [ ``error``, ``warning`` ]; defaults to ``error``. + +- Examples: [ ``str``, ``str error``, ``str warning``] + +float +~~~~~ + +- Checks that the value is a float. + +int +~~~ + +- Checks that the value is an integer. + +num +~~~ + +- Checks that the value is either an integer or float. + +str +~~~ + +- Checks that the value is a string (not a number). + +URL Validation Type +------------------- + +url +~~~ + +- Using the ``url`` rule implies the user should add a URL to a free text box as a string. This function will check that the user has provided a usable URL. It will check for any standard URL error and throw an error if one is found. Further additions to this rule can allow for checking that a specific type of URL is added. For example, if the user needs to ensure that the input contains an http://protocols.io URL string, http://protocols.io can be added after url to perform this check. + +- Format: + + - ``url `` + + - ``url`` must be specified first, then an arbitrary number of strings can be added after (separated by spaces) to add additional levels of specificity. + + - Alternatively, it's valid to pass only ``url`` to simply check if the input is a URL. + +- Examples: + + - ``url http://protocols.io`` Will check that any input is a valid URL, and will also check to see that the URL contains the string ``http://protocols.io``. If not, an error will be raised. + + - ``url dx.doi http://protocols.io`` Will check that any input is a valid URL, and will also check to see that the URL contains the strings ``dx.doi`` and ``http://protocols.io``. If not, an error will be raised.
+ +- Default behavior: raises ``error`` + +Required Validation Type +------------------------ + +required +~~~~~~~~ + + +An attribute's requirement is typically set using the required column (CSV) or field (JSONLD) in the data model. A ``True`` value means a user must supply a value; ``False`` means they are allowed to skip providing a value. + +Some users may want to use the same attribute across several manifests, but have different requirements based on the manifest/component. For example, say the data model contains an attribute called PatientID, and this attribute is used in the Biospecimen, Patient, and Demographics manifests. Say the modeler wants PatientID to be required in the Patient manifest but not in Biospecimen or Demographics. In the standard Data Model format, there is only one requirement option per Attribute, so one would not be able to set requirements per component. But with the advent of component based rule settings, this can now be achieved. + +Requirements can be specified per component by setting the required field in the data model to ``False``, and using component based rule setting along with the required "rule". + +.. note:: + This new required validation rule is not a traditional validation rule, but rather impacts the JSON validation schema. This means requirements propagate automatically to manifests as well. + + + +When using the ``required`` validation rule, the ``Required`` column must be ``False`` in the CSV, or the ``Required`` field must be set to ``False`` in the JSONLD; otherwise the rule will not work as expected (i.e. components where the attribute is expected to not be required due to the validation rules will still be required). + +.. note:: + + While using the CLI, a warning will be raised if discrepancies in requirement settings are found when running validation. + +- ``required`` can be used in conjunction with other rules, without restriction. + +- The messaging level, like all JSON validation checks, is always set at ``error``, and is not modifiable. + +- ``required`` does not work with other rule modifiers, such as ``warning``, ``error``, etc. + + - Though it will not throw an error if rule modifiers are added, it will not work as intended, and a warning will appear + + - For example, if the rule ``^^#Biospecimen required warning`` is added to the data model, a warning will be raised letting the user know that the rule modifier cannot be applied to required. + +- Using the ``required`` validation rule is the equivalent of putting ``True`` in the ``Required`` column of the CSV. If the ``Required`` column is ``False``, and the ``required`` validation rule is used, the validation rule will override the ``Required`` column. + +- Controlling ``required`` through the validation rule will also impact Manifest formatting (in terms of required column highlighting). + + - To verify that the ``required`` rule is working as expected, you can generate all impacted manifests; required columns should appear highlighted in light blue. + +Examples: + +- ``#BiospecimenManifest required`` + + - For ``BiospecimenManifest`` manifests, if values are missing, an error will be raised. + + - For all other manifests, filling out values for the attribute is optional. + +- ``#Demographics required^^#BiospecimenManifest required^^`` + + - For ``Demographics`` and ``BiospecimenManifest`` manifests, values must be supplied; if they are not, an error will be raised. + + - For all other manifests, this attribute is not required.
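 + +Putting the pieces together, a hypothetical CSV data model row using the component-based ``required`` rule could look like the sketch below; note the ``Required`` column is set to ``FALSE`` so that the rule, not the column, drives the requirement: + +.. code-block:: text + + Attribute,Description,Required,Validation Rules + Patient ID,Unique patient identifier,FALSE,#BiospecimenManifest required^^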
+ +Cross-manifest Validation Type +------------------------------ + +Use cross-manifest validation rules when you want to check the values of an attribute in the manifest being validated against an attribute in the manifest(s) of a different component. For example, if a sample manifest has a patient id attribute and you want to check it against the id attribute of patient manifests. + +The format for cross-validation is: `` . `` + +There are three rules that do cross-manifest validation: [``matchAtLeastOne``, ``matchExactlyOne``, ``matchNone``] + +There are two scopes to choose from: [ ``value``, ``set``] + +Value Scope +~~~~~~~~~~~ + +When the value scope is used, all values from the target attribute in all target manifests are combined. The values from the manifest being validated are compared to this combined list. In other words, there is no distinction between what values came from what target manifest. + +matchAtleastOne Value Scope +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The manifest is validated if each value in the attribute being validated exists at least once in the combined values of the target attribute of the target manifests. + +matchExactlyOne Value Scope +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The manifest is validated if each value in the attribute being validated exists once, and only once, in the combined values of the target attribute of the target manifests. + +matchNone Value Scope +^^^^^^^^^^^^^^^^^^^^^ + +The manifest is validated if each value in the attribute being validated does not exist in the combined values of the target attribute of the target manifests. + +Example 1 +^^^^^^^^^ + +Tested manifest: ["A"] + +Target manifests: ["A", "B"] + +- matchExactlyOne: passes + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in the target manifest + +Example 2 +^^^^^^^^^ + +Tested manifest: ["A", "C"] + +Target manifests: ["A", "B"] + +- matchExactlyOne: fails + + - because "C" is not in the target manifest + +- matchAtleastOne: fails + + - because "C" is not in the target manifest + +- matchNone: fails + + - because "A" is in the target manifest + +Example 3 +^^^^^^^^^ + +Tested manifest: ["C"] + +Target manifests: ["A", "B"] + +- matchExactlyOne: fails + + - because "C" is not in the target manifest + +- matchAtleastOne: fails + + - because "C" is not in the target manifest + +- matchNone: passes + +Example 4 +^^^^^^^^^ + +Tested manifest: ["A", "A"] + +Target manifests: ["A", "B"] + +- matchExactlyOne: passes + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in the target manifest + +Example 5 +^^^^^^^^^ + +Tested manifest: ["A"] + +Target manifests: ["A", "A"] + +- matchExactlyOne: fails + + - because "A" is in the target manifest twice + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in the target manifest + +Example 6 +^^^^^^^^^ + +Tested manifest: ["A"] + +Target manifests: ["A"], ["A"] + +- matchExactlyOne: fails + + - because "A" is in both target manifests + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in the target manifest + +Example 7 +^^^^^^^^^ + +Tested manifest: ["A"] + +Target manifests: ["A", "B"], ["A", "B"] + +- matchExactlyOne: fails + + - because "A" is in both target manifests + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in the target manifest + +Set Scope +~~~~~~~~~ + +When the set scope is used, the values from the tested manifest are compared **one at a time** against each target manifest, and the number of matches is counted.
The test to determine if the tested manifest matches the target manifest is to see if the tested manifest values are a subset of the target manifest values. Imagine a target manifest whose values are ["A", "B", "C"]: + +- [ ], ["A"], ["A", "A"], ["A", "B", "C"] are all subsets of the example target manifest. + +- [1], ["D"], ["D", "D"], ["D", "E"] are not subsets of the example target manifest. + +matchAtleastOne Set Scope +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The manifest is validated if there is at least one set match between the tested manifest and the target manifests. + +matchExactlyOne Set Scope +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The manifest is validated if there is one and only one set match between the tested manifest and the target manifests. + +matchNone Set Scope +^^^^^^^^^^^^^^^^^^^ + +The manifest is validated if there are no set matches between the tested manifest and the target manifests. + +Example 1 +^^^^^^^^^ + +Tested manifest: ["A"] + +Target manifests: ["A", "B"] + +- matchExactlyOne: passes + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in the target manifest + +Example 2 +^^^^^^^^^ + +Tested manifest: ["A"] + +Target manifests: ["A", "B"], ["C", "D"] + +- matchExactlyOne: passes + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in at least one of the target manifests + +Example 3 +^^^^^^^^^ + +Tested manifest: ["A"] + +Target manifests: ["A", "B"], ["A", "B"] + +- matchExactlyOne: fails + + - because "A" is in more than one target manifest + +- matchAtleastOne: passes + +- matchNone: fails + + - because "A" is in at least one of the target manifests + +Example 4 +^^^^^^^^^ + +Tested manifest: ["C"] + +Target manifests: ["A", "B"] + +- matchExactlyOne: fails + + - because "C" is not in the target manifest + +- matchAtleastOne: fails + + - because "C" is not in the target manifest + +- matchNone: passes + +Content Validation Type +----------------------- + +Rules can be used to validate the contents of entries for an attribute. + +recommended +~~~~~~~~~~~ + +- Use to raise a warning when a manifest column is not required but empty. If an attribute is always necessary, then ``required`` should be set to ``TRUE`` instead of using the ``recommended`` validation rule. + +- Format: + + - ``recommended `` + +- Examples: + + - ``recommended`` + +- Default behavior: raises ``warning`` + +protectAges +~~~~~~~~~~~ + +- Use to ensure that patient ages under 18 and over 89 years of age are censored when uploading for sharing. If necessary, a censored version of the manifest will be created and uploaded along with the uncensored version. Uncensored versions will be uploaded as restricted, and Terms of Use will need to be set. Please follow up with governance after upload to set the terms of use. + +- Format: + + - ``protectAges `` + +- Examples: + + - ``protectAges warning`` + +- Default behavior: raises ``warning`` + +unique +~~~~~~ + +- Use to ensure that attribute values are not duplicated within a column.
+ +- Format: + + - ``unique `` + +- Examples: + + - ``unique error`` + +- Default behavior: raises ``error`` + +inRange +~~~~~~~ + +- Use to ensure that numerical data is within a specified range + +- Format: + + - ``inRange `` + +- Examples: + + - ``inRange 50 100 error`` + +- Default behavior: raises ``error`` + +date +~~~~ + +- Use to ensure the value parses as a date + +- Uses ``dateutil`` to parse the value + + - Can parse many formats + + - YYYY-MM-DD format is recommended + + - Every value must be readable as a string, so avoid formats such as YYYYDDMM that would be read in as an int + +- Default behavior: raises ``error`` + +Filename Validation +------------------- + +This requires paths to be enabled for the Synapse master file view in use. Paths can be enabled by navigating to an existing view and selecting ``show view schema`` > ``edit schema`` > ``add default view columns`` > ``save``. Paths are enabled on new views by default. + +This should be used only with the Filename attribute in a data model and specified with `Component Based Rule Setting `_ + +filenameExists +~~~~~~~~~~~~~~ + +- Used to validate that the filenames and paths as they exist in the metadata manifest match the paths that are in the Synapse master File View for the specified dataset + + - Conditions in which an error is raised: + + - ``missing entityId``: The entityId field for a manifest row is null or an empty string + + - ``entityId does not exist``: The entityId provided for a manifest row does not exist within the specified dataset's file view + + - ``path does not exist``: The Filename in the manifest row does not exist within the specified dataset's file view + + - ``mismatched entityId``: The entityId and Filename do not match the expected values from the specified dataset's file view + +- Format: + + - ``filenameExists `` + +- Example: + + - This sets the rule for the MockFilename component ONLY with the specified dataset scope syn61682648 + + - ``#MockFilename filenameExists syn61682648^^`` + +- Default behavior: raises ``error`` + +Given this File View:: + + id,path + syn61682653,schematic - main/MockFilenameComponent/txt1.txt + syn61682659,schematic - main/MockFilenameComponent/txt4.txt + syn61682660,schematic - main/MockFilenameComponent/txt2.txt + syn61682662,schematic - main/MockFilenameComponent/txt3.txt + syn63141243,schematic - main/MockFilenameComponent/txt6.txt + + +We get the following results for this Manifest:: + + + Component,Filename,entityId + MockFilename,schematic - main/MockFilenameComponent/txt1.txt,syn61682653 # Pass + MockFilename,schematic - main/MockFilenameComponent/txt2.txt,syn61682660 # Pass + MockFilename,schematic - main/MockFilenameComponent/txt3.txt,syn61682653 # mismatched entityId + MockFilename,schematic - main/MockFilenameComponent/this_file_does_not_exist.txt,syn61682653 # path does not exist + MockFilename,schematic - main/MockFilenameComponent/txt4.txt,syn6168265 # entityId does not exist + MockFilename,schematic - main/MockFilenameComponent/txt6.txt, # missing entityId + + +Rule Combinations +----------------- + +Schematic allows certain combinations of existing validation rules to be used on a single attribute, where appropriate. + +.. note:: + - The following are the tested and validated combinations; all other combinations are not officially supported. + - isNa and required can be combined with all rules and rule combos.
+ +Rule combinations: [``list::regex``, ``int::inRange``, ``float::inRange``, ``num::inRange``, ``protectAges::inRange``] + +- Format: + + - `` :: `` + + - ``::`` delimiter used to separate each rule + +- Example: + + - ``list :: regex search [HTAN][0-9]{1}_[0-9]{4}_[0-9]*`` + +Component-Based Rule Setting +---------------------------- + +**Component-Based Rule Setting** is a powerful feature in data modeling that enables users to create rules tailored to specific subsets of components or manifests. This functionality was developed to address scenarios where a data modeler needs to enforce uniqueness for certain attribute values within one manifest while allowing non-uniqueness in another. + +Here's how it works: + +1. **Rule Definition at Attribute Level**: Rules are defined at the attribute level within the data model. + +2. **Manifest-Level Referencing**: These rules can then be applied (or not) to specific manifests within the data model. This means that rules can be selectively enforced based on the manifest they're associated with. + +This feature offers flexibility and applicability beyond its original use case. The **Component-Based Rule Setting** feature provides users with the following options: + +- **Apply a Rule to All Manifests Except Specified Ones**: Users can define a rule that applies to all manifests within the data model except for those explicitly specified. In cases where exceptions are specified, users have the flexibility to define unique rules for these exceptions or opt not to apply any rule at all. + +- **Specify a Rule for a Single Manifest**: Alternatively, users can specify a rule that applies to a single manifest exclusively. This allows for fine-grained control over rule enforcement at the manifest level. + +- **Unique Rules for Each Manifest**: Users can also define unique rules for each manifest within the data model. This enables tailored rule enforcement based on the specific requirements and characteristics of each manifest. + +By leveraging Component-Based Rule Setting, data modelers can enforce rules across their data models with greater precision and flexibility, ensuring data integrity while accommodating diverse use cases and requirements. + +.. note:: + - All restrictions to rule combos and implementation also apply to component based rules. + - As always, try the rule combos with mock data to ensure they are working as intended before using in production. + +- Format: + + - ``^^`` Double carets indicate that Component-Based rules are being set + + - Use ``^^`` to separate component rule sets + + - ``#`` in the first position (prior to the rule) defines the component/manifest to apply the rule to + + - The ``#`` character cannot be used without the ``^^`` to indicate component rule sets + +- Use case: + + - Apply a rule to all manifests *except* the specified set. + + - ``validation_rule^^#ComponentA`` + + - ``validation_rule^^#ComponentA^^#ComponentB`` + + - Apply a unique rule to each manifest.
+ + - ``#ComponentA validation_rule_1^^#ComponentB validation_rule_2^^#ComponentC validation_rule_3`` + + - For the specified manifest, apply the given validation rule, but for all others, run a different rule + + - ``#ComponentA validation_rule_1^^validation_rule_2`` + + - ``validation_rule_2^^#ComponentA validation_rule_1`` + + - Apply the validation rule to only one manifest + + - ``#ComponentA validation_rule_1^^`` + +- Example Rules: + + - Test by adding these rules to the ``Patient ID`` attribute in the ``example.model.csv`` model, then run validation with new rules against the example manifests. + + - `Example Biospecimen Manifest `_ + + - `Example Patient Manifest `_ + + - **Rule**: ``#Patient int::inRange 100 900 error^^#Biospecimen int::inRange 100 900 warning`` + + - For the ``Patient`` manifest, apply the combo ``rule int::inRange 100 900`` at the ``error`` level. + + - The value provided must be an integer in the range of 100-900; if it does not fall in the range, throw an error + + - For the ``Biospecimen`` manifest, apply the combo rule ``int::inRange 100 900`` at the ``warning`` level + + - The value provided must be an integer in the range of 100-900; if it does not fall in the range, throw a warning + + - **Rule**: ``#Patient int::inRange 100 900 error^^int::inRange 100 900 warning`` + + - For the ``Patient`` manifest, apply rule ``int::inRange 100 900`` at an ``error`` level + + - For all other manifests, apply the ``rule int::inRange 100 900`` at a warning level + + - **Rule**: ``#Patient^^int::inRange 100 900 warning`` + + - For all manifests except ``Patient`` apply the rule ``int::inRange 100 900`` at the ``warning`` level + + - **Rule**: ``int::inRange 100 900 error^^#Biospecimen`` + + - Apply the rule ``int::inRange 100 900 error``, to all manifests except ``Biospecimen`` + + - **Rule**: ``#Patient unique error^^`` + + - To the ``PatientManifest`` only, apply the ``unique`` validation rule at the ``error`` level diff --git a/poetry.lock b/poetry.lock index b9b18929f..49be8b9bd 100644 --- a/poetry.lock +++ b/poetry.lock @@ -4193,6 +4193,25 @@ click = ">=7.0" docutils = "*" sphinx = ">=2.0" +[[package]] +name = "sphinx-rtd-theme" +version = "3.0.1" +description = "Read the Docs theme for Sphinx" +optional = false +python-versions = ">=3.8" +files = [ + {file = "sphinx_rtd_theme-3.0.1-py2.py3-none-any.whl", hash = "sha256:921c0ece75e90633ee876bd7b148cfaad136b481907ad154ac3669b6fc957916"}, + {file = "sphinx_rtd_theme-3.0.1.tar.gz", hash = "sha256:a4c5745d1b06dfcb80b7704fe532eb765b44065a8fad9851e4258c8804140703"}, +] + +[package.dependencies] +docutils = ">0.18,<0.22" +sphinx = ">=6,<9" +sphinxcontrib-jquery = ">=4,<5" + +[package.extras] +dev = ["bump2version", "transifex-client", "twine", "wheel"] + [[package]] name = "sphinxcontrib-applehelp" version = "2.0.0" @@ -4241,6 +4260,20 @@ lint = ["mypy", "ruff (==0.5.5)", "types-docutils"] standalone = ["Sphinx (>=5)"] test = ["html5lib", "pytest"] +[[package]] +name = "sphinxcontrib-jquery" +version = "4.1" +description = "Extension to include jQuery on newer Sphinx releases" +optional = false +python-versions = ">=2.7" +files = [ + {file = "sphinxcontrib-jquery-4.1.tar.gz", hash = "sha256:1620739f04e36a2c779f1a131a2dfd49b2fd07351bf1968ced074365933abc7a"}, + {file = "sphinxcontrib_jquery-4.1-py2.py3-none-any.whl", hash = "sha256:f936030d7d0147dd026a4f2b5a57343d233f1fc7b363f68b3d4f1cb0993878ae"}, +] + +[package.dependencies] +Sphinx = ">=1.8" + [[package]] name = "sphinxcontrib-jsmath" version = "1.0.1" diff --git 
a/pyproject.toml b/pyproject.toml index 23ae91f92..a5dc23a8d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -98,6 +98,7 @@ pre-commit = "^3.6.2" [tool.poetry.group.doc.dependencies] pdoc = "^14.0.0" +sphinx-rtd-theme = "3.0.1" sphinx = "7.3.7" sphinx-click = "4.4.0" diff --git a/schematic/store/synapse.py b/schematic/store/synapse.py index 6e897e408..8410f0a1e 100644 --- a/schematic/store/synapse.py +++ b/schematic/store/synapse.py @@ -3247,7 +3247,7 @@ def upsertTable(self, dmge: DataModelGraphExplorer): Method to upsert rows from a new manifest into an existing table on synapse For upsert functionality to work, primary keys must follow the naming convention of _id `-tm upsert` should be used for initial table uploads if users intend to upsert into them at a later time; using 'upsert' at creation will generate the metadata necessary for upsert functionality. - Currently it is required to use -dl/--use_display_label with table upserts. + Currently it is required to use -tcn "display label" with table upserts. Args: