Merged
103 commits
391aabe
.
jaymedina Oct 31, 2024
3d1fecb
Also lock sphinx-click
jaymedina Oct 31, 2024
b8d6f6c
Update README.md
thomasyu888 Nov 1, 2024
97abaaa
move documentation libraries to doc group. update README. write new l…
jaymedina Nov 1, 2024
e5cfa79
Update copyright
thomasyu888 Nov 2, 2024
a80c1c3
Merge branch 'fds-2449-fix-rtd' into thomasyu888-patch-1
thomasyu888 Nov 2, 2024
7e83244
update version and copyright
thomasyu888 Nov 2, 2024
de14646
Add in documentation
thomasyu888 Nov 2, 2024
e0d3e10
Add to documentation
thomasyu888 Nov 2, 2024
070d58b
Edit dependencies
thomasyu888 Nov 2, 2024
028f6e8
Add to documentation
thomasyu888 Nov 2, 2024
2ed9764
edit documentation
thomasyu888 Nov 2, 2024
4ea3122
Add more details
thomasyu888 Nov 2, 2024
356a722
Add in configuration
thomasyu888 Nov 2, 2024
b1493b0
Add in more terms
thomasyu888 Nov 2, 2024
92c2db6
Edit
thomasyu888 Nov 2, 2024
99bc347
Add documentation
thomasyu888 Nov 2, 2024
5b99d89
Add documentation
thomasyu888 Nov 2, 2024
ceefd9e
Add --all-extras
thomasyu888 Nov 2, 2024
55498a9
Add typing extensions
thomasyu888 Nov 2, 2024
7292112
Add details
thomasyu888 Nov 2, 2024
8d272ef
Add documentation
thomasyu888 Nov 2, 2024
625d712
Add tutorials and troubleshooting docs
thomasyu888 Nov 5, 2024
198ed52
Add tutorial for contributing manifests
thomasyu888 Nov 5, 2024
b759ba8
Add a line
thomasyu888 Nov 5, 2024
c67ce61
Add to documentation
thomasyu888 Nov 5, 2024
8cf4342
Remove precommit
thomasyu888 Nov 5, 2024
2c1f491
Add etag value error
thomasyu888 Nov 5, 2024
abef1f6
Add schematic config section
thomasyu888 Nov 5, 2024
ad3b402
Add documentation
thomasyu888 Nov 5, 2024
0479961
Add debugging string
thomasyu888 Nov 5, 2024
f9bd85f
Fix merge conflicts
thomasyu888 Nov 5, 2024
ed2b231
Add notes
thomasyu888 Nov 5, 2024
9bf1200
Use top level fodler
thomasyu888 Nov 5, 2024
b49ec85
Fix
thomasyu888 Nov 5, 2024
1d36c37
Edit
thomasyu888 Nov 5, 2024
1ee5a80
Add more documentation around asset stores
thomasyu888 Nov 5, 2024
d50e7a6
Update section title
thomasyu888 Nov 5, 2024
42101f7
Fix formatting
thomasyu888 Nov 6, 2024
3f4b563
Edit troubleshotting docs
thomasyu888 Nov 6, 2024
3abe8ef
Add more inforamtion
thomasyu888 Nov 6, 2024
fb91a47
Add notes
thomasyu888 Nov 6, 2024
0b6ef68
Update docs
thomasyu888 Nov 6, 2024
9bd4055
Update docs
thomasyu888 Nov 6, 2024
1b43078
Add data layout
thomasyu888 Nov 6, 2024
5de2a25
Fix
thomasyu888 Nov 6, 2024
35a116d
Fix lock file
thomasyu888 Nov 7, 2024
b82d982
Fix merge conflicts
thomasyu888 Nov 7, 2024
16df9fc
Fix merge fonflicts
thomasyu888 Nov 14, 2024
a0d9da8
Add in permissions
thomasyu888 Nov 19, 2024
a49a40c
Fix merge conflicts
thomasyu888 Nov 19, 2024
b1cf2dd
Merge branch 'develop' into thomasyu888-patch-1
thomasyu888 Dec 13, 2024
970d0a6
Update docs
thomasyu888 Dec 13, 2024
558f382
Merge branch 'develop' into thomasyu888-patch-1
thomasyu888 Dec 17, 2024
8fd545b
Add documentation
thomasyu888 Dec 17, 2024
500f5b9
Fix merge conflicts
thomasyu888 Dec 18, 2024
16a23e2
Merge branch 'develop' into thomasyu888-patch-1
thomasyu888 Mar 11, 2025
db96546
Remove trailing white space
thomasyu888 Mar 11, 2025
5cbe625
Update poetry lock file
thomasyu888 Mar 11, 2025
4ff05ee
Lock urllib version
thomasyu888 Mar 11, 2025
d6d1a36
First comment out the mypy step
thomasyu888 Mar 11, 2025
91f57c9
add markdown file linking to LinkML docs
andrewelamb Apr 2, 2025
9436d14
Revert "add markdown file linking to LinkML docs"
andrewelamb Apr 2, 2025
489e2f5
[schematic-252] Added documentation for manifest generation (#1583)
linglp Apr 3, 2025
8dc9264
[schematic-254] added documentation for manifest submission (#1584)
linglp Apr 3, 2025
b709a5a
[SCHEMATIC-250] Add markdown file linking to LinkML docs (#1588)
andrewelamb Apr 3, 2025
a870b00
Fix merge conflicts
thomasyu888 Apr 3, 2025
31974f0
Schematic 253 validation documentation (#1589)
SageGJ Apr 4, 2025
f0f7f6f
Reorder pages
thomasyu888 Apr 4, 2025
a4cf013
Add mypy back in
thomasyu888 Apr 4, 2025
a09e26f
[SCHEMATIC-264] Added validation rules doc (#1590)
andrewelamb Apr 7, 2025
566f036
Update pyproject.toml
thomasyu888 Apr 7, 2025
f8a9cc8
Fix comment
thomasyu888 Apr 7, 2025
fe03a62
Use note:
thomasyu888 Apr 7, 2025
bc4056f
Reorder documentation
thomasyu888 Apr 7, 2025
ecff35c
Add notes
thomasyu888 Apr 8, 2025
186fb4d
Try checkmark
thomasyu888 Apr 8, 2025
a63348c
Add ToC
thomasyu888 Apr 8, 2025
e2a94b1
Use checkmarks
thomasyu888 Apr 8, 2025
32422a4
Fix url linking
thomasyu888 Apr 8, 2025
9a60e77
update trouble shooting guide
linglp Apr 9, 2025
8f7b872
fix format
linglp Apr 9, 2025
f5b57f1
reduce unnecessary spacing
linglp Apr 9, 2025
1bd5d91
fix format
linglp Apr 9, 2025
4bd66fc
fix format
linglp Apr 9, 2025
391af4f
fix format
linglp Apr 9, 2025
a67034e
fix format
linglp Apr 9, 2025
594138d
fix format
linglp Apr 9, 2025
d552cee
fix indentation
linglp Apr 9, 2025
b0be03d
add example and clarify
linglp Apr 9, 2025
f2390d0
fix format
linglp Apr 9, 2025
9fd504e
fix indentation
linglp Apr 9, 2025
1493918
add instructions to clarify
linglp Apr 9, 2025
852d088
Update docs/source/troubleshooting.rst
linglp Apr 10, 2025
7568b33
add synapse reserved word to trouble shooting guide
linglp Apr 10, 2025
ebe9035
Merge branch 'schematic-272-doc' of https://github.com/Sage-Bionetwor…
linglp Apr 10, 2025
4110aec
fix indentation
linglp Apr 10, 2025
47b009a
fix indentation
linglp Apr 10, 2025
b52114a
[SCHEMATIC-264] Additional validation rules documentation fixes (#1595)
andrewelamb Apr 11, 2025
57334a6
adding more reserved word from entity view
linglp Apr 11, 2025
e5869db
edit description
linglp Apr 11, 2025
2db71ac
erge branch 'develop' into schematic-272-doc
linglp Apr 11, 2025
42b9fd9
Merge pull request #1593 from Sage-Bionetworks/schematic-272-doc
linglp Apr 14, 2025
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2021 Sage Bionetworks
Copyright (c) 2025 Sage Bionetworks

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
19 changes: 15 additions & 4 deletions README.md
@@ -570,12 +570,23 @@ For internal developers with access to SigNoz Cloud, you can obtain an ingestion

# Contributors

Main contributors and developers:

Sage main contributors and developers:

- [Gianna Jordan](https://github.com/giajordan)
- [Lingling Peng](https://github.com/linglp)
- [Bryan Fauble](https://github.com/BryanFauble)
- [Andrew Lamb](https://github.com/andrewelamb)
- [Brad Macdonald](https://github.com/BWMac)
- [Milen Nikolov](https://github.com/milen-sage)

## Alumni
- [Mialy DeFelice](https://github.com/mialy-defelice)
- [Sujay Patil](https://github.com/sujaypatil96)
- [Bruno Grande](https://github.com/BrunoGrandePhD)
- [Robert Allaway](https://github.com/allaway)
- [Gianna Jordan](https://github.com/giajordan)
- [Lingling Peng](https://github.com/linglp)
- [Jason Hwee](https://github.com/hweej)
- [Xengie Doan](https://github.com/xdoan)
- [James Eddy](https://github.com/jaeddy)
- [Yooree Chae](https://github.com/ychae)

See all [contributors](https://github.com/Sage-Bionetworks/schematic/graphs/contributors)
138 changes: 138 additions & 0 deletions docs/source/asset_store.rst
@@ -0,0 +1,138 @@
Setting up your asset store
===========================

.. note::

    You can ignore this section if you are just trying to contribute manifests.

This document covers the minimal recommended elements needed in Synapse to interface with the Data Curator App (DCA) and provides options for Synapse project layout.

There are two options for setting up a DCC Synapse project:

1. **Distributed Projects**: Each team of DCC contributors has its own Synapse project that stores the team's datasets.
2. **Single Project**: All DCC datasets are stored in the same Synapse project.

In each of these project setups, there are two ways you can lay out your data:

1. **Flat Data Layout**: All top level folders sit directly under the project.

   .. code-block:: shell

       my_flat_project
       ├── biospecimen
       └── clinical

2. **Hierarchical Data Layout**: Top level folders are nested inside other folders and are annotated with ``contentType: dataset``.

   .. note::

       This requires you to add the column ``contentType`` to your fileview schema.

   .. code-block:: shell

       my_hierarchical_project
       ├── biospecimen
       │   ├── experiment_1 <- annotated
       │   └── experiment_2 <- annotated
       └── clinical
           ├── batch_1 <- annotated
           └── batch_2 <- annotated


Option 1: Distributed Synapse Projects
--------------------------------------

Pick **option 1** if you answer "yes" to one or more of the following questions:

- Does the DCC have multiple contributing institutions/labs, each with different data governance and access controls?
- Does the DCC have multiple institutions with limited cross-institutional sharing?
- Will contributors submit more than 100 datasets per release or per month?
- Are you unwilling to annotate each DCC dataset folder with the annotation ``contentType: dataset``?

Access & Project Setup - Multiple Contributing Projects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Create a DCC Admin Team with admin permissions.
2. Create a Team for each data contributing institution. Begin with a "Test Team" if not all teams have been identified yet.
3. Create a Synapse Project for each institution and grant the respective team **Edit** level access.

- E.g., for institutions A, B, and C, create Projects A, B, and C with Teams A, B, and C. Team A has **Edit** access to Project A, etc.

4. Within each project, create "top level folders" in the **Files** tab for each dataset type.
5. Create another Synapse Project (e.g., MyDCC) containing the main **Fileview**, whose scope includes all the DCC projects.

- Ensure all teams have **Download** level access to this file view.
- Include both file and folder entities and add **ALL default columns**.

.. note::

    If you want to upload data using the hierarchical data layout, you can still use
    distributed projects. Add the ``contentType`` column to your fileview and
    annotate your top level folders with ``contentType: dataset``.


Option 2: Single Synapse Project
--------------------------------

Pick **option 2** if you don't select option 1 and you answer "yes" to any of these questions:

- Does the DCC have a project with pre-existing datasets in a complex folder hierarchy?
- Does the DCC envision collaboration on the same dataset collection across multiple teams with shared access controls?
- Are you willing to set up local access control for each dataset folder and annotate each with ``contentType: dataset``?

If neither option fits, select option 1.
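
The selection rule for the two options can be written out as a short function. This is only an illustrative sketch of the checklist logic above; the function name ``choose_project_setup`` and its arguments are hypothetical and are not part of Schematic.

```python
def choose_project_setup(option1_answers, option2_answers):
    """Return 1 (distributed projects) or 2 (single project).

    Each argument is a sequence of booleans, one per question in the
    corresponding checklist. Hypothetical helper, not a Schematic API.
    """
    # Option 1 wins as soon as any of its questions is answered "yes".
    if any(option1_answers):
        return 1
    # Option 2 is only considered when option 1 was not selected.
    if any(option2_answers):
        return 2
    # Neither checklist applies: the guide falls back to option 1.
    return 1


# No "yes" for option 1, one "yes" for option 2.
print(choose_project_setup([False, False, False, False], [True, False, False]))
```

Keeping the fallback explicit makes it clear that option 1 is the default when the answers are inconclusive.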


Access & Project Setup - Single Contributing Project
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Create a Team for each data contributing institution.
2. Create a single Synapse Project (e.g., MyDCC).
3. Within this project, create dataset folders for each contributor. Organize them as needed.

   - Annotate each top level folder with ``contentType: dataset``. These folders must have unique names and must not be nested inside other dataset folders.
     Taking the above example, you cannot have something like this:

     .. code-block:: shell

         my_hierarchical_project
         ├── biospecimen
         │   ├── experiment_1 <- annotated
         │   └── experiment_2 <- annotated
         └── clinical
             ├── experiment_1 <- this is not allowed, because experiment_1 is duplicated
             └── batch_2 <- annotated

4. In MyDCC, create the main **DCC Fileview** with ``MyDCC`` as the scope, and add the ``contentType`` column to the schema.

- Ensure all teams have **Download** level access to this file view.
- Add both file and folder entities and add **ALL default columns**.

.. note::

    You can technically use the flat data layout with a single project setup, but it is not recommended:
    if different data contributors submit similar data types, you end up with a
    proliferation of folders per contributor and data type.
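
The two rules for annotated dataset folders in step 3 (unique names, no nesting inside another dataset folder) can be checked before any annotations are applied. The sketch below works on plain path strings and does not talk to Synapse; the function name ``check_dataset_folders`` is hypothetical, and how you obtain the list of annotated folder paths is left to you.

```python
from collections import Counter
from pathlib import PurePosixPath


def check_dataset_folders(paths):
    """Return a list of problems found among annotated dataset folder paths.

    Hypothetical helper: `paths` are project-relative paths of the folders
    you intend to annotate with contentType: dataset.
    """
    problems = []
    parsed = [PurePosixPath(p) for p in paths]

    # Rule 1: dataset folder names must be unique across the project.
    counts = Counter(p.name for p in parsed)
    for name, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"duplicate dataset folder name: {name}")

    # Rule 2: a dataset folder must not sit inside another dataset folder.
    for child in parsed:
        for parent in parsed:
            if parent != child and parent in child.parents:
                problems.append(f"{child} is nested inside dataset folder {parent}")
    return problems


# The disallowed layout from step 3: experiment_1 appears twice.
for problem in check_dataset_folders([
    "biospecimen/experiment_1",
    "biospecimen/experiment_2",
    "clinical/experiment_1",
    "clinical/batch_2",
]):
    print(problem)
```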

Synapse External Cloud Buckets Setup
------------------------------------

If DCC contributors require external cloud buckets, select one of the following configurations. For more information on how to
set this up on Synapse, view this documentation: https://help.synapse.org/docs/Custom-Storage-Locations.2048327803.html

1. **Basic External Storage Bucket (Default)**:

- Create an S3 bucket for Synapse uploads via web or CLI. Contributors will upload data without needing AWS credentials.
- Provision an S3 bucket, attach it to the Synapse project, and create folders for specific assay types.

2. **Custom Storage Location**:

This is an advanced setup for users that do not want to upload files directly via the Synapse API, but rather
create pointers to the data.

- For large datasets or if contributors prefer cloud storage, enable uploads via AWS CLI or GCP CLI.
- Configure the custom storage location with an AWS Lambda or Google Cloud function for syncing.
- If using AWS, provision a bucket, set up Lambda sync, and assign IAM write access.
- For GCP, use Google Cloud function sync and obtain contributor emails for access.

Finally, set up a ``synapse-service-lambda`` account for syncing external cloud buckets with Synapse, granting "Edit & Delete" permissions on the contributor's project.
39 changes: 39 additions & 0 deletions docs/source/cli_reference.rst
@@ -2,6 +2,45 @@
CLI Reference
=============

When you use this tool, the ``-d`` flag refers to the Synapse ID of a folder, found under the Files tab,
that contains a manifest and data. This folder is a "Top Level Folder". Providing a ``dataset_id`` is not required,
but if you want to pull existing annotations with the ``-a`` flag and the manifest is file-based, you
must provide a ``dataset_id``.


Generate a new manifest as a Google Sheet
-----------------------------------------


.. code-block:: shell

schematic manifest -c /path/to/config.yml get -dt <your data type> -s

Generate an existing manifest from Synapse
------------------------------------------

.. code-block:: shell

schematic manifest -c /path/to/config.yml get -dt <your data type> -d <your synapse "Top Level Folder" folder id> -s

Validate a manifest
-------------------

.. code-block:: shell

schematic model -c /path/to/config.yml validate -dt <your data type> -mp <your csv manifest path>

Submit a manifest as a file
---------------------------

.. code-block:: shell

schematic model -c /path/to/config.yml submit -mp <your csv manifest path> -d <your synapse "Top Level Folder" id> -vc <your data type> -mrt file_only
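
If you drive these commands from a script, building the argument list in Python avoids shell-quoting mistakes in paths. The sketch below only assembles the ``validate`` and ``submit`` invocations shown above, using the same flags; the helper names are hypothetical, and nothing is executed.

```python
import shlex


def validate_command(config, data_type, manifest_path):
    """Argument list for validating a CSV manifest (flags as documented above)."""
    return ["schematic", "model", "-c", config,
            "validate", "-dt", data_type, "-mp", manifest_path]


def submit_command(config, manifest_path, dataset_id, data_type):
    """Argument list for submitting a manifest as a file (flags as documented above)."""
    return ["schematic", "model", "-c", config,
            "submit", "-mp", manifest_path, "-d", dataset_id,
            "-vc", data_type, "-mrt", "file_only"]


# shlex.join renders a copy-pasteable command line with safe quoting.
argv = validate_command("config.yml", "Biospecimen", "manifests/my manifest.csv")
print(shlex.join(argv))
```

To actually run a command, the list can be passed to ``subprocess.run(argv, check=True)``, which requires ``schematic`` to be installed and on your ``PATH``.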


In depth guide
--------------

.. click:: schematic.__main__:main
:prog: schematic
:nested: full
14 changes: 11 additions & 3 deletions docs/source/conf.py
@@ -13,6 +13,8 @@
import os
import sys

import sphinx_rtd_theme

file_dir = os.path.dirname(__file__)
sys.path.append(file_dir)
import pathlib
@@ -27,7 +29,7 @@

toml_metadata = _parse_toml(toml_file_path)
project = toml_metadata["name"]
copyright = "2022, Sage Bionetworks"
copyright = "2024, Sage Bionetworks"

author = toml_metadata["authors"]

@@ -40,7 +42,7 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ["sphinx_click"]
extensions = ["sphinx_click", "sphinx_rtd_theme"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
@@ -57,15 +59,21 @@
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []

# The master toctree document.
master_doc = "index"

# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "alabaster"
html_theme = "sphinx_rtd_theme"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

html_theme_options = {
"collapse_navigation": False,
}
86 changes: 86 additions & 0 deletions docs/source/configuration.rst
@@ -0,0 +1,86 @@
.. _configuration:

Configure Schematic
===================

This is an example config for Schematic. Every value listed is the default that applies when no config is used. Remove any fields in the config you don't want to change.
If you remove all fields from a section, remove the entire section, including its header.
Change the values of any fields you do want to change. See the installation section for details on how to set some of this up.

.. code-block:: yaml

    # This describes where assets such as manifests are stored
    asset_store:
      # This is when assets are stored in a synapse project
      synapse:
        # Synapse ID of the file view listing all project data assets.
        master_fileview_id: "syn23643253"
        # Path to the synapse config file, either absolute or relative to this file
        config: ".synapseConfig"
        # Base name that manifest files will be saved as
        manifest_basename: "synapse_storage_manifest"

    # This describes information about manifests as it relates to generation and validation
    manifest:
      # Location where manifests will be saved to
      manifest_folder: "manifests"
      # Title or title prefix given to generated manifest(s)
      title: "example"
      # Data types of manifests to be generated or data type (singular) to validate manifest against
      data_type:
        - "Biospecimen"
        - "Patient"

    # Describes the location of your schema
    model:
      # Location of your schema jsonld, it must be a path relative to this file or absolute
      location: "tests/data/example.model.jsonld"

    # This section is for using google sheets with Schematic
    google_sheets:
      # Path to the google service account creds, either absolute or relative to this file
      service_acct_creds: "schematic_service_account_creds.json"
      # When doing google sheet validation (regex match) with the validation rules.
      # true is alerting the user and not allowing entry of bad values.
      # false is warning but allowing the entry on to the sheet.
      strict_validation: true
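
Because omitted fields keep their defaults, a config file only needs the values you change. The sketch below models that behaviour as a recursive merge of user overrides onto a defaults dictionary. It is an illustration of the idea, not Schematic's own loader; ``DEFAULTS`` copies a subset of the values above and ``merge_config`` is a hypothetical helper.

```python
import copy

# A subset of the default values listed in the example config above.
DEFAULTS = {
    "asset_store": {
        "synapse": {
            "master_fileview_id": "syn23643253",
            "config": ".synapseConfig",
            "manifest_basename": "synapse_storage_manifest",
        }
    },
    "manifest": {
        "manifest_folder": "manifests",
        "title": "example",
        "data_type": ["Biospecimen", "Patient"],
    },
}


def merge_config(defaults, overrides):
    """Return a new dict: defaults with overrides applied section by section."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested section: merge field by field so unset fields keep defaults.
            merged[key] = merge_config(merged[key], value)
        else:
            # Leaf value (string, list, bool): the override replaces the default.
            merged[key] = copy.deepcopy(value)
    return merged


# A user config that changes only the manifest title.
effective = merge_config(DEFAULTS, {"manifest": {"title": "my_study"}})
print(effective["manifest"])
```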


The rest of this document explains what each of these settings means.

Asset Store
-----------

Synapse
~~~~~~~
This describes where assets such as manifests are stored. The setup of the asset store itself is described
in the asset store section.

* master_fileview_id: Synapse ID of the file view listing all project data assets.
* config: Path to the Synapse config file, either absolute or relative to this file. Note: if you use the ``synapse config`` command, you will have to provide the full path to the configuration file.
* manifest_basename: Base name that manifest files will be saved as on Synapse. The Component is appended to it, for example: ``synapse_storage_manifest_biospecimen.csv``
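
As a small illustration of the naming pattern, the function below rebuilds the example file name from a base name and a Component. Lowercasing the Component and joining with an underscore are assumptions inferred from the single example above, and ``manifest_filename`` is a hypothetical helper rather than a Schematic function.

```python
def manifest_filename(basename, component):
    """Build a manifest file name following the pattern in the example above.

    Assumption: the Component is lowercased and joined with an underscore,
    as in synapse_storage_manifest_biospecimen.csv.
    """
    return f"{basename}_{component.lower()}.csv"


print(manifest_filename("synapse_storage_manifest", "Biospecimen"))
```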

Manifest
--------
This describes information about manifests as it relates to generation and validation. Note: some of these settings can be overridden by the CLI commands.

* manifest_folder: Location where manifests will be saved. This can be a relative or absolute path on your local machine.
* title: Title or title prefix given to generated manifest(s). This is used to name the manifest file saved locally.
* data_type: Data types of manifests to be generated, or the data type (singular) to validate a manifest against. If you want all the available manifests, you can input "all manifests".


Model
-----
Describes the location of your schema.

* location: The location of your schema jsonld. It must be an absolute path or a path relative to this file. URLs are currently NOT supported, so you will have to download the jsonld data model. Here is an example: https://raw.githubusercontent.com/ncihtan/data-models/v24.9.1/HTAN.model.jsonld

Google Sheets
-------------
Schematic leverages the Google API to generate manifests. This section is for using Google Sheets with Schematic.

* service_acct_creds: Path to the Google service account credentials, either absolute or relative to this file. This is the credentials file that you download from Google Cloud Platform.
* strict_validation: Controls how Google Sheet validation (regex match) applies the validation rules.

  * True alerts the user and does not allow entry of bad values.
  * False warns the user but allows the entry onto the sheet.