Commit d15c5b7

Merge pull request #35 from Sage-Bionetworks-Workflows/ORCA-297-update-demo
[ORCA-297] Fixing tutorial demo (`demo.py` and documentation)
2 parents 433e7ae + 0517118

File tree: 2 files changed (+21 additions, -11 deletions)

README.md

Lines changed: 10 additions & 3 deletions
```diff
@@ -13,7 +13,7 @@ This Python package provides the components to connect various third-party servi
 
 ## Demonstration Script
 
-This repository includes a demonstration script called [`demo.py`](demo.py), which showcases how you can use `py-orca` to launch and monitor your workflows on Nextflow Tower. Specifically, it illustrates how to process an RNA-seq dataset using a series of workflow runs, namely `nf-synstage`, `nf-core/rnaseq`, and `nf-synindex`. `py-orca` can be used with any Python-compatible workflow management system to orchestrate each step (_e.g._ Airflow, Prefect, Dagster). The demonstration script uses [Metaflow](https://metaflow.org/) because it's easy to run locally and has an intuitive syntax.
+This repository includes a demonstration script called [`demo.py`](demo.py), which showcases how you can use `py-orca` to launch and monitor your workflows on Nextflow Tower. Specifically, it illustrates how to process an RNA-seq dataset using a series of workflow runs, namely `nf-synapse/synstage`, `nf-core/rnaseq`, and `nf-synindex`. `py-orca` can be used with any Python-compatible workflow management system to orchestrate each step (_e.g._ Airflow, Prefect, Dagster). The demonstration script uses [Metaflow](https://metaflow.org/) because it's easy to run locally and has an intuitive syntax.
 
 The script assumes that the following environment variables are set. Before setting them up, ensure that you have an AWS profile configured for a role that has access to the dev/ops tower workspace you plan to launch your workflows from. You can set these environment variables using whatever method you prefer (_e.g._ using an `.env` file, sourcing a shell script, etc).
 Refer to [`.env.example`](.env.example) for the format of their values as well as examples.
```
```diff
@@ -22,7 +22,7 @@ Refer to [`.env.example`](.env.example) for the format of their values as well a
 - `SYNAPSE_CONNECTION_URI`
 - `AWS_PROFILE` (or another source of AWS credentials)
 
-Once your environment is set, you can create a virtual environment, install the Python dependencies, and run the demonstration script (after downloading it) as follows. Note that you will need to update the `s3_prefix` parameter so that it points to an S3 bucket that is accessible to your Tower workspace.
+Once your environment variables are set, you can create a virtual environment, install the Python dependencies, and run the demonstration script (after downloading it) as follows. Note that you will need to update the `s3_prefix` parameter so that it points to an S3 bucket that is accessible to your Tower workspace.
 
 ### Creating and setting up your `py-orca` virtual environment and executing `demo.py`
 
```
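Since missing credentials are the most common first-run failure, it can help to sanity-check the environment before launching anything. A minimal sketch (not part of `demo.py`; it covers only the two variables shown in the hunk above, so extend it with any others listed in [`.env.example`](.env.example)):

```python
import os
import sys

# Required variables named in the README hunk above; extend per .env.example.
REQUIRED = ["SYNAPSE_CONNECTION_URI", "AWS_PROFILE"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
```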
````diff
@@ -34,11 +34,18 @@ source venv/bin/activate
 
 # Install Python dependencies
 python3 -m pip install 'py-orca[all]' 'metaflow' 'pyyaml' 's3fs'
+```
+
+Before running the example below, ensure that the `s3_prefix` points to an S3 bucket that your Nextflow `dev`
+or `prod` Tower workspace has access to. In the example below, we point to the `example-dev-project-tower-scratch` S3 bucket because we will be launching our workflows within the
+`example-dev-project` workspace in `tower-dev`.
+```bash
 # Run the script using an example dataset
-python3 demo.py run --dataset_id 'syn51514585' --s3_prefix 's3://orca-service-test-project-tower-bucket/outputs'
+python3 demo.py run --dataset_id 'syn51514585' --s3_prefix 's3://example-dev-project-tower-scratch/work'
 ```
 
+Once your run takes off, you can follow the output logs in your terminal or stay updated on your workflow's progress in the web client. Be sure that your `synstage` workflow run has a unique name and is not an iteration of a previous run (e.g. `my_test_dataset_synstage_2`, `my_test_dataset_synstage_3`, and so on); `demo.py` cannot currently locate the staged samplesheet if it was staged under a non-unique run name.
+
 The above dataset ID ([`syn51514585`](https://www.synapse.org/#!Synapse:syn51514585)) refers to the following YAML file, which should be accessible to Sage employees. Similarly, the samplesheet ID below ([`syn51514475`](https://www.synapse.org/#!Synapse:syn51514475)) should also be accessible to Sage employees. However, there is no secure way to make the output folder accessible to Sage employees, so the `synindex` step will fail if you attempt to run this script using the example dataset ID. This should be sufficient to get a feel for using `py-orca`, but feel free to create your own dataset YAML file on Synapse with an output folder that you own.
 
 ```yaml
````
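One simple way to honor the unique-run-name caveat above is to fold a timestamp into the dataset ID before it reaches `get_run_name`. A hypothetical helper, not part of `demo.py`:

```python
from datetime import datetime, timezone

def unique_run_name(dataset_id: str, suffix: str = "synstage") -> str:
    """Return e.g. 'my_test_dataset_20240101T120000Z_synstage' so that each
    run (and hence its staged samplesheet prefix) is distinct."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{dataset_id}_{stamp}_{suffix}"
```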
demo.py

Lines changed: 11 additions & 8 deletions
```diff
@@ -40,13 +40,14 @@ def get_run_name(self, suffix: str) -> str:
         return f"{self.id}_{suffix}"
 
     def synstage_info(self, samplesheet_uri: str) -> LaunchInfo:
-        """Generate LaunchInfo for nf-synstage."""
+        """Generate LaunchInfo for nf-synapse/synstage."""
         run_name = self.get_run_name("synstage")
         return LaunchInfo(
             run_name=run_name,
-            pipeline="Sage-Bionetworks-Workflows/nf-synstage",
+            pipeline="Sage-Bionetworks-Workflows/nf-synapse",
             revision="main",
-            profiles=["sage"],
+            profiles=["docker"],
+            entry_name="NF_SYNSTAGE",
             params={
                 "input": samplesheet_uri,
             },
```
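With this change, `synstage_info` targets the consolidated `nf-synapse` repository and uses `entry_name` to select its synstage entry workflow. For a dataset with ID `my_test_dataset`, the resulting launch request would look roughly like this (a sketch; the S3 URI is illustrative, and the `LaunchInfo` import path is assumed to match `demo.py`'s own imports):

```python
from orca.services.nextflowtower import LaunchInfo  # import path assumed

launch_info = LaunchInfo(
    run_name="my_test_dataset_synstage",  # from get_run_name("synstage")
    pipeline="Sage-Bionetworks-Workflows/nf-synapse",
    revision="main",
    profiles=["docker"],
    entry_name="NF_SYNSTAGE",  # entry workflow within the nf-synapse repo
    params={"input": "s3://example-dev-project-tower-scratch/work/samplesheet.csv"},
)
```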
```diff
@@ -124,13 +125,13 @@ class TowerRnaseqFlow(FlowSpec):
         help="S3 prefix for storing output files from different runs",
     )
 
-    def get_staged_samplesheet(self, samplesheet: str) -> str:
+    def get_staged_samplesheet(self, samplesheet: str, run_name: str) -> str:
         """Generate staged samplesheet based on synstage behavior."""
         scheme, _, samplesheet_resource = samplesheet.partition("://")
         if scheme != "s3":
             raise ValueError("Expected an S3 URI.")
         path = PurePosixPath(samplesheet_resource)
-        return f"{scheme}://{path.parent}/synstage/{path.name}"
+        return f"{scheme}://{path.parent}/synstage/{run_name}/{path.name}"
 
     def monitor_workflow(self, workflow_id):
         """Monitor any workflow run (wait until done)."""
```
```diff
@@ -171,21 +172,23 @@ def transfer_samplesheet_to_s3(self):
 
     @step
     def launch_synstage(self):
-        """Launch nf-synstage to stage Synapse files in samplesheet."""
+        """Launch nf-synapse/synstage to stage Synapse files in samplesheet."""
         launch_info = self.dataset.synstage_info(self.samplesheet_uri)
         self.synstage_id = self.tower.launch_workflow(launch_info, "spot")
         self.next(self.monitor_synstage)
 
     @step
     def monitor_synstage(self):
-        """Monitor nf-synstage workflow run (wait until done)."""
+        """Monitor nf-synapse/synstage workflow run (wait until done)."""
         self.monitor_workflow(self.synstage_id)
         self.next(self.launch_rnaseq)
 
     @step
     def launch_rnaseq(self):
         """Launch nf-core/rnaseq workflow to process RNA-seq data."""
-        staged_uri = self.get_staged_samplesheet(self.samplesheet_uri)
+        staged_uri = self.get_staged_samplesheet(
+            self.samplesheet_uri, self.dataset.get_run_name("synstage")
+        )
         launch_info = self.dataset.rnaseq_info(staged_uri, self.rnaseq_outdir)
         self.rnaseq_id = self.tower.launch_workflow(launch_info, "spot")
         self.next(self.monitor_rnaseq)
```
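These steps follow Metaflow's linear flow pattern: each `@step` does its work, stores results on `self`, and hands off with `self.next`. A stripped-down sketch of the same launch-then-monitor chain (a hypothetical flow for illustration, not `demo.py` itself):

```python
from metaflow import FlowSpec, step

class LaunchMonitorFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.launch)

    @step
    def launch(self):
        # demo.py does: self.tower.launch_workflow(launch_info, "spot")
        self.workflow_id = "hypothetical-workflow-id"
        self.next(self.monitor)

    @step
    def monitor(self):
        # demo.py blocks here via monitor_workflow() until the run completes
        print(f"monitoring {self.workflow_id}")
        self.next(self.end)

    @step
    def end(self):
        print("done")

if __name__ == "__main__":
    LaunchMonitorFlow()
```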
