The step in this section should only have to be performed once. toil-box
images will have to be updated following major releases of Toil that contain
critical features or bug fixes, or at the direction of the Toil development
team.
Comprehensive help can be found at CGCloud. The following steps are provided for reference but it is recommended to read the source documentation to gain full understanding on how to use CGCloud.
virtualenv ~/cgcloudsource cgcloud/bin/activatepip install cgcloud-toil==1.5.5deactivateln -snf ~/cgcloud/bin/activate ~/bin/- Configure the CGCLoud environment in
~/.profileas explained in the CGCloud Core README, specificallyexport CGCLOUD_ZONE=us-west-2aexport CGCLOUD_PLUGINS="cgcloud.toil:$CGCLOUD_PLUGINS"export CGCLOUD_KEYPAIRS="__me__ @@developers"
- Finally, create the image:
cgcloud create -IT toil-box
Launch the leader instance first, and grow the cluster by adding workers after starting the pipeline.
The directory passed to the --share option—keydir in this example—should
contain the master key if encryption is being used:
cgcloud create-cluster --leader-instance-type m3.medium \
--share=key_dir/ \
--num-workers 0 \
--cluster-name example \
toil
SSH into leader, install toil-scripts, setup config and manifest, grow cluster, and start run.
-
cgcloud ssh --cluster-name example toil-leader -
virtualenv --system-site-packages toil-scripts -
source toil-scripts/bin/activate -
pip install toil-scripts==2.0.11 -
Generate a config and manifest file, e.g.
toil-rnaseq generate. These files will be generated in the current working directory, and contain instructions on how to fill them in. -
exitso we can grow the cluster, in this case by 22 workers, ideally using spot market. -
cgcloud grow-cluster --cluster-name example --spot-auto-zone --spot-bid 1.00 --instance-type c3.8xlarge --num-workers 22 toil -
cgcloud ssh --cluster-name example toil-leader -
Place boto or AWS credentials on the leader (
~/.botoor~/.aws/credentials)
Start a screen session
screen -S example-run
Activate the virtualenv
source toil-scripts/bin/activate
Run the pipeline:
toil-rnaseq run --retryCount=1 \
--workDir=/var/lib/toil \
--batchSystem=mesos \
--mesosMaster=mesos-master:5050 \
--sseKey=/home/mesosbox/shared/master.key \
aws:us-west-2:jvivian-example-run-1 \
2>&1 | tee 1.log
What do the above options mean:
-
Passing
--retryCount=1ensures that samples that failed for non-deterministic reasons have a chance to retry. A retry count of 1 means that any job will be attempted at most two times, one initial attempt, and one retry. If both attempts fail for one or more jobs in the pipeline, the entire pipeline will fail. You can retry the entire pipeline again, possible after modifying by adding--restartto the above command. -
The
--workDir=/var/lib/toiloption sets the work directory to the RAID-mounted ephemeral drives. This is a workaround for DataBiosphere/toil#1154. Once that is fixed, this option can be removed. -
--batchSystem=mesosand--mesosMaster=mesos-master:5050configures the Mesos batchsystem that must be used on clusters in EC2. -
The
--sseKeyoption 1) enables encryption of intermediate files in Toil's job store and 2) configures the secret key to encrypt those files with. This is needed if the samples are access controlled for patient privacy. The key specified here is also used to 3) decrypt inputs to the pipeline and encrypt its outputs. For this to work, the inputs must have been uploaded with s3am's--sse-key-is-master. -
The final positional argument (
aws:us-west-2:jvivian-example-run-1) specifies the job store location. On EC2, you almost always want to useaws:followed by the region your cluster is in. If you don't, the workflow is going to be slow and additional data transfer charges will be incurred. The part after that (jvivian-example-run-1) is a matter of choice, but be sure to prefix it with your name so we can track which job store belongs to which user.
In the CGCloud grow-cluster command, we added two optional arguments:
--spot-auto-zone and --spot-bid <PRICE>. Clusters can be run not on the
spot market, but it is much cheaper to do so and should be done whenever
possible.
At the EC2 console on AWS, click the Spot Requests link on the left-hand side of the page. Then click on Pricing History and select the instance type you're interested in, which is typically the c3.8xlarge or r3.8xlarge.
A general rule of thumb is to look at a day or a couple days' worth of prices and select a price that is lower than the on-demand price but high enough above the spot market price that your'e not at risk of getting your nodes shut down.
If a pipeline finishes and said there were failed jobs, then investigate the logfile on the leader:
-
less 1.log -
Type the forward slash
/and thenTOILand hit enter. This will take you to the first error log. -
See if you can determine why the sample failed. If the error appears to have a non-deterministic cause, like an HTTP error, then restarting the pipeline is the optimal option.
-
Rerun the
toil-rnaseq run ...command from above, but add the--restartflag and change the log name to2.log.
If the issue looks like it is systemic but based on resources (not enough
memory or not enough disk), then please change the instance type to something
that is appropriate for the size of the sample you are trying to run, or
increase the size of the disk config option.
If the pipeline fails despite a --restart attempt, please copy the entire
error message into a ticket. Include the config and manifest and if possible
the path to the sample that failed.