Commit 5f3a296

mostly linting, a little cleanup (#197)
1 parent a4cb32a commit 5f3a296

11 files changed: +123 -62 lines changed

documentation/DCP-documentation/AWS_hygiene_scripts.md

+2 -1

````diff
@@ -23,7 +23,8 @@ while True:
 alarms = client.describe_alarms(AlarmTypes=['MetricAlarm'],StateValue='INSUFFICIENT_DATA',NextToken=token)
 ```

-# Clean out old log groups
+## Clean out old log groups
+
 Bash:

 ```sh
````

documentation/DCP-documentation/advanced_configuration.md

+8 -5

````diff
@@ -4,6 +4,7 @@ We've tried very hard to make Distributed-CellProfiler light and adaptable, but
 Below is a non-comprehensive list of places where you can adapt the code to your own purposes.

 ***
+
 ## Changes you can make to Distributed-CellProfiler outside of the Docker container

 * **Location of ECS configuration files:** By default these are placed into your bucket with a prefix of 'ecsconfigs/'.
@@ -29,14 +30,16 @@ This value can be modified in run.py .
 * **Distributed-CellProfiler version:** At least CellProfiler version 4.2.4, and use the DOCKERHUB_TAG in config.py as `bethcimini/distributed-cellprofiler:2.1.0_4.2.4_plugins`.
 * **Custom model: If using a [custom User-trained model](https://cellpose.readthedocs.io/en/latest/models.html) generated using Cellpose, add the model file to S3.
 We use the following structure to organize our files on S3.
-```
+
+```text
 └── <project_name>
    └── workspace
      └── model
       └── custom_model_filename
 ```
-* **RunCellpose module:**
-* Inside RunCellpose, select the "custom" Detection mode.
-In "Location of the pre-trained model file", enter the mounted bucket path to your model.
+
+* **RunCellpose module:**
+* Inside RunCellpose, select the "custom" Detection mode.
+In "Location of the pre-trained model file", enter the mounted bucket path to your model.
 e.g. **/home/ubuntu/bucket/projects/<project_name>/workspace/model/**
-* In "Pre-trained model file name", enter your custom_model_filename
+* In "Pre-trained model file name", enter your custom_model_filename
````

documentation/DCP-documentation/external_buckets.md

+18 -8

````diff
@@ -1,5 +1,6 @@
 # Using External Buckets
-Distributed-CellProfiler can read and/or write to/from an external S3 bucket (i.e. a bucket not in the same account as you are running DCP).
+
+Distributed-CellProfiler can read and/or write to/from an external S3 bucket (i.e. a bucket not in the same account as you are running DCP).
 To do so, you will need to appropriately set your configuration in run.py.
 You may need additional configuration in AWS Identity and Access Management (IAM).

@@ -21,42 +22,50 @@ If you don't need to add UPLOAD_FLAGS, keep it as the default ''.
 ## Example configs

 ### Reading from the Cell Painting Gallery
-```
+
+```python
 AWS_REGION = 'your-region' # e.g. 'us-east-1'
 AWS_PROFILE = 'default' # The same profile used by your AWS CLI installation
 SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
 AWS_BUCKET = 'bucket-name' # Your bucket
 SOURCE_BUCKET = 'cellpainting-gallery'
+WORKSPACE_BUCKET = 'bucket-name' # Likely your bucket
 DESTINATION_BUCKET = 'bucket-name' # Your bucket
 UPLOAD_FLAGS = ''
 ```

 ### Read/Write to a collaborator's bucket
-```
+
+```python
 AWS_REGION = 'your-region' # e.g. 'us-east-1'
 AWS_PROFILE = 'role-permissions' # A profile with the permissions setup described above
 SSH_KEY_NAME = 'your-key-file.pem' # Expected to be in ~/.ssh
 AWS_BUCKET = 'bucket-name' # Your bucket
 SOURCE_BUCKET = 'collaborator-bucket'
+WORKSPACE_BUCKET = 'collaborator-bucket'
 DESTINATION_BUCKET = 'collaborator-bucket'
-UPLOAD_FLAGS = '--acl bucket-owner-full-control --metadata-directive REPLACE'
+UPLOAD_FLAGS = '--acl bucket-owner-full-control --metadata-directive REPLACE' # Examples of flags that may be necessary
 ```

 ## Permissions setup
+
 If you are reading from a public bucket, no additional setup is necessary.
+Note that, depending on the configuration of that bucket, you may not be able to mount the public bucket so you will need to set `DOWNLOAD_FILES='True'`.

-If you are reading from a non-public bucket or writing to a bucket, you wil need further permissions setup.
+If you are reading from a non-public bucket or writing to a bucket that is not yours, you wil need further permissions setup.
 Often, access to someone else's AWS account is handled through a role that can be assumed.
 Learn more about AWS IAM roles [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html).
 Your collaborator will define the access limits of the role within their AWS IAM.
 You will also need to define role limits within your AWS IAM so that when you assume the role (giving you access to your collaborator's resource), that role also has the appropriate permissions to run DCP.

 ### In your AWS account
+
 In AWS IAM, for the role that has external bucket access, you will need to add all of the DCP permissions described in [Step 0](step_0_prep.md).

-You will also need to edit the trust relationship for the role so that ECS and EC2 can assume the role.
+You will also need to edit the trust relationship for the role so that ECS and EC2 can assume the role.
 A template is as follows:
-```
+
+```json
 {
 "Version": "2012-10-17",
 "Statement": [
@@ -80,6 +89,7 @@ A template is as follows:
 ```

 ### In your DCP instance
+
 DCP reads your AWS_PROFILE from your [control node](step_0_prep.md#the-control-node).
 Edit your AWS CLI configuration files for assuming that role in your control node as follows:

@@ -95,4 +105,4 @@ In `~/.aws/credentials`, copy in the following text block at the bottom of the f

 [my-account-profile]
 aws_access_key_id = ACCESS_KEY
-aws_secret_access_key = SECRET_ACCESS_KEY
+aws_secret_access_key = SECRET_ACCESS_KEY
````
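
A quick way to confirm the assumed role can actually reach the collaborator's bucket is to list it with the role-assuming profile; this is a suggested sanity check, with the bucket and profile names taken from the example config above as placeholders.

```sh
# Hypothetical check that the assumed role can reach the external bucket
aws s3 ls s3://collaborator-bucket/ --profile role-permissions
```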

documentation/DCP-documentation/overview.md

+2

````diff
@@ -3,6 +3,7 @@
 **How do I run CellProfiler on Amazon?** Use Distributed-CellProfiler!

 Distributed-CellProfiler is a series of scripts designed to help you run a Dockerized version of CellProfiler on [Amazon Web Services](https://aws.amazon.com/) (AWS) using AWS's file storage and computing systems.
+
 * Data is stored in S3 buckets.
 * Software is run on "Spot Fleets" of computers (or instances) in the cloud.

@@ -12,6 +13,7 @@ Docker is a software platform that packages software into containers.
 In a container is the software that you want to run as well as everything needed to run it (e.g. your software source code, operating system libraries, and dependencies).

 Dockerizing a workflow has many benefits including
+
 * Ease of use: Dockerized software doesn't require the user to install anything themselves.
 * Reproducibility: You don't need to worry about results being affected by the version of your software or its dependencies being used as those are fixed.
````

documentation/DCP-documentation/overview_2.md

+12 -7

````diff
@@ -1,4 +1,4 @@
-## What happens in AWS when I run Distributed-CellProfiler?
+# What happens in AWS when I run Distributed-CellProfiler?

 The steps for actually running the Distributed-CellProfiler code are outlined in the repository [README](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/README.md), and details of the parameters you set in each step are on their respective Documentation pages ([Step 1: Config](step_1_configuration.md), [Step 2: Jobs](step_2_submit_jobs.md), [Step 3: Fleet](step_3_start_cluster.md), and optional [Step 4: Monitor](step_4_monitor.md)).
 We'll give an overview of what happens in AWS at each step here and explain what AWS does automatically once you have it set up.
@@ -8,6 +8,7 @@ We'll give an overview of what happens in AWS at each step here and explain what
 **Step 1**:
 In the Config file you set quite a number of specifics that are used by EC2, ECS, SQS, and in making Dockers.
 When you run `$ python3 run.py setup` to execute the Config, it does three major things:
+
 * Creates task definitions.
 These are found in ECS.
 They define the configuration of the Dockers and include the settings you gave for **CHECK_IF_DONE_BOOL**, **DOCKER_CORES**, **EXPECTED_NUMBER_FILES**, and **MEMORY**.
@@ -25,6 +26,7 @@ In the Config file you set the number and size of the EC2 instances you want.
 This information, along with account-specific configuration in the Fleet file is used to start the fleet with `$ python3 run.py startCluster`.

 **After these steps are complete, a number of things happen automatically**:
+
 * ECS puts Docker containers onto EC2 instances.
 If there is a mismatch within your Config file and the Docker is larger than the instance it will not be placed.
 ECS will keep placing Dockers onto an instance until it is full, so if you accidentally create instances that are too large you may end up with more Dockers placed on it than intended.
@@ -59,6 +61,7 @@ Read more about this and other configurations in [Step 1: Configuration](step_1_
 ## How do I determine my configuration?

 To some degree, you determine the best configuration for your needs through trial and error.
+
 * Looking at the resources your software uses on your local computer when it runs your jobs can give you a sense of roughly how much hard drive and memory space each job requires, which can help you determine your group size and what machines to use.
 * Prices of different machine sizes fluctuate, so the choice of which type of machines to use in your spot fleet is best determined at the time you run it.
 How long a job takes to run and how quickly you need the data may also affect how much you're willing to bid for any given machine.
@@ -67,12 +70,14 @@ However, you're also at a greater risk of running out of hard disk space.

 Keep an eye on all of the logs the first few times you run any workflow and you'll get a sense of whether your resources are being utilized well or if you need to do more tweaking.

-## What does this look like on AWS?
+## What does this look like on AWS?
+
 The following five are the primary resources that Distributed-CellProfiler interacts with.
 After you have finished [preparing for Distributed-CellProfiler](step_0_prep), you do not need to directly interact with any of these services outside of Distributed-CellProfiler.
 If you would like a granular view of what Distributed-CellProfiler is doing while it runs, you can open each console in a separate tab in your browser and watch their individual behaviors, though this is not necessary, especially if you run the [monitor command](step_4_monitor.md) and/or have DS automatically create a Dashboard for you (see [Configuration](step_1_configuration.md)).
-* [S3 Console](https://console.aws.amazon.com/s3)
-* [EC2 Console](https://console.aws.amazon.com/ec2/)
-* [ECS Console](https://console.aws.amazon.com/ecs/)
-* [SQS Console](https://console.aws.amazon.com/sqs/)
-* [CloudWatch Console](https://console.aws.amazon.com/cloudwatch/)
+
+* [S3 Console](https://console.aws.amazon.com/s3)
+* [EC2 Console](https://console.aws.amazon.com/ec2/)
+* [ECS Console](https://console.aws.amazon.com/ecs/)
+* [SQS Console](https://console.aws.amazon.com/sqs/)
+* [CloudWatch Console](https://console.aws.amazon.com/cloudwatch/)
````
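
Taken together, the Step 1-4 commands referenced in this page generally run in the order sketched below; the subcommand arguments and file names are assumptions for illustration only, so check the repository README for the exact invocation.

```sh
# Illustrative order of operations; job and fleet file names are hypothetical
python3 run.py setup                                           # Step 1: execute the Config (creates task definitions, among other things)
python3 run.py submitJob files/exampleJob.json                 # Step 2: send jobs to the SQS queue
python3 run.py startCluster files/exampleFleet.json            # Step 3: request the EC2 spot fleet
python3 run.py monitor files/APP_NAMESpotFleetRequestId.json   # Step 4 (optional): watch progress and clean up
```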

documentation/DCP-documentation/passing_files_to_DCP.md

+12 -11

````diff
@@ -4,12 +4,13 @@ Distributed-CellProfiler can be told what files to use through LoadData.csv, Bat

 ## Metadata use in DCP

-Distributed-CellProfiler requires metadata and grouping in order to split jobs.
-This means that, unlikely a generic CellProfiler workflow, the inclusion of metadata and grouping are NOT optional for pipelines you wish to use in Distributed-CellProfiler.
-- If using LoadData, this means ensuring that your input CSV has some metadata to use for grouping and "Group images by metdata?" is set to "Yes".
-- If using batch files or file lists, this means ensuring that the Metadata and Groups modules are enabled, and that you are extracting metadata from file and folder names _that will also be present in your remote system_ in the Metadata module in your CellProfiler pipeline.
-You can pass additional metadata to CellProfiler by `Add another extraction method`, setting the method to `Import from file` and setting Metadata file location to `Default Input Folder`.
-Metadata of either type can be used for grouping.
+Distributed-CellProfiler requires metadata and grouping in order to split jobs.
+This means that, unlikely a generic CellProfiler workflow, the inclusion of metadata and grouping are NOT optional for pipelines you wish to use in Distributed-CellProfiler.
+
+- If using LoadData, this means ensuring that your input CSV has some metadata to use for grouping and "Group images by metdata?" is set to "Yes".
+- If using batch files or file lists, this means ensuring that the Metadata and Groups modules are enabled, and that you are extracting metadata from file and folder names _that will also be present in your remote system_ in the Metadata module in your CellProfiler pipeline.
+You can pass additional metadata to CellProfiler by `Add another extraction method`, setting the method to `Import from file` and setting Metadata file location to `Default Input Folder`.
+Metadata of either type can be used for grouping.

 ## Load Data

@@ -25,14 +26,14 @@ Some users have reported issues with using relative paths in the PathName column
 You can create this CSV yourself via your favorite scripting language.
 We maintain a script for creating LoadData.csv from Phenix metadata XML files called [pe2loaddata](https://github.com/broadinstitute/pe2loaddata).

-You can also create the LoadData.csv in a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups.
+You can also create the LoadData.csv in a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups.
 More written and video information about using the input modules can be found [here](broad.io/CellProfilerInput).
 After loading in your images, use the `Export`->`Image Set Listing` command.
 You will then need to replace the local paths with the paths where the files can be found in S3 which is hardcoded to `/home/ubuntu/bucket`.
 If your files are nested in the same structure, this can be done with a simple find and replace in any text editing software.
 (e.g. Find '/Users/eweisbar/Desktop' and replace with '/home/ubuntu/bucket')

-More detail: The [Dockerfile](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/worker/Dockerfile) is the first script to execute in the Docker.
+More detail: The [Dockerfile](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/worker/Dockerfile) is the first script to execute in the Docker.
 It creates the `/home/ubuntu/` folder and then executes [run_worker.sh](https://github.com/DistributedScience/Distributed-CellProfiler/blob/master/worker/run-worker.sh) from that point.
 run_worker.sh makes `/home/ubuntu/bucket/` and uses S3FS to mount your S3 bucket at that location. (If you set `DOWNLOAD_FILES='True'` in your [config](step_1_configuration.md), then the S3FS mount is bypassed but files are downloaded locally to the `/home/ubuntu/bucket` path so that the paths are the same as if it was S3FS mounted.)

@@ -53,7 +54,7 @@ To use a batch file, your data needs to have the same structure in the cloud as

 ### Creating batch files

-To create a batch file, load all your images into a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups.
+To create a batch file, load all your images into a local copy of CellProfiler using the standard input modules of Images, Metadata, NamesAndTypes and Groups.
 More written and video information about using the input modules can be found [here](broad.io/CellProfilerInput).
 Put the `CreateBatchFiles` module at the end of your pipeline and ensure that it is selected.
 Add a path mapping and edit the `Local root path` and `Cluster root path`.
@@ -71,8 +72,8 @@ Note that if you do not follow our standard file organization, under **#not proj

 ## File lists

-You can also simply pass a list of absolute file paths (not relative paths) with one file per row in `.txt` format.
-These must be the absolute paths that Distributed-CellProfiler will see, aka relative to the root of your bucket (which will be mounted as `/bucket`.
+You can also simply pass a list of absolute file paths (not relative paths) with one file per row in `.txt` format.
+These must be the absolute paths that Distributed-CellProfiler will see, aka relative to the root of your bucket (which will be mounted as `/bucket`.

 ### Creating File Lists
````
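
For orientation, a minimal LoadData.csv that meets the metadata and grouping requirement described in the first hunk might look like the sketch below; the channel name, plate, wells, and paths are hypothetical, but note the `/home/ubuntu/bucket` prefix the docs call for.

```text
Metadata_Plate,Metadata_Well,FileName_OrigDNA,PathName_OrigDNA
Plate1,A01,plate1_A01_ch1.tiff,/home/ubuntu/bucket/projects/example_project/images/Plate1
Plate1,A02,plate1_A02_ch1.tiff,/home/ubuntu/bucket/projects/example_project/images/Plate1
```

With a CSV like this, grouping on Metadata_Plate (or Metadata_Plate plus Metadata_Well) gives Distributed-CellProfiler the units it splits into jobs.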
