Merge pull request #33 from pepkit/dev

nsheff · web-flow · commit de1df991d121 · 2019-06-21T11:48:10.000-04:00
0.6.0
diff --git a/.gitignore b/.gitignore
@@ -84,4 +84,6 @@ peppy.egg-info/
 
 #ipynm checkpoints
 *ipynb_checkpoints*
-*.egg-info*
+*.egg-info*
+
+
diff --git a/docs/README.md b/docs/README.md
@@ -2,26 +2,25 @@
 
 [![PEP compatible](http://pepkit.github.io/img/PEP-compatible-green.svg)](http://pepkit.github.io)
 
-`geofetch` is a command-line tool that does two things when given one or more GEO/SRA accessions:
+`geofetch` is a command-line tool that downloads and organizes data and metadata from GEO and SRA. When given one or more GEO/SRA accessions, `geofetch` will:
 
-  - Downloads either raw or processed data from either GEO or SRA
+  - Download either raw or processed data from either GEO or SRA
   - Produces a standardized [PEP](http://pepkit.github.io) sample annotation sheet of public metadata. This makes it really easy to run [looper](https://pepkit.github.io/docs/looper/)-compatible pipelines on public datasets by handling data acquisition and metadata formatting and standardization for you.
 
 You can use it with the [sra_convert](http://github.com/pepkit/sra_convert) pipeline, a [pypiper](http://pypiper.readthedocs.io) pipeline that converts SRA files into BAM files.
 
+## Quick demo
 
-## Installing
+`geofetch` runs on the command line. This command will download the metadata for the given GSE number.
 
-```bash
-pip install geofetch
+```console
+geofetch -i GSE95654
 ```
 
-## Quick start
-
-Now, run it on the command line:
+You can add `--just-metadata` if you don't want to download the raw SRA files.
 
 ```console
-geofetch --help
+geofetch -i GSE95654 --just-metadata
 ```
 
-Next, check out the [usage](usage) reference, or for a detailed walkthrough, head on over to the [tutorial](tutorial).
+For more details, check out the [usage](usage.md) reference, [installation instructions](install.md), or head on over to the [tutorial](tutorial.md) for a detailed walkthrough.
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -1,5 +1,12 @@
 # Changelog
 
+## [Unreleased] 
+
+## [0.6.0] -- 2019-06-20
+- Fixed a bug with specifying a processed data output folder
+- Added a pre-check and warning message for `prefetch` command 
+
+
 ## [0.5.0] -- 2019-05-09
 
 - `geofetch` will now re-try a failed prefetch 3 times and warn if unsuccessful.
diff --git a/docs/file-specification.md b/docs/file-specification.md
@@ -0,0 +1,20 @@
+#  How to specify samples to download
+
+The command-line interface provides a way to give GSE or SRA accession IDs. By default, `geofetch` will download all the samples it can find in the accession you give it. What if you want to restrict the download to just a few samples? Or what if you want to combine samples from multiple accessions? If you want more control, either because you have multiple accessions or you want to specify a subset of samples, then you can use the *file-based sample specification*, in which you provide `geofetch` with a file listing your GSE/GSM accessions.
+
+## The file-based sample specification
+
+
+Create a file with 3 columns that correspond to `GSE`, `GSM`, and `Sample_name`. You may mix 1-, 2-, and 3-column lines in the file. An example input file could look like this:
+
+```console
+GSE123  GSM#### Sample1
+GSE123  GSM#### Sample2
+GSE123  GSM####
+GSE456
+```
+
+By default, `geofetch` will download all the samples in every included accession, but you can limit this by adding a second column with **GSM accessions** (which specify individual samples within a **GSE dataset**). If the second column is included, a third column may also be included and will be used
+as the sample_name; otherwise, the sample will be named according to the GEO Sample_title field. Any columns after the third will be ignored.
+
+This will download 3 particular GSM experiments from GSE123, and everything from GSE456. It will name the first two samples Sample1 and Sample2, and the third, plus any from GSE456, will have names according to GEO metadata.
diff --git a/docs/howto-limit.md b/docs/howto-limit.md
diff --git a/docs/howto.md b/docs/howto.md
diff --git a/docs/install.md b/docs/install.md
@@ -0,0 +1,37 @@
+# Installing geofetch
+
+## Prerequisites
+
+You must have the [sratoolkit from NCBI](https://www.ncbi.nlm.nih.gov/books/NBK158900/) installed, with the tools in your PATH. Once it's installed, you should check to make sure you can run `prefetch`. Also, make sure it's configured to store SRA files where you want them. For more information, see the section below on setting the data download location.
+
+## Setting data download location for `sratools`
+
+`geofetch` is using the [sratoolkit](https://trace.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=std) to download raw data from SRA -- which means it's stuck with the [default path for downloading SRA data](http://databio.org/posts/downloading_sra_data.html), which is in your home directory. So before you run `geofetch`, make sure you have set up your download location to the correct place. In our group, we use a shared group environment variable called `${SRARAW}`, which points to a shared folder (`${DATA}/sra`) where the whole group has access to downloaded SRA data. You can point the `sratoolkit` (and therefore `geofetch`) to use that location with this one-time configuration code:
+
+```
+echo "/repository/user/main/public/root = \"$DATA\"" > ${HOME}/.ncbi/user-settings.mkfg
+```
+
+Now `sratoolkit` will download data into an `/sra` folder in `${DATA}`, which is what `${SRARAW}` points to.
+
+If you are getting an error that the `.ncbi` folder does not exist in your home directory, you can just make a folder `.ncbi` with an empty file `user-settings.mkfg` and follow the same command above.
+
+## Installing geofetch
+
+Releases are posted as [GitHub releases](https://github.com/pepkit/geofetch/releases), or you can install from PyPI using `pip`:
+
+```bash
+pip install geofetch
+```
+
+Confirm it was successful by running it on the command line:
+
+```console
+geofetch --help
+```
+
+If the executable is not in your `$PATH`, append this to your `.bashrc` or `.profile` (or `.bash_profile` on macOS):
+
+```
+export PATH=~/.local/bin:$PATH
+```
diff --git a/docs/metadata_output.md b/docs/metadata_output.md
@@ -0,0 +1,19 @@
+# Metadata output
+
+For each GSE input accession (ACC), `geofetch` produces:
+
+- GSE_ACC####.soft a SOFT file (annotating the experiment itself)
+- GSM_ACC####.soft a SOFT file (annotating the samples within the experiment)
+- SRA_ACC####.soft a CSV file (annotating each SRA Run, retrieved from GSE->GSM->SRA)
+
+In addition, `geofetch` produces a single combined metadata file ("annoComb") for the whole input,
+including SRA and GSM annotations for each sample. Here, "combined" means that it will have
+rows for every sample in every GSE included in your input. So if you just gave a single GSE,
+then the combined file is the same as the GSE file. If any "merged" samples exist
+(samples for which there are multiple SRR Runs for a single SRX `Experiment`), the
+script will also produce a merge table CSV file with the relationships between
+SRX and SRR.
+
+The way this works: Starting from a GSE, select a subset of samples (GSM Accessions) provided, 
+and then obtain the SRX identifier for each of these from GEO. Now, query SRA for these SRX 
+accessions and get any associated SRR accessions. Finally, download all of these SRR data files.
diff --git a/docs/tutorial.md b/docs/tutorial.md
@@ -1,12 +1,8 @@
 # <img src="../img/geofetch_logo.svg" class="img-header">  tutorial
 
-## Prerequisites
-
-You must have the `sratoolkit` from NCBI installed, with tools in your `PATH` (check to make sure you can run `prefetch`). Make sure it's configured to store `sra` files where you want them. For more information, see [how to change sratools download location](howto-location.md).
-
 ## Download SRA data using `geofetch`
 
-To see full options, see the help menu with:
+Before starting, make sure you've followed the [installation instructions](install.md). To see your options, display the help menu:
 
 ```console
 geofetch -h
@@ -24,11 +20,17 @@ This will do 3 things:
 2. produce a sample annotation sheet, `PROJECT_NAME_annotation.csv`, in your metadata folder
 3. produce a project configuration file, `PROJECT_NAME_config.yaml`, in your metadata folder.
 
+Complete details about `geofetch` outputs are cataloged in the [metadata outputs reference](metadata_output.md).
 
 ## Finalize the project config and sample annotation
 
 That's basically it! `Geofetch` will have produced a general-purpose PEP for you, but you'll need to modify it for whatever purpose you have. For example, one common thing is to link to the pipeline you want to use by adding a `pipeline_interface` to the project config file. You may also need to adjust the `sample_annotation` file to make sure you have the right column names and attributes needed by the pipeline you're using. GEO submitters are notoriously bad at getting the metadata correct.
 
+## Selecting samples to download
+
+By default, `geofetch` downloads all the data for one accession of interest. If you need more fine-grained control, either because you have multiple accessions or you need a subset of samples within them, you can use the [file-based sample specification](file-specification.md).
+
+
 ## A few real-world examples
 
 ```console
diff --git a/geofetch/_version.py b/geofetch/_version.py
@@ -1,2 +1 @@
-__version__ = "0.5.0"
-
+__version__ = "0.6.0"
diff --git a/geofetch/geofetch.py b/geofetch/geofetch.py
@@ -35,7 +35,7 @@
 from ._version import __version__
 
 from logmuse import add_logging_options, logger_via_cli
-from ubiquerg import expandpath
+from ubiquerg import expandpath, is_command_callable
 
 
 _LOGGER = None
@@ -303,10 +303,17 @@ def update_columns(metadata, experiment_name, sample_name, read_type):
 def run_geofetch(cmdl):
     """ Main script driver/workflow """
 
+
+
     args = _parse_cmdl(cmdl)
     global _LOGGER
     _LOGGER = logger_via_cli(args, name="geofetch")
 
+    # check to make sure prefetch is callable
+    if not args.just_metadata and not args.processed:
+        if not is_command_callable("prefetch"):
+            raise SystemExit("You must first install the sratoolkit, with prefetch in your PATH.")
+
     if args.name:
         project_name = args.name
     else:
@@ -456,8 +463,8 @@ def render_env_var(ev):
                     file_url = pl[pl.keys()[0]].rstrip()
                     _LOGGER.info("File: " + str(file_url))
                     # download file
-                    if args.geofolder:
-                        data_folder = os.path.join(args.geofolder, acc_GSE)
+                    if args.geo_folder:
+                        data_folder = os.path.join(args.geo_folder, acc_GSE)
                         print(file_url, data_folder)
                         subprocess.call(['wget', file_url, '-P', data_folder])
 
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -7,15 +7,15 @@ pypi_name: geofetch
 nav:
   - Getting started:
     - Introduction: README.md
+    - Install and configure: install.md
     - Tutorial: tutorial.md
   - How-to Guides:
-    - How to specify multiple accessions: howto-limit.md
-    - How to control data location: howto-location.md
+    - Specifying samples to download: file-specification.md
   - Reference:
+    - Metadata output: metadata_output.md
     - Usage: usage.md
-    - API: autodoc_build/geofetch.md
     - FAQ: faq.md
-    - Support: support.md
+    - Support: http://github.com/pepkit/geofetch/issues
     - Contributing: contributing.md
     - Changelog: changelog.md
 
diff --git a/requirements/requirements-all.txt b/requirements/requirements-all.txt
@@ -2,4 +2,4 @@ attmap>=0.1.8
 colorama>=0.3.9
 peppy>=0.19.1
 logmuse>=0.0.2
-ubiquerg>=0.1
+ubiquerg>=0.4.4
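
Review note on `docs/file-specification.md`: the mixed 1-, 2-, and 3-column format described there could be read roughly as follows. This is a minimal sketch, not the parser `geofetch` actually uses. The function name `parse_sample_spec`, the whitespace-splitting rule, and the `GSM1`/`GSM2`/`GSM3` identifiers in the usage note are assumptions made for illustration only.

```python
def parse_sample_spec(text):
    """Hypothetical reader for the GSE / GSM / Sample_name file format.

    Returns {GSE: {GSM: sample_name_or_None}}. An empty inner dict means
    no GSM was listed for that GSE, i.e. download every sample in it.
    """
    spec = {}
    for line in text.splitlines():
        fields = line.split()  # assumption: columns are whitespace-separated
        if not fields:
            continue  # skip blank lines
        gse = fields[0]
        samples = spec.setdefault(gse, {})
        if len(fields) >= 2:
            gsm = fields[1]
            # A third column, if present, overrides the GEO Sample_title;
            # anything after the third column is ignored.
            samples[gsm] = fields[2] if len(fields) >= 3 else None
    return spec
```

Called on a file shaped like the documented example (with made-up GSM numbers in place of `GSM####`), the result would map `GSE123` to three named or unnamed samples and `GSE456` to an empty dict, which a caller could treat as "fetch everything".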
