Commit de1df99

Merge pull request #33 from pepkit/dev
0.6.0
2 parents 652231d + c63b792 commit de1df99

File tree

13 files changed

+118
-106
lines changed

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -84,4 +84,6 @@ peppy.egg-info/
8484

8585
#ipynm checkpoints
8686
*ipynb_checkpoints*
87-
*.egg-info*
87+
*.egg-info*
88+
89+

docs/README.md

Lines changed: 9 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -2,26 +2,25 @@
22

33
[![PEP compatible](http://pepkit.github.io/img/PEP-compatible-green.svg)](http://pepkit.github.io)
44

5-
`geofetch` is a command-line tool that does two things when given one or more GEO/SRA accessions:
5+
`geofetch` is a command-line tool that downloads and organizes data and metadata from GEO and SRA. When given one or more GEO/SRA accessions, `geofetch` will:
66

7-
- Downloads either raw or processed data from either GEO or SRA
7+
- Download either raw or processed data from either GEO or SRA
88
- Produce a standardized [PEP](http://pepkit.github.io) sample annotation sheet of public metadata. This makes it really easy to run [looper](https://pepkit.github.io/docs/looper/)-compatible pipelines on public datasets by handling data acquisition and metadata formatting and standardization for you.
99

1010
You can use it with the [sra_convert](http://github.com/pepkit/sra_convert) pipeline, a [pypiper](http://pypiper.readthedocs.io) pipeline that converts SRA files into BAM files.
1111

12+
## Quick demo
1213

13-
## Installing
14+
`geofetch` runs on the command line. This command will download the metadata for the given GSE number.
1415

15-
```bash
16-
pip install geofetch
16+
```console
17+
geofetch -i GSE95654
1718
```
1819

19-
## Quick start
20-
21-
Now, run it on the command line:
20+
You can add `--just-metadata` if you don't want to download the raw SRA files.
2221

2322
```console
24-
geofetch --help
23+
geofetch -i GSE95654 --just-metadata
2524
```
2625

27-
Next, check out the [usage](usage) reference, or for a detailed walkthrough, head on over to the [tutorial](tutorial).
26+
For more details, check out the [usage](usage.md) reference, [installation instructions](install.md), or head on over to the [tutorial](tutorial.md) for a detailed walkthrough.

docs/changelog.md

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,12 @@
11
# Changelog
22

3+
## [Unreleased]
4+
5+
## [0.6.0] -- 2019-06-20
6+
- Fixed a bug with specifying a processed data output folder
7+
- Added a pre-check and warning message for `prefetch` command
8+
9+
310
## [0.5.0] -- 2019-05-09
411

512
- `geofetch` will now re-try a failed prefetch 3 times and warn if unsuccessful.

docs/file-specification.md

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
# How to specify samples to download
2+
3+
The command-line interface provides a way to give GSE or SRA accession IDs. By default, `geofetch` will download all the samples it can find in the accession you give it. What if you want to restrict the download to just a few samples? Or what if you want to combine samples from multiple accessions? If you want more control, either because you have multiple accessions or you want to specify a subset of samples, then you can use the *file-based sample specification*, in which you provide `geofetch` with a file listing your GSE/GSM accessions.
4+
5+
## The file-based sample specification
6+
7+
8+
Create a file with 3 columns that correspond to `GSE`, `GSM`, and `Sample_name`. You may mix 1, 2, and 3 column lines in the file. An example input file could look like this:
9+
10+
```console
11+
GSE123 GSM#### Sample1
12+
GSE123 GSM#### Sample2
13+
GSE123 GSM####
14+
GSE456
15+
```
16+
17+
By default, `geofetch` will download all the samples in every included accession, but you can limit this by adding a second column with **GSM accessions** (which specify individual samples within a **GSE dataset**). If the second column is included, a third column may also be included and will be used
18+
as the sample_name; otherwise, the sample will be named according to the GEO Sample_title field. Any columns after the third will be ignored.
19+
20+
This will download 3 particular GSM experiments from GSE123, and everything from GSE456. It will name the first two samples Sample1 and Sample2, and the third, plus any from GSE456, will have names according to GEO metadata.
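The column rules above can be sketched in a few lines of Python. This is only an illustration, not the parser `geofetch` itself uses: the function name `parse_sample_spec` is hypothetical, splitting on whitespace is an assumption (the specification does not state the delimiter), and the `GSM1`, `GSM2`, `GSM3` identifiers are placeholders.

```python
def parse_sample_spec(lines):
    """Parse file-based sample specification lines (illustrative sketch).

    Returns a dict mapping each GSE accession to a dict of
    {GSM accession: sample name or None}. An empty inner dict means
    "take every sample in this GSE"; a name of None means "name the
    sample from GEO metadata".
    """
    spec = {}
    for raw in lines:
        fields = raw.split()  # assumption: columns are whitespace-separated
        if not fields:
            continue  # skip blank lines
        gse = fields[0]
        samples = spec.setdefault(gse, {})
        if len(fields) >= 2:
            gsm = fields[1]
            # The third column, if present, is the sample name;
            # any columns after the third are ignored.
            samples[gsm] = fields[2] if len(fields) >= 3 else None
    return spec


if __name__ == "__main__":
    example = [
        "GSE123 GSM1 Sample1",
        "GSE123 GSM2 Sample2",
        "GSE123 GSM3",
        "GSE456",
    ]
    print(parse_sample_spec(example))
```

Because this sketch splits on whitespace, a sample name containing spaces would be truncated to its first word.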

docs/howto-limit.md

Lines changed: 0 additions & 48 deletions
This file was deleted.

docs/howto.md

Lines changed: 0 additions & 32 deletions
This file was deleted.

docs/install.md

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
# Installing geofetch
2+
3+
## Prerequisites
4+
5+
You must have the [sratoolkit from NCBI](https://www.ncbi.nlm.nih.gov/books/NBK158900/) installed, with the tools in your PATH. Once it's installed, you should check to make sure you can run `prefetch`. Also, make sure it's configured to store SRA files where you want them. For more information, see how to change sratools download location.
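As a rough way to check this prerequisite from Python, the sketch below tests whether an executable named `prefetch` can be found on your `PATH`. The helper name `tool_on_path` is hypothetical, and this is not necessarily how `geofetch` performs its own pre-check.

```python
import shutil


def tool_on_path(tool="prefetch"):
    """Return True if an executable called `tool` is found on PATH."""
    return shutil.which(tool) is not None


if __name__ == "__main__":
    if tool_on_path("prefetch"):
        print("prefetch found on PATH")
    else:
        print("prefetch not found; install the sratoolkit and add it to PATH")
```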
6+
7+
## Setting data download location for `sratools`
8+
9+
`geofetch` is using the [sratoolkit](https://trace.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=std) to download raw data from SRA -- which means it's stuck with the [default path for downloading SRA data](http://databio.org/posts/downloading_sra_data.html), which is in your home directory. So before you run `geofetch`, make sure you have set up your download location to the correct place. In our group, we use a shared group environment variable called `${SRARAW}`, which points to a shared folder (`${DATA}/sra`) where the whole group has access to downloaded SRA data. You can point the `sratoolkit` (and therefore `geofetch`) to use that location with this one-time configuration code:
10+
11+
```
12+
echo "/repository/user/main/public/root = \"$DATA\"" > ${HOME}/.ncbi/user-settings.mkfg
13+
```
14+
15+
Now `sratoolkit` will download data into an `/sra` folder in `${DATA}`, which is what `${SRARAW}` points to.
16+
17+
If you are getting an error that the `.ncbi` folder does not exist in your home directory, you can just make a folder `.ncbi` with an empty file `user-settings.mkfg` and follow the same command above.
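The same one-time configuration, including creating the `.ncbi` folder if it is missing, could be sketched in Python as follows. The function name `write_sra_settings` and the `/data/shared` path are hypothetical; like the `echo ... >` command above, this sketch overwrites any existing `user-settings.mkfg`.

```python
import tempfile
from pathlib import Path


def write_sra_settings(data_dir, home=None):
    """Write the sratoolkit download root to ~/.ncbi/user-settings.mkfg.

    Creates the .ncbi folder if needed and overwrites any existing
    settings file, mirroring the shell one-liner.
    """
    home = Path(home) if home is not None else Path.home()
    ncbi_dir = home / ".ncbi"
    ncbi_dir.mkdir(parents=True, exist_ok=True)
    settings = ncbi_dir / "user-settings.mkfg"
    settings.write_text('/repository/user/main/public/root = "{}"\n'.format(data_dir))
    return settings


if __name__ == "__main__":
    # Demonstrate against a throwaway directory instead of the real home.
    with tempfile.TemporaryDirectory() as fake_home:
        path = write_sra_settings("/data/shared", home=fake_home)
        print(path.read_text())
```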
18+
19+
## Installing geofetch
20+
21+
Releases are posted as [GitHub releases](https://github.com/pepkit/geofetch/releases), or you can install from PyPI using `pip`:
22+
23+
```bash
24+
pip install geofetch
25+
```
26+
27+
Confirm it was successful by running it on the command line:
28+
29+
```console
30+
geofetch --help
31+
```
32+
33+
If the executable is not in your $PATH, append this to your `.bashrc` or `.profile` (or `.bash_profile` on macOS):
34+
35+
```
36+
export PATH=~/.local/bin:$PATH
37+
```

docs/metadata_output.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
# Metadata output
2+
3+
For each GSE input accession (ACC), `geofetch` produces:
4+
5+
- GSE_ACC####.soft a SOFT file (annotating the experiment itself)
6+
- GSM_ACC####.soft a SOFT file (annotating the samples within the experiment)
7+
- SRA_ACC####.soft a CSV file (annotating each SRA Run, retrieved from GSE->GSM->SRA)
8+
9+
In addition, `geofetch` produces a single combined metadata file ("annoComb") for the whole input,
10+
including SRA and GSM annotations for each sample. Here, "combined" means that it will have
11+
rows for every sample in every GSE included in your input. So if you just gave a single GSE,
12+
then the combined file is the same as the GSE file. If any "merged" samples exist
13+
(samples for which there are multiple SRR Runs for a single SRX `Experiment`), the
14+
script will also produce a merge table CSV file with the relationships between
15+
SRX and SRR.
16+
17+
The way this works: Starting from a GSE, select a subset of samples (GSM Accessions) provided,
18+
and then obtain the SRX identifier for each of these from GEO. Now, query SRA for these SRX
19+
accessions and get any associated SRR accessions. Finally, download all of these SRR data files.
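To illustrate the "merged" samples idea, here is a minimal sketch that groups SRR runs by their SRX experiment and keeps only the experiments with more than one run. It is not the code `geofetch` uses: the function name `build_merge_table` is hypothetical, and the accessions in the example are made-up placeholders.

```python
from collections import defaultdict


def build_merge_table(run_records):
    """Group SRR runs by SRX experiment (illustrative sketch).

    run_records is an iterable of (srx, srr) pairs. The result maps
    each SRX that has more than one run to its list of SRR accessions.
    """
    runs_by_experiment = defaultdict(list)
    for srx, srr in run_records:
        runs_by_experiment[srx].append(srr)
    # Only experiments with multiple runs need an entry in the merge table.
    return {srx: srrs for srx, srrs in runs_by_experiment.items() if len(srrs) > 1}


if __name__ == "__main__":
    records = [("SRX1", "SRR1"), ("SRX1", "SRR2"), ("SRX2", "SRR3")]
    print(build_merge_table(records))
```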

docs/tutorial.md

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,8 @@
11
# <img src="../img/geofetch_logo.svg" class="img-header"> tutorial
22

3-
## Prerequisites
4-
5-
You must have the `sratoolkit` from NCBI installed, with tools in your `PATH` (check to make sure you can run `prefetch`). Make sure it's configured to store `sra` files where you want them. For more information, see [how to change sratools download location](howto-location.md).
6-
73
## Download SRA data using `geofetch`
84

9-
To see full options, see the help menu with:
5+
Before starting, make sure you've followed the [installation instructions](install.md). To see your options, display the help menu:
106

117
```console
128
geofetch -h
@@ -24,11 +20,17 @@ This will do 3 things:
2420
2. produce a sample annotation sheet, `PROJECT_NAME_annotation.csv`, in your metadata folder
2521
3. produce a project configuration file, `PROJECT_NAME_config.yaml`, in your metadata folder.
2622

23+
Complete details about geofetch outputs are cataloged in the [metadata outputs reference](metadata_output.md).
2724

2825
## Finalize the project config and sample annotation
2926

3027
That's basically it! `Geofetch` will have produced a general-purpose PEP for you, but you'll need to modify it for whatever purpose you have. For example, one common thing is to link to the pipeline you want to use by adding a `pipeline_interface` to the project config file. You may also need to adjust the `sample_annotation` file to make sure you have the right column names and attributes needed by the pipeline you're using. GEO submitters are notoriously bad at getting the metadata correct.
3128

29+
## Selecting samples to download
30+
31+
By default, `geofetch` downloads all the data for one accession of interest. If you need more fine-grained control, either because you have multiple accessions or you need a subset of samples within them, you can use the [file-based sample specification](file-specification.md).
32+
33+
3234
## A few real-world examples
3335

3436
```console

geofetch/_version.py

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,2 +1 @@
1-
__version__ = "0.5.0"
2-
1+
__version__ = "0.6.0"
