Pipeline stalls when prefetch can't reach a sample #5

Open
@rabdill

Example: PRJEB6518 has 534 samples in the database, but some downloads in the prefetch step didn't go according to plan:

2023-01-07T04:23:15 prefetch.3.0.1: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2023-01-07T04:23:15 prefetch.3.0.1: 295) Downloading 'ERR531601.sralite'...
2023-01-07T04:23:15 prefetch.3.0.1: SRA Lite file is being retrieved, if this is different from your preference, it may be due to current file availability.
2023-01-07T04:23:15 prefetch.3.0.1:  Downloading via HTTPS...
2023-01-07T04:23:22 prefetch.3.0.1:  HTTPS download succeed
2023-01-07T04:23:22 prefetch.3.0.1:  'ERR531601.sralite' is valid
2023-01-07T04:23:22 prefetch.3.0.1: 295) 'ERR531601.sralite' was downloaded successfully

Prefetch downloads the sralite version because of "current file availability." (The same kind of stall might also happen with 404s, as we've run into previously.) We need a way to:

  1. determine how many samples are "enough" to move forward, and more worryingly,
  2. convince snakemake to move forward with however many there are.
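
One possible shape for the first point is a simple fraction threshold. This is only a sketch: the function name `enough_samples` and the default `min_fraction` of 0.9 are placeholders made up for illustration, not values anyone has agreed on.

```python
def enough_samples(available, expected, min_fraction=0.9):
    """Return True if enough samples survived prefetch to keep going.

    available:    number of samples that downloaded as expected
    expected:     number of samples listed for the project
    min_fraction: hypothetical cutoff; 0.9 is an arbitrary placeholder
    """
    if expected <= 0:
        raise ValueError("expected must be a positive sample count")
    return available / expected >= min_fraction
```

For a project like PRJEB6518, `expected` would be 534 and `available` would be whatever is left after dropping the failed samples.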

My first guess at how to tackle this would be to extract the names of the failed samples, then rewrite the SraAccList.txt file to exclude them. This would be easier for the sralite case above than for the 404s because the "failed" downloads are left behind: when the prefetch job fails, it removes all the downloaded .sra files, but it doesn't remove the .sralite files that it wasn't expecting to see. So we can check the project directory for which sample directories are still there, then remove those samples from the list.
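
A rough sketch of that idea in Python, for the sralite case only. It assumes the leftover files are named `<accession>.sralite` (as in the log above) and sit somewhere under the project directory, and that SraAccList.txt has one accession per line. The function names are made up for illustration, and none of this has been tried against the real pipeline.

```python
from pathlib import Path


def find_sralite_accessions(project_dir):
    """Return the accessions that have a leftover .sralite file."""
    # Assumption: files are named <accession>.sralite. rglob searches all
    # subdirectories because the exact directory depth is not confirmed.
    return {path.stem for path in Path(project_dir).rglob("*.sralite")}


def filter_accession_list(acc_list_path, excluded):
    """Rewrite the accession list in place without the excluded accessions.

    Returns the list of accessions that were kept.
    """
    acc_list_path = Path(acc_list_path)
    lines = acc_list_path.read_text().splitlines()
    accessions = [line.strip() for line in lines if line.strip()]
    kept = [acc for acc in accessions if acc not in excluded]
    acc_list_path.write_text("".join(f"{acc}\n" for acc in kept))
    return kept
```

Usage would be something like `filter_accession_list("SraAccList.txt", find_sralite_accessions(project_dir))` before re-running snakemake. The 404 case would need a different way to find the failed accessions, since nothing is left behind on disk.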
