a smaller test dataset #33

Open
@aryarm

Description

Our current test dataset comprises all of chr1 in two different samples: the Jurkat sample and the MOLT4 cell line. It takes about an hour to run the entire pipeline with this dataset.

Ideally, we would have a dataset that runs in under 10 minutes or so. It could then be incorporated into a GitHub CI pipeline that runs automatically on the release of each major and minor version increment, so that we know when a change we've made to the code leads to a change in the results.
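
One way to detect "a change in the results" is to compare the pipeline's outputs against a stored set of expected outputs. The following is only a minimal sketch of that idea, not how the pipeline or snakemake does it: the function names and the directory layout (one directory of expected files, one of freshly generated files) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_outputs(expected_dir, actual_dir):
    """List files (relative paths) that are missing from, or differ in, actual_dir.

    Both directory arguments are hypothetical: expected_dir holds outputs from a
    trusted run, actual_dir holds outputs from the run being checked.
    """
    expected_dir, actual_dir = Path(expected_dir), Path(actual_dir)
    changed = []
    for exp in sorted(p for p in expected_dir.rglob("*") if p.is_file()):
        rel = exp.relative_to(expected_dir)
        act = actual_dir / rel
        if not act.is_file() or file_digest(exp) != file_digest(act):
            changed.append(rel.as_posix())
    return changed
```

A byte-for-byte comparison like this can flag files that embed timestamps or command lines (compressed files, for example) even when the results themselves are unchanged, so such outputs would probably need a content-aware comparison instead.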

  • find SNVs and indels supported by all callers
  • choose just one or two peaks that overlap those variants from each of the two samples
  • subset the example dataset to only the reads that overlap those peaks
  • also try to subset the reference genome that is packaged with the example data, since the ref genome currently appears to be the largest file
  • rerun the pipeline with the smaller dataset and tweak the dataset as necessary to make it run quickly
  • use snakemake --generate-unit-tests to create a bunch of tests that can be executed using pytest
    • I'm running into issues with this. It doesn't work for outputs marked as pipe, and there are some problems with other directories (see snakemake/snakemake#1104, "edge cases fail with --generate-unit-tests")
    • fix issues and ensure test coverage is appropriate
    • remove any unnecessary tests so that the test directory is small enough to be properly included in version history (edit: this won't be possible after all, because the test directory has to include the outputs of each rule)
  • (optionally) create a GitHub action like this one to execute pytest upon each major or minor version increment and confirm the tests pass successfully
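
The first three steps above (consensus variants, peak selection, read regions) could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: variants are assumed to be VCF-style tuples `(chrom, pos, ref, alt)` with 1-based positions, peaks are assumed to be BED-style tuples `(chrom, start, end)` that are 0-based and half-open, and all function names are hypothetical.

```python
def consensus_variants(calls_by_caller):
    """Return the variants reported by every caller (set intersection)."""
    call_sets = [set(calls) for calls in calls_by_caller.values()]
    return set.intersection(*call_sets) if call_sets else set()


def peaks_with_variants(peaks, variants, max_peaks=2):
    """Pick up to max_peaks peaks that contain at least one variant.

    A 1-based position pos lies inside a 0-based half-open peak
    [start, end) exactly when start < pos <= end.
    """
    chosen = []
    for chrom, start, end in peaks:
        if len(chosen) >= max_peaks:
            break
        if any(c == chrom and start < pos <= end for c, pos, _ref, _alt in variants):
            chosen.append((chrom, start, end))
    return chosen


def to_region(peak):
    """Convert a BED-style peak to a 1-based inclusive 'chrom:start-end' string."""
    chrom, start, end = peak
    return f"{chrom}:{start + 1}-{end}"


# Made-up example data; the caller names are placeholders.
calls = {
    "caller_a": [("chr1", 150, "A", "G"), ("chr1", 900, "T", "TA")],
    "caller_b": [("chr1", 150, "A", "G")],
}
peaks = [("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 850, 950)]

shared = consensus_variants(calls)
regions = [to_region(p) for p in peaks_with_variants(peaks, shared)]
```

The resulting region strings could then be handed to a tool that extracts reads by region from an indexed alignment file (for example `samtools view`), and the same peak coordinates, padded as needed, could guide the subsetting of the reference genome.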

Labels

enhancement: New feature or request
