
CCBR/parkit


parkit πŸ…ΏοΈ πŸš™

Park an archived project toolkit!



Background

When a project comes to completion, most analysts have folders (or .tar files) on Biowulf/Helix which contain either:

  • raw data, e.g. FASTQ files, or
  • processed data generated by CCBR pipelines and/or other downstream custom analysis tools.

The analyst can use parkit to park these folders directly onto HPCDME's CCBR_Archive object store vault. A typical project, say ccbrXYZ, can be parked at /CCBR_Archive/GRIDFTP/Project_CCBR-XYZ with collections "Analysis" and "Rawdata".

!!! note The projark command is preferred for CCBR project archiving

projark Quick Start (Recommended)

projark is the preferred interface for archiving and retrieving CCBR project data.

  • deposit: archive a local project folder to HPC-DME.
  • retrieve: pull archived objects back from HPC-DME to local scratch.

Version:

projark --version

Archive (deposit) example:

projark deposit --folder /data/CCBR/projects/CCBR-12345 --projectnumber 12345 --datatype Analysis

Archive (deposit) short-flag example:

projark deposit -f /data/CCBR/projects/CCBR-12345 -p CCBR-12345 -d Analysis -s 250 -k

Retrieve selected files example:

projark retrieve --projectnumber 12345 --datatype Analysis --filenames new.tar_0001,new.tar_0002 --unsplit

Retrieve selected files short-flag example:

projark retrieve -p CCBR-12345 -d Analysis -n new.tar_0001,new.tar_0002 -u

Retrieve full collection example (omit --filenames):

projark retrieve --projectnumber 12345 --datatype Analysis --unsplit

Useful deposit flags:

  • -f, --folder <path>: input folder to archive.
  • -p, --projectnumber <value>: project identifier.
  • -d, --datatype <Analysis|Rawdata>: datatype (default Analysis).
  • -t, --tarname <name>.tar: override the default tar name.
  • -s, --split-size-gb <N>: split threshold/chunk size (default 500 GB).
  • -k, --no-cleanup: keep scratch artifacts after successful transfer.

Useful retrieve flags:

  • -f, --folder /path: override default local base (/scratch/$USER/CCBR-<projectnumber>).
  • -p, --projectnumber <value>: project identifier.
  • -d, --datatype <Analysis|Rawdata>: datatype (default Analysis).
  • -n, --filenames a,b,c: retrieve specific objects.
  • -u, --unsplit (alias --unspilt): merge downloaded split tar parts.
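For reference, --unsplit is assumed here to recombine byte-split parts by simple concatenation, in lexicographic order, the way GNU split produces them; the part names below mimic the new.tar_0001 pattern from the retrieve examples. A minimal sketch of the manual equivalent:

```shell
# Sketch: manually recombining byte-split tar parts (assumed equivalent of --unsplit).
# Build a sample archive and split it with part names like the retrieve examples use:
mkdir -p demo && echo "sample" > demo/file.txt
tar -cf orig.tar demo
split -b 2048 --numeric-suffixes=1 -a 4 orig.tar new.tar_
# Concatenate the parts in lexicographic order and verify against the original:
cat new.tar_0* > new.tar
cmp orig.tar new.tar && echo "recombined OK"
```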

--projectnumber normalization:

  • Accepts any non-empty value.
  • Repeated leading prefixes ccbr, CCBR, Ccbr are removed (each may be followed by _, -, or nothing).
  • Examples:
    • CCBR-1234 -> 1234
    • CCBR-abcd -> abcd
    • ccbr_ccbr-1234abc -> 1234abc
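The normalization rules above can be sketched as a small bash function (illustrative only; normalize is not a parkit command):

```shell
# Repeatedly strip a leading ccbr/CCBR/Ccbr prefix, each optionally followed by _ or -
normalize() {
  local v="$1"
  while [[ "$v" =~ ^(ccbr|CCBR|Ccbr)[_-]?(.*)$ ]]; do
    v="${BASH_REMATCH[2]}"
  done
  printf '%s\n' "$v"
}
normalize CCBR-1234          # -> 1234
normalize CCBR-abcd          # -> abcd
normalize ccbr_ccbr-1234abc  # -> 1234abc
```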

Path behavior:

  • --folder FASTQ and --folder FASTQ/ are both accepted.
  • Relative --folder values are resolved to absolute paths before use.
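Both behaviors can be checked with realpath (a demonstration of the equivalence, not parkit's internal code):

```shell
# A trailing slash and a relative path both resolve to the same absolute directory
mkdir -p FASTQ
realpath FASTQ/   # absolute path of ./FASTQ
[ "$(realpath FASTQ)" = "$(realpath FASTQ/)" ] && echo "equivalent"
```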

Runtime logging and notifications:

  • Log lines are timestamped in ISO 8601 format (for example 2026-02-19T14:37:52-05:00).
  • Completion/failure email is sent to $USER@nih.gov.
  • Notification sender is NCICCBR@mail.nih.gov.
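The timestamp format in the logs matches what GNU date emits with --iso-8601=seconds (shown for reference; parkit may generate it differently):

```shell
# ISO 8601 timestamp with numeric UTC offset, e.g. 2026-02-19T14:37:52-05:00
date --iso-8601=seconds
```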

Prerequisites:

  • On Helix or Biowulf, you can access parkit by activating the appropriate conda environment:
mamba activate /vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3/envs/parkit
  • HPC_DME_APIs package needs to be cloned and set up correctly. Follow this setup guide: https://ccbr.github.io/HowTos/docs/HPCDME/setup.html

  • The HPC_DM_UTILS environment variable must be set before calling parkit. It also needs to be passed as an argument (--hpcdmutilspath) to the parkit_folder2hpcdme and parkit_tarball2hpcdme end-to-end workflows.

  • HPC_DM_JAVA_VERSION must currently be at least 23.

  • Run all operations from tmux, screen, or an Open OnDemand graphical session.

  • Disclaimer: Open OnDemand is currently available only on Biowulf compute nodes, not directly on Helix. Because projark currently runs only on Helix, use tmux or screen there; Open OnDemand becomes an option only once it is available on Helix.

  • If mamba is not already in your PATH, add the following block to your ~/.bashrc or ~/.zshrc, then run mamba activate /vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3/envs/parkit:

# >>> mamba initialize >>>
# !! Contents within this block are managed by 'mamba shell init' !!
export MAMBA_EXE='/vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3/bin/mamba';
export MAMBA_ROOT_PREFIX='/vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3';
__mamba_setup="$("$MAMBA_EXE" shell hook --shell zsh --root-prefix "$MAMBA_ROOT_PREFIX" 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__mamba_setup"
else
    alias mamba="$MAMBA_EXE"  # Fallback on help from mamba activate
fi
unset __mamba_setup
# <<< mamba initialize <<<

Usage:

parkit --help

Output:

usage: parkit [-h] {createtar,tarprep,checkapisync,syncapi,createmetadata,createemptycollection,deposittar} ...

parkit subcommands to park data in HPCDME

positional arguments:
  {createtar,tarprep,checkapisync,syncapi,createmetadata,createemptycollection,deposittar}
                        Subcommand to run
    createtar           create tarball(and its filelist) from a project folder.
    tarprep             prepare tarball for upload.
    checkapisync        check whether the HPC_DME_APIs repository is in sync with upstream
    syncapi             sync HPC_DME_APIs with upstream and generate a fresh token
    createmetadata      create the metadata.json file required for a tarball (and its filelist)
    createemptycollection
                        creates empty project and analysis collections
    deposittar          deposit tarball(and filelist) into vault

options:
  -h, --help            show this help message and exit

parkit checkapisync

Check whether your local HPC_DME_APIs git repository is synchronized with its upstream remote.

  • Repo resolution order:
    1. HPC_DME_APIs environment variable
    2. Parent directory of HPC_DM_UTILS (if it ends with utils)
    3. Fallback path: /data/kopardevn/SandBox/HPC_DME_APIs
  • Optional override:
    • parkit checkapisync --repo /path/to/HPC_DME_APIs
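The resolution order can be sketched as a shell function (an illustration of the documented fallback order, not parkit's actual code):

```shell
# Resolve the HPC_DME_APIs repo path using the documented fallback order
resolve_repo() {
  if [ -n "${HPC_DME_APIs:-}" ]; then
    printf '%s\n' "$HPC_DME_APIs"
  elif [ "$(basename "${HPC_DM_UTILS:-/}")" = "utils" ]; then
    dirname "$HPC_DM_UTILS"
  else
    printf '%s\n' "/data/kopardevn/SandBox/HPC_DME_APIs"
  fi
}
resolve_repo   # prints the fallback path if neither variable is set
```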

Example:

parkit checkapisync

Output:

WARNING:HPC_DM_JAVA_VERSION is not set, setting it to 23.0.2
WARNING:If you need a different version, please set it explicitly in your environment.
HPC_DME_APIs=/data/kopardevn/GitRepos/HPC_DME_APIs
branch=master
local=08a93b22a74bb51455489b70bcc36ea528c74d23
remote=9d01b7b6686266707c9720285aecefd4f46323f3
base=dabfe0a5789ba89a1f17f1c0868eaad284f0b62a
OUT OF SYNC: local and upstream have diverged.
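The three hashes (local HEAD, upstream, and their merge base) map to a sync state following git's standard merge-base logic; a small classifier sketching that mapping (sync_state is illustrative, not part of parkit):

```shell
# Classify repo state from the local, remote, and merge-base hashes printed above
sync_state() {  # args: <local> <remote> <base>
  if   [ "$1" = "$2" ]; then echo "IN SYNC"
  elif [ "$1" = "$3" ]; then echo "BEHIND: upstream has new commits (git pull fast-forwards)"
  elif [ "$2" = "$3" ]; then echo "AHEAD: local has unpushed commits"
  else                       echo "OUT OF SYNC: local and upstream have diverged"
  fi
}
sync_state 08a93b2 9d01b7b dabfe0a   # -> OUT OF SYNC: local and upstream have diverged
```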

parkit syncapi

Sync local HPC_DME_APIs and generate a fresh API token. You can also use this command just to generate/refresh a new token (even if git is already up to date).

What it does:

  • resolves HPC_DME_APIs repo path
  • runs git pull
  • runs source <repo>/functions && dm_generate_token
  • prompts for password/token input interactively in your terminal
  • prints success message on completion

Example:

parkit syncapi

Output:

HPC_DME_APIs=/data/kopardevn/GitRepos/HPC_DME_APIs
... git pull output ...
... dm_generate_token prompts for password ...
Now you are in sync and ready to run other parkit commands.

Example failure and interpretation:

parkit syncapi

Output:

WARNING:HPC_DM_JAVA_VERSION is not set, setting it to 23.0.2
WARNING:If you need a different version, please set it explicitly in your environment.
HPC_DME_APIs=/data/kopardevn/GitRepos/HPC_DME_APIs
Merge made by the 'recursive' strategy.
 .github/workflows/build-deploy-api.yml | 8 ++++++--
 .github/workflows/build-dev-api.yml    | 4 +++-
 .github/workflows/deploy-dev-api.yml   | 4 +++-
 3 files changed, 12 insertions(+), 4 deletions(-)
Enter host password for user 'kopardevn':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6065    0  6065    0     0  49713      0 --:--:-- --:--:-- --:--:-- 49713
ERROR: No token found in /data/kopardevn/GitRepos/HPC_DME_APIs/utils/temp/log
ERROR MESSAGE: Access Denied: LDAP authentication failed
syncapi failed: dm_generate_token was not successful.

What this means:

  • git pull succeeded (your local API repo updated).
  • Token generation failed at authentication time (LDAP authentication failed).
  • You are synced to latest code, but not ready for data operations until token generation succeeds.
  • Re-run parkit syncapi and enter valid credentials; if it still fails, inspect HPC_DME_APIs/utils/temp/log.

Example:

  • Say you want to archive the folder /data/CCBR/projects/ccbr_12345 to the /CCBR_Archive/GRIDFTP/Project_CCBR-12345 collection on HPC-DME.
  • You can run the following commands sequentially to do this:
# create the tarball
parkit createtar --folder /data/CCBR/projects/ccbr_12345
# the above command creates the following files:
# - ccbr_12345.tar
# - ccbr_12345.tar.md5
# - ccbr_12345.tar.filelist
# - ccbr_12345.tar.filelist.md5

# create an empty collection on HPC-DME
parkit createemptycollection --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345 --projectdesc "testing" --projecttitle "test project 1"
# the above command creates collections:
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Analysis
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Rawdata

# create required metadata
parkit createmetadata --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is rawdata, add "--collectiontype Rawdata" to the above command line

# deposit the tar into HPC-DME
parkit deposittar --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is rawdata, add "--collectiontype Rawdata" to the above command line

# a number of extra files are created in the process
ls /data/CCBR/projects/ccbr_12345.tar*

# delete the recently parked project folder contents (note: "*" does not match hidden files)
rm -rf /data/CCBR/projects/ccbr_12345/*

# copy filelist into the empty project folder for future quick reference
cp /data/CCBR/projects/ccbr_12345.tar.filelist /data/CCBR/projects/ccbr_12345/ccbr_12345.tar.filelist

# delete files created by parkit
rm -f /data/CCBR/projects/ccbr_12345.tar*

# test results with
dm_get_collection /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# Done!

Output from ls /data/CCBR/projects/ccbr_12345.tar*:

/data/CCBR/projects/ccbr_12345.tar           /data/CCBR/projects/ccbr_12345.tar.filelist.md5            /data/CCBR/projects/ccbr_12345.tar.md5
/data/CCBR/projects/ccbr_12345.tar.filelist  /data/CCBR/projects/ccbr_12345.tar.filelist.metadata.json  /data/CCBR/projects/ccbr_12345.tar.metadata.json
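Before deleting local copies, the .md5 sidecars can be used to verify integrity; this demo assumes they use the standard md5sum format:

```shell
# Demo: verify a tarball against its .md5 sidecar (standard md5sum format assumed)
echo "demo" > ccbr_12345.tar
md5sum ccbr_12345.tar > ccbr_12345.tar.md5
md5sum -c ccbr_12345.tar.md5   # -> ccbr_12345.tar: OK
```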

We also have end-to-end slurm-supported folder-to-hpcdme and tarball-to-hpcdme workflows:

  • parkit_folder2hpcdme
  • parkit_tarball2hpcdme
  • projark [ recommended for archiving CCBR projects to the GRIDFTP folder under CCBR_Archive ]

If run with --executor slurm, these workflows interface with the job scheduler on Biowulf and submit the individual steps of the E2E workflow as interdependent jobs.

parkit_folder2hpcdme

parkit_folder2hpcdme --help

Output:

usage: parkit_folder2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--folder FOLDER] [--dest DEST] [--projectdesc PROJECTDESC]
                            [--projecttitle PROJECTTITLE] [--rawdata] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH [--version]

End-to-end parkit: Folder 2 HPCDME

options:
  -h, --help            show this help message and exit
  --restartfrom RESTARTFROM
                        if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
  --executor EXECUTOR   slurm or local
  --folder FOLDER       project folder to archive
  --dest DEST           vault collection path (Analysis goes under here!)
  --projectdesc PROJECTDESC
                        project description
  --projecttitle PROJECTTITLE
                        project title
  --rawdata             If tarball is rawdata and needs to go under folder Rawdata
  --cleanup             post transfer step to delete local files
  --hpcdmutilspath HPCDMUTILSPATH
                        what should be the value of env var HPC_DM_UTILS
  --version             print version

parkit_tarball2hpcdme

parkit_tarball2hpcdme --help

Output:

usage: parkit_tarball2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--tarball TARBALL] [--dest DEST]
                             [--projectdesc PROJECTDESC] [--projecttitle PROJECTTITLE] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH
                             [--version]

End-to-end parkit: Tarball 2 HPCDME

options:
  -h, --help            show this help message and exit
  --restartfrom RESTARTFROM
                        if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
  --executor EXECUTOR   slurm or local
  --tarball TARBALL     project tarball to archive
  --dest DEST           vault collection path (Analysis goes under here!)
  --projectdesc PROJECTDESC
                        project description
  --projecttitle PROJECTTITLE
                        project title
  --cleanup             post transfer step to delete local files
  --hpcdmutilspath HPCDMUTILSPATH
                        what should be the value of env var HPC_DM_UTILS
  --version             print version
