parkit: park archived project data!
Background:

When a project comes to completion, most analysts have folders (or `.tar` files) on Biowulf/Helix which contain either:

- raw data, e.g. FASTQ files, or
- processed data generated by CCBR pipelines and/or other downstream custom analysis tools.

The analyst can use `parkit` to park these folders directly onto HPC-DME's CCBR_Archive object-store vault. A typical project, say ccbrXYZ, can be parked at `/CCBR_Archive/GRIDFTP/Project_CCBR-XYZ` with collections "Analysis" and "Rawdata".
!!! note "projark is preferred for CCBR project archiving"

    `projark` is the preferred interface for archiving and retrieving CCBR project data.

    - `deposit`: archive a local project folder to HPC-DME.
    - `retrieve`: pull archived objects back from HPC-DME to local scratch.
Version:

```bash
projark --version
```

Archive (deposit) example:

```bash
projark deposit --folder /data/CCBR/projects/CCBR-12345 --projectnumber 12345 --datatype Analysis
```

Archive (deposit) short-flag example:

```bash
projark deposit -f /data/CCBR/projects/CCBR-12345 -p CCBR-12345 -d Analysis -s 250 -k
```

Retrieve selected files example:

```bash
projark retrieve --projectnumber 12345 --datatype Analysis --filenames new.tar_0001,new.tar_0002 --unsplit
```

Retrieve selected files short-flag example:

```bash
projark retrieve -p CCBR-12345 -d Analysis -n new.tar_0001,new.tar_0002 -u
```

Retrieve full collection example (omit `--filenames`):

```bash
projark retrieve --projectnumber 12345 --datatype Analysis --unsplit
```

Useful `deposit` flags:

- `-f, --folder <path>`: input folder to archive.
- `-p, --projectnumber <value>`: project identifier.
- `-d, --datatype <Analysis|Rawdata>`: datatype (default: `Analysis`).
- `-t, --tarname <name>.tar`: override the default tar name.
- `-s, --split-size-gb <N>`: split threshold/chunk size in GB (default: 500).
- `-k, --no-cleanup`: keep scratch artifacts after a successful transfer.
Useful `retrieve` flags:

- `-f, --folder /path`: override the default local base (`/scratch/$USER/CCBR-<projectnumber>`).
- `-p, --projectnumber <value>`: project identifier.
- `-d, --datatype <Analysis|Rawdata>`: datatype (default: `Analysis`).
- `-n, --filenames a,b,c`: retrieve specific objects.
- `-u, --unsplit` (alias `--unspilt`): merge downloaded split tar parts.
`--projectnumber` normalization:

- Accepts any non-empty value.
- Repeated leading prefixes `ccbr`, `CCBR`, `Ccbr` are removed (each may be followed by `_`, `-`, or nothing).
- Examples: `CCBR-1234` -> `1234`; `CCBR-abcd` -> `abcd`; `ccbr_ccbr-1234abc` -> `1234abc`.
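The normalization rules above can be sketched as a small shell function. This is illustrative only: the function name is invented here, and the real logic lives inside `projark` and may differ in detail.

```shell
# Hypothetical sketch of --projectnumber normalization: repeatedly strip a
# leading ccbr/CCBR/Ccbr prefix, optionally followed by "_" or "-".
normalize_projectnumber() {
  value=$1
  while :; do
    case $value in
      ccbr[-_]*|CCBR[-_]*|Ccbr[-_]*) value=${value#?????} ;;  # prefix plus separator
      ccbr*|CCBR*|Ccbr*)             value=${value#????}  ;;  # bare prefix
      *) break ;;
    esac
  done
  printf '%s\n' "$value"
}

normalize_projectnumber "CCBR-1234"          # -> 1234
normalize_projectnumber "CCBR-abcd"          # -> abcd
normalize_projectnumber "ccbr_ccbr-1234abc"  # -> 1234abc
```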
Path behavior:

- `--folder FASTQ` and `--folder FASTQ/` are both accepted.
- Relative `--folder` values are resolved to absolute paths before use.
Runtime logging and notifications:

- Log lines are timestamped in ISO 8601 format (for example `2026-02-19T14:37:52-05:00`).
- Completion/failure email is sent to `$USER@nih.gov`.
- The notification sender is `NCICCBR@mail.nih.gov`.
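For reference, the same ISO 8601 timestamp shape (date, time, and numeric UTC offset) can be produced with GNU `date`; this is an assumption for illustration, not necessarily the exact call `projark` makes internally.

```shell
# Print the current time in the ISO 8601 format used by the logs,
# e.g. 2026-02-19T14:37:52-05:00 (%:z requires GNU date).
date +"%FT%T%:z"
```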
Setup requirements:

- On Helix or Biowulf you can get access to `parkit` by loading the appropriate conda environment: `mamba activate /vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3/envs/parkit`
- The `HPC_DME_APIs` package needs to be cloned and set up correctly. Follow this setup guide: https://ccbr.github.io/HowTos/docs/HPCDME/setup.html
- The `HPC_DM_UTILS` environment variable should be set before calling `parkit`. It also needs to be passed as an argument to the `parkit_folder2hpcdme` and `parkit_tarball2hpcdme` end-to-end workflows.
- The minimum required value of `HPC_DM_JAVA_VERSION` is `23` as of today.
- Run all operations from `tmux`, `screen`, or an Open OnDemand graphical session.
- Disclaimer: Open OnDemand is currently available only on Biowulf compute nodes, not directly on Helix. Since `projark` is Helix-only today, use `tmux`/`screen` on Helix; Open OnDemand support is future-facing until Helix access is available.
- If `mamba` is not already in your `PATH`, add the following block to your `~/.bashrc` or `~/.zshrc`, then run `mamba activate /vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3/envs/parkit`:
```bash
# >>> mamba initialize >>>
# !! Contents within this block are managed by 'mamba shell init' !!
export MAMBA_EXE='/vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3/bin/mamba';
export MAMBA_ROOT_PREFIX='/vf/users/CCBR_Pipeliner/db/PipeDB/miniforge3';
__mamba_setup="$("$MAMBA_EXE" shell hook --shell zsh --root-prefix "$MAMBA_ROOT_PREFIX" 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__mamba_setup"
else
    alias mamba="$MAMBA_EXE"  # Fallback on help from mamba activate
fi
unset __mamba_setup
# <<< mamba initialize <<<
```

(If you use `~/.bashrc`, pass `--shell bash` to the `shell hook` line instead of `--shell zsh`.)

`parkit --help` output:
```
usage: parkit [-h] {createtar,tarprep,checkapisync,syncapi,createmetadata,createemptycollection,deposittar} ...

parkit subcommands to park data in HPCDME

positional arguments:
  {createtar,tarprep,checkapisync,syncapi,createmetadata,createemptycollection,deposittar}
                        Subcommand to run
    createtar           create tarball(and its filelist) from a project folder.
    tarprep             prepare tarball for upload.
    checkapisync        check whether the HPC_DME_APIs repository is in sync with upstream
    syncapi             sync HPC_DME_APIs with upstream and generate a fresh token
    createmetadata      create the metadata.json file required for a tarball (and its filelist)
    createemptycollection
                        creates empty project and analysis collections
    deposittar          deposit tarball(and filelist) into vault

options:
  -h, --help            show this help message and exit
```
`parkit checkapisync` checks whether your local HPC_DME_APIs git repository is synchronized with its upstream remote.

Repo resolution order:

- the `HPC_DME_APIs` environment variable
- the parent directory of `HPC_DM_UTILS` (if it ends with `utils`)
- the fallback path `/data/kopardevn/SandBox/HPC_DME_APIs`

Optional override:

```bash
parkit checkapisync --repo /path/to/HPC_DME_APIs
```

Example:

```bash
parkit checkapisync
```

Output:
```
WARNING:HPC_DM_JAVA_VERSION is not set, setting it to 23.0.2
WARNING:If you need a different version, please set it explicitly in your environment.
HPC_DME_APIs=/data/kopardevn/GitRepos/HPC_DME_APIs
branch=master
local=08a93b22a74bb51455489b70bcc36ea528c74d23
remote=9d01b7b6686266707c9720285aecefd4f46323f3
base=dabfe0a5789ba89a1f17f1c0868eaad284f0b62a
OUT OF SYNC: local and upstream have diverged.
```
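The repo resolution order can be sketched as a shell function. This is an illustrative sketch only: the function name is invented, and the real resolution logic lives inside `parkit`.

```shell
# Hypothetical sketch of checkapisync's repo resolution order:
# 1. the HPC_DME_APIs environment variable, if set;
# 2. the parent directory of HPC_DM_UTILS, if that path ends with "utils";
# 3. otherwise the hard-coded fallback path.
resolve_hpc_dme_apis_repo() {
  if [ -n "$HPC_DME_APIs" ]; then
    printf '%s\n' "$HPC_DME_APIs"
  elif [ -n "$HPC_DM_UTILS" ] && [ "$(basename "$HPC_DM_UTILS")" = "utils" ]; then
    dirname "$HPC_DM_UTILS"
  else
    printf '%s\n' "/data/kopardevn/SandBox/HPC_DME_APIs"
  fi
}

resolve_hpc_dme_apis_repo   # prints the resolved repo path
```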
`parkit syncapi` syncs the local HPC_DME_APIs repository and generates a fresh API token. You can also use this command just to generate/refresh a token (even if git is already up to date).

What it does:

- resolves the `HPC_DME_APIs` repo path
- runs `git pull`
- runs `source <repo>/functions && dm_generate_token`
- prompts for password/token input interactively in your terminal
- prints a success message on completion

Example:

```bash
parkit syncapi
```

Output:

```
HPC_DME_APIs=/data/kopardevn/GitRepos/HPC_DME_APIs
... git pull output ...
... dm_generate_token prompts for password ...
Now you are in sync and ready to run other parkit commands.
```
Example failure and interpretation:

```bash
parkit syncapi
```

Output:

```
WARNING:HPC_DM_JAVA_VERSION is not set, setting it to 23.0.2
WARNING:If you need a different version, please set it explicitly in your environment.
HPC_DME_APIs=/data/kopardevn/GitRepos/HPC_DME_APIs
Merge made by the 'recursive' strategy.
 .github/workflows/build-deploy-api.yml | 8 ++++++--
 .github/workflows/build-dev-api.yml    | 4 +++-
 .github/workflows/deploy-dev-api.yml   | 4 +++-
 3 files changed, 12 insertions(+), 4 deletions(-)
Enter host password for user 'kopardevn':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6065    0  6065    0     0  49713      0 --:--:-- --:--:-- --:--:-- 49713
ERROR: No token found in /data/kopardevn/GitRepos/HPC_DME_APIs/utils/temp/log
ERROR MESSAGE: Access Denied: LDAP authentication failed
syncapi failed: dm_generate_token was not successful.
```
What this means:

- `git pull` succeeded (your local API repo updated).
- Token generation failed at authentication time (`LDAP authentication failed`).
- You are synced to the latest code, but not ready for data operations until token generation succeeds.
- Re-run `parkit syncapi` and enter valid credentials; if it still fails, inspect `HPC_DME_APIs/utils/temp/log`.
Say you want to archive the `/data/CCBR/projects/CCBR-12345` folder to the `/CCBR_Archive/GRIDFTP/Project_CCBR-12345` collection on HPC-DME. You can run the following commands sequentially to do this:
```bash
# create the tarball
parkit createtar --folder /data/CCBR/projects/ccbr_12345
# the above command creates the following files:
# - ccbr_12345.tar
# - ccbr_12345.tar.md5
# - ccbr_12345.tar.filelist
# - ccbr_12345.tar.filelist.md5

# create an empty collection on HPC-DME
parkit createemptycollection --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345 --projectdesc "testing" --projecttitle "test project 1"
# the above command creates these collections:
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Analysis
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Rawdata

# create the required metadata
parkit createmetadata --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is rawdata, add "--collectiontype Rawdata" to the above command line

# deposit the tar into HPC-DME
parkit deposittar --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is rawdata, add "--collectiontype Rawdata" to the above command line

# a number of extra files are created in the process
ls /data/CCBR/projects/ccbr_12345.tar*
```
```bash
# delete the recently parked project folder contents
# (note: the "*" glob does not match hidden dotfiles)
rm -rf /data/CCBR/projects/CCBR-12345/*
# copy the filelist into the now-empty project folder for future quick reference
cp /data/CCBR/projects/ccbr_12345.tar.filelist /data/CCBR/projects/CCBR-12345/ccbr_12345.tar.filelist
# delete files created by parkit
rm -f /data/CCBR/projects/ccbr_12345.tar*
# verify the archived results
dm_get_collection /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# Done!
```

Output from `ls /data/CCBR/projects/ccbr_12345.tar*`:

```
/data/CCBR/projects/ccbr_12345.tar           /data/CCBR/projects/ccbr_12345.tar.filelist.md5            /data/CCBR/projects/ccbr_12345.tar.md5
/data/CCBR/projects/ccbr_12345.tar.filelist  /data/CCBR/projects/ccbr_12345.tar.filelist.metadata.json  /data/CCBR/projects/ccbr_12345.tar.metadata.json
```
We also have end-to-end SLURM-supported folder-to-HPCDME and tarball-to-HPCDME workflows:

- `parkit_folder2hpcdme`
- `parkit_tarball2hpcdme`
- `projark` [recommended for archiving CCBR projects to the GRIDFTP folder under CCBR_Archive]

If run with `--executor slurm`, these interface with the job scheduler on Biowulf and submit the individual steps of the end-to-end workflows as interdependent jobs.
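Putting the flags together, a minimal end-to-end invocation might look like the sketch below. The paths, description, and title are illustrative examples only; the flags are taken from the `--help` output, and the command requires a working HPC_DME_APIs setup and a valid `HPC_DM_UTILS` value to actually run.

```shell
# Illustrative end-to-end run on Biowulf (example paths and metadata)
parkit_folder2hpcdme \
    --executor slurm \
    --folder /data/CCBR/projects/ccbr_12345 \
    --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345 \
    --projectdesc "testing" \
    --projecttitle "test project 1" \
    --cleanup \
    --hpcdmutilspath "$HPC_DM_UTILS"
```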
`parkit_folder2hpcdme --help` output:

```
usage: parkit_folder2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--folder FOLDER] [--dest DEST] [--projectdesc PROJECTDESC]
                            [--projecttitle PROJECTTITLE] [--rawdata] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH [--version]

End-to-end parkit: Folder 2 HPCDME

options:
  -h, --help            show this help message and exit
  --restartfrom RESTARTFROM
                        if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
  --executor EXECUTOR   slurm or local
  --folder FOLDER       project folder to archive
  --dest DEST           vault collection path (Analysis goes under here!)
  --projectdesc PROJECTDESC
                        project description
  --projecttitle PROJECTTITLE
                        project title
  --rawdata             If tarball is rawdata and needs to go under folder Rawdata
  --cleanup             post transfer step to delete local files
  --hpcdmutilspath HPCDMUTILSPATH
                        what should be the value of env var HPC_DM_UTILS
  --version             print version
```
`parkit_tarball2hpcdme --help` output:

```
usage: parkit_tarball2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--tarball TARBALL] [--dest DEST]
                             [--projectdesc PROJECTDESC] [--projecttitle PROJECTTITLE] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH
                             [--version]

End-to-end parkit: Tarball 2 HPCDME

options:
  -h, --help            show this help message and exit
  --restartfrom RESTARTFROM
                        if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
  --executor EXECUTOR   slurm or local
  --tarball TARBALL     project tarball to archive
  --dest DEST           vault collection path (Analysis goes under here!)
  --projectdesc PROJECTDESC
                        project description
  --projecttitle PROJECTTITLE
                        project title
  --cleanup             post transfer step to delete local files
  --hpcdmutilspath HPCDMUTILSPATH
                        what should be the value of env var HPC_DM_UTILS
  --version             print version
```