Merged
42 changes: 30 additions & 12 deletions .github/workflows/build_workflow.yml
@@ -19,21 +19,24 @@ jobs:
- name: Checkout Code Repository
uses: actions/checkout@v3

- - name: Set up Python 3.9
+ - name: Set up Python 3.13
uses: actions/setup-python@v4
with:
- python-version: 3.9
+ python-version: "3.13"

# Run all pre-commit hooks on all the files.
# Getting only staged files can be tricky in case a new PR is opened
# since the action is run on a branch in detached head state.
# This is the equivalent of running "pre-commit run --all-files" locally.
# If you commit with the `--no-verify` flag, this check may fail.
- name: Install and Run Pre-commit
uses: pre-commit/[email protected].0
uses: pre-commit/[email protected].1

build:
runs-on: ubuntu-latest
+ strategy:
+ matrix:
+ python-version: ["3.11", "3.12", "3.13"]
defaults:
run:
shell: bash -l {0}
@@ -44,11 +47,11 @@ jobs:
- name: Cache Conda
uses: actions/cache@v3
env:
- CACHE_NUMBER: 0
+ CACHE_NUMBER: 1 # Increment this to invalidate cache
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{
- hashFiles('conda/dev.yml') }}
+ hashFiles('conda/dev.yml') }}-python${{ matrix.python-version }}

- name: Build Conda Environment
uses: conda-incubator/setup-miniconda@v3
@@ -57,13 +60,27 @@
miniforge-variant: Miniforge3
miniforge-version: latest
environment-file: conda/dev.yml
- channel-priority: strict
+ channel-priority: flexible # Changed from strict to flexible
auto-update-conda: true
+ python-version: ${{ matrix.python-version }}
+ channels: conda-forge
+ use-only-tar-bz2: true

+ - name: Verify Environment and Fix Dependencies
+ run: |
+ conda info
+ conda list
+ # Ensure we have the right Python version
+ python --version
+ # Fix pip issues for Python 3.12+
+ if [[ "${{ matrix.python-version }}" == "3.12" ]] || [[ "${{ matrix.python-version }}" == "3.13" ]]; then
+ python -m ensurepip --upgrade || true
+ python -m pip install --upgrade --force-reinstall pip setuptools wheel
+ fi

- name: Install `zstash` Package
run: |
python -m pip install --upgrade pip
- pip install .
+ python -m pip install .

- name: Run Tests
run: |
@@ -77,7 +94,7 @@ jobs:
defaults:
run:
shell: bash -l {0}
- timeout-minutes: 5
+ timeout-minutes: 10 # Increased timeout for docs
steps:
- uses: actions/checkout@v3
with:
@@ -87,11 +104,11 @@ jobs:
- name: Cache Conda
uses: actions/cache@v3
env:
- CACHE_NUMBER: 0
+ CACHE_NUMBER: 1 # Match the build job cache number
with:
path: ~/conda_pkgs_dir
key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{
- hashFiles('conda/dev.yml') }}
+ hashFiles('conda/dev.yml') }}-docs

- name: Build Conda Environment
uses: conda-incubator/setup-miniconda@v3
@@ -100,8 +117,9 @@ jobs:
miniforge-variant: Miniforge3
miniforge-version: latest
environment-file: conda/dev.yml
- channel-priority: strict
+ channel-priority: flexible # Changed from strict to flexible
auto-update-conda: true
python-version: "3.13" # Use stable Python version for docs

# sphinx-multiversion allows for version docs.
- name: Build Sphinx Docs
8 changes: 4 additions & 4 deletions .pre-commit-config.yaml
@@ -12,27 +12,27 @@ repos:
exclude: conda/meta.yaml

- repo: https://github.com/psf/black
- rev: 24.10.0
+ rev: 25.1.0
hooks:
- id: black

- repo: https://github.com/PyCQA/isort
- rev: 5.13.2
+ rev: 6.0.1
hooks:
- id: isort

# Need to use flake8 GitHub mirror due to CentOS git issue with GitLab
# https://github.com/pre-commit/pre-commit/issues/1206
- repo: https://github.com/pycqa/flake8
- rev: 7.1.1
+ rev: 7.3.0
hooks:
- id: flake8
args: ["--config=setup.cfg"]
additional_dependencies: [flake8-isort]
exclude: analysis_data_preprocess

- repo: https://github.com/pre-commit/mirrors-mypy
- rev: v1.11.2
+ rev: v1.18.2
hooks:
- id: mypy
args: ["--config=setup.cfg", "--install-types", "--non-interactive"]
33 changes: 16 additions & 17 deletions conda/dev.yml
@@ -1,31 +1,30 @@
name: zstash_dev
channels:
- conda-forge
- - defaults
dependencies:
# Base
# =================
- - pip=22.2.2
- - python=3.9.13
- - six=1.16.0
- - globus-sdk=3.15.0
+ - pip
+ - python >=3.11,<3.14
+ - sqlite
Collaborator Author:

@tomvothecoder How do we determine the constraints (=, >=, etc.)?

Collaborator:

This is based on the conda-forge recipe and compatibility with other packages. If things break with new package versions, that suggests updating the constraints accordingly.

+ - six >=1.16.0
+ - globus-sdk >=3.15.0
# Developer Tools
# =================
# If versions are updated, also update 'rev' in `.pre-commit.config.yaml`
- - black=24.10.0
- - flake8=7.1.1
- - flake8-isort=6.1.1
- - mypy=1.11.2
- - pre-commit=4.0.1
- - tbump=6.9.0
+ - black ==25.1.0
+ - flake8 ==7.3.0
+ - isort ==6.0.1
+ - mypy ==1.18.2
+ - pre-commit ==4.3.0
+ - tbump >=6.9.0
# Documentation
# =================
# If versions are updated, also update in `.github/workflows/workflow.yml`
- - jinja2<3.1
- - sphinx=5.2.3
- - sphinx-multiversion=0.2.4
- - sphinx_rtd_theme=1.0.0
+ - jinja2 <3.1
+ - sphinx >=5.2.0
+ - sphinx-multiversion >=0.2.4
+ - sphinx_rtd_theme >=1.0.0
# Need to pin docutils because 0.17 has a bug with unordered lists
# https://github.com/readthedocs/sphinx_rtd_theme/issues/1115
- - docutils=0.16
- prefix: /opt/miniconda3/envs/zstash_dev
+ - docutils >=0.16,<0.17
35 changes: 0 additions & 35 deletions conda/meta.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion setup.cfg
@@ -46,7 +46,7 @@ exclude =
venv

[mypy]
- python_version = 3.9
+ python_version = 3.13
Collaborator Author:

How do we choose the specific version for this?

Collaborator:

Latest stable version of Python.

check_untyped_defs = True
ignore_missing_imports = True
warn_unused_ignores = True
2 changes: 1 addition & 1 deletion setup.py
@@ -7,6 +7,6 @@
author_email="[email protected], [email protected], [email protected]",
description="Long term HPSS archiving software for E3SM",
packages=find_packages(include=["zstash", "zstash.*"]),
python_requires=">=3.9",
python_requires=">=3.11,<3.14",
entry_points={"console_scripts": ["zstash=zstash.main:main"]},
)
4 changes: 2 additions & 2 deletions zstash/extract.py
@@ -591,7 +591,7 @@ def extractFiles( # noqa: C901
try:
# Seek file position
if tar.fileobj is not None:
- fileobj: _io.BufferedReader = tar.fileobj
+ fileobj = tar.fileobj
else:
raise TypeError("Invalid tar.fileobj={}".format(tar.fileobj))
fileobj.seek(files_row.offset)
@@ -665,7 +665,7 @@ def extractFiles( # noqa: C901
# relying here on 'touch'. This is not the prettiest solution.
# Maybe a better one can be implemented later.
if tarinfo.issym():
- tmp1: int = tarinfo.mtime
+ tmp1 = tarinfo.mtime
tmp2: datetime = datetime.fromtimestamp(tmp1)
tmp3: str = tmp2.strftime("%Y%m%d%H%M.%S")
os.system("touch -h -t %s %s" % (tmp3, tarinfo.name))
11 changes: 1 addition & 10 deletions zstash/globus.py
@@ -57,7 +57,6 @@ def globus_activate(hpss: str):


def file_exists(name: str) -> bool:
- global archive_directory_listing

for entry in archive_directory_listing:
if entry.get("name") == name:
@@ -72,9 +71,6 @@ def file_exists(name: str) -> bool:
def globus_transfer( # noqa: C901
remote_ep: str, remote_path: str, name: str, transfer_type: str, non_blocking: bool
):
- global transfer_client
- global local_endpoint
- global remote_endpoint
global transfer_data
global task_id
global archive_directory_listing
@@ -199,7 +195,6 @@ def globus_transfer( # noqa: C901
def globus_block_wait(
task_id: str, wait_timeout: int, polling_interval: int, max_retries: int
):
- global transfer_client

# poll every "polling_interval" seconds to speed up small transfers. Report every 2 hours, stop waiting after 5*2 = 10 hours
logger.info(
@@ -211,7 +206,7 @@
try:
# Wait for the task to complete
logger.info(
f"{ts_utc()}: on task_wait try {retry_count+1} out of {max_retries}"
f"{ts_utc()}: on task_wait try {retry_count + 1} out of {max_retries}"
)
transfer_client.task_wait(
task_id, timeout=wait_timeout, polling_interval=10
@@ -244,7 +239,6 @@ def globus_block_wait(


def globus_wait(task_id: str):
- global transfer_client

try:
"""
@@ -288,9 +282,6 @@ def globus_wait(task_id: str):


def globus_finalize(non_blocking: bool = False):
- global transfer_client
- global transfer_data
- global task_id
global global_variable_tarfiles_pushed

last_task_id = None
43 changes: 14 additions & 29 deletions zstash/hpss_utils.py
@@ -10,7 +10,6 @@
from typing import List, Optional, Tuple

import _hashlib
- import _io

from .hpss import hpss_put
from .settings import BLOCK_SIZE, TupleFilesRowNoId, TupleTarsRowNoId, config, logger
@@ -20,7 +19,7 @@
# Minimum output file object
class HashIO(object):
def __init__(self, name: str, mode: str, do_hash: bool):
- self.f: _io.BufferedWriter = open(name, mode)
+ self.f = open(name, mode)
self.hash: Optional[_hashlib.HASH]
if do_hash:
self.hash = hashlib.md5()
@@ -281,43 +280,29 @@ def add_file(
# Change the size of any hardlinks from 0 to the size of the actual file
if tarinfo.islnk():
tarinfo.size = os.path.getsize(file_name)

- # Add the file to the tar
- tar.addfile(tarinfo)
+ if (tarinfo.isfile() or tarinfo.islnk()) and tarinfo.size > 0:
+ with open(file_name, "rb") as fileobj:
+ tar.addfile(tarinfo, fileobj)
+ else:
+ tar.addfile(tarinfo)
Collaborator Author:

In the newly supported versions of Python, addfile writes to fileobj.
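For reference, a minimal standalone sketch (not zstash code) of the behavior relied on here: when addfile() is given a fileobj, tarfile reads exactly tarinfo.size bytes from it, pads the data out to a multiple of tarfile.BLOCKSIZE, and advances tar.offset internally, so no manual offset bookkeeping is needed.

import io
import tarfile

payload = b"hello world"
with tarfile.open("demo.tar", "w") as tar:
    tarinfo = tarfile.TarInfo(name="demo.txt")
    tarinfo.size = len(payload)
    # addfile() consumes tarinfo.size bytes from the fileobj and writes
    # the 512-byte header plus NUL-padded data blocks to the archive.
    tar.addfile(tarinfo, io.BytesIO(payload))
    # The internal offset lands on a block boundary automatically.
    assert tar.offset % tarfile.BLOCKSIZE == 0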


md5: Optional[str] = None
# Only add files or hardlinks.
# (So don't add directories or softlinks.)
if tarinfo.isfile() or tarinfo.islnk():
f: _io.TextIOWrapper = open(file_name, "rb")
f = open(file_name, "rb")
hash_md5: _hashlib.HASH = hashlib.md5()
- if tar.fileobj is not None:
- fileobj: _io.BufferedWriter = tar.fileobj
- else:
- raise TypeError("Invalid tar.fileobj={}".format(tar.fileobj))

- while True:
- s: str = f.read(BLOCK_SIZE)
- if len(s) > 0:
- # If the block read in is non-empty, write it to fileobj and update the hash
- fileobj.write(s)
- hash_md5.update(s)
- if len(s) < BLOCK_SIZE:
- # If the block read in is smaller than BLOCK_SIZE,
- # then we have reached the end of the file.
- # blocks = how many blocks of tarfile.BLOCKSIZE fit in tarinfo.size
- # remainder = how much more content is required to reach tarinfo.size
- blocks: int
- remainder: int
- blocks, remainder = divmod(tarinfo.size, tarfile.BLOCKSIZE)
- if remainder > 0:
- null_bytes: bytes = tarfile.NUL
- # Write null_bytes to get the last block to tarfile.BLOCKSIZE
- fileobj.write(null_bytes * (tarfile.BLOCKSIZE - remainder))
- blocks += 1
- # Increase the offset by the amount already saved to the tar
- tar.offset += blocks * tarfile.BLOCKSIZE
Comment on lines -304 to -317
Collaborator Author:

It appears we don't actually need to be tracking/updating tar.offset here.

Collaborator Author (@forsyth2, Oct 2, 2025):

After following addfile back through the git blame functionality, it appears to be from the very beginning of zstash: https://github.com/E3SM-Project/zstash/pull/2/files#diff-51246e53255db77c9edad496f074aa1bdbf8dbdc11f89a02040115c9ab4fa7f0 has the following.

# Add file to tar archive while computing its hash
# Return file offset (in tar archive), size and md5 hash
def addfile(tar, file):
    offset = tar.offset
    tarinfo = tar.gettarinfo(file)
    tar.addfile(tarinfo)
    if tarinfo.isfile():
        f = open(file, "rb")
        hash_md5 = hashlib.md5()
        while True:
            s = f.read(BLOCK_SIZE)
            if len(s) > 0:
                tar.fileobj.write(s)
                hash_md5.update(s)
            if len(s) < BLOCK_SIZE:
                blocks, remainder = divmod(tarinfo.size, tarfile.BLOCKSIZE)
                if remainder > 0:
                    tar.fileobj.write(tarfile.NUL *
                                      (tarfile.BLOCKSIZE - remainder))
                    blocks += 1
                tar.offset += blocks * tarfile.BLOCKSIZE
                break
        f.close()
        md5 = hash_md5.hexdigest()
    else:
        md5 = None
    size = tarinfo.size
    mtime = datetime.utcfromtimestamp(tarinfo.mtime)
    return offset, size, mtime, md5

So, I just want to make sure the new tar.addfile takes care of the extra stuff in this function.

Contributor:

My reading of the documentation is that yes, the tar file is updated already with the tar.add_file() call when a fileobj is passed:
https://docs.python.org/3/library/tarfile.html

The advantage of the original implementation was probably that each file got opened once only and was added to the archive at the same time as computing its md5 hash. The new code reopens the file to create the md5 hash but I think that's fine (and seemingly unavoidable with the new API).

Collaborator:

I'm sorry, but I cannot approve this change. The ability to stream data to tar files and compute hashes at the same time was a key functionality of zstash. Computation of md5 hashes is expensive, removing this functionality will have a significant performance impact.

Collaborator Author:

Claude's feedback:

If the old implementation breaks on Python 3.13 (likely due to changes in how tarfile handles internal file objects), then you're facing a compatibility vs. performance trade-off.

Options to Consider

1. Accept the performance hit (pragmatic choice)

  • The new implementation works on Python 3.13
  • Performance regression may be acceptable depending on:
    • Typical file sizes in your use case
    • Available I/O bandwidth (SSDs make double-reads less painful)
    • Whether this is a bottleneck in practice

2. Implement a hybrid solution (best of both worlds)

def add_file(tar, file_name, follow_symlinks):
    offset = tar.offset
    tarinfo = tar.gettarinfo(file_name)

    if tarinfo.islnk():
        tarinfo.size = os.path.getsize(file_name)

    md5 = None

    # For files/hardlinks, stream data while computing hash
    if (tarinfo.isfile() or tarinfo.islnk()) and tarinfo.size > 0:
        hash_md5 = hashlib.md5()

        # Create a wrapper that computes hash while data passes through
        class HashingFileWrapper:
            def __init__(self, fileobj, hasher):
                self.fileobj = fileobj
                self.hasher = hasher

            def read(self, size=-1):
                data = self.fileobj.read(size)
                if data:
                    self.hasher.update(data)
                return data

        with open(file_name, "rb") as f:
            wrapper = HashingFileWrapper(f, hash_md5)
            tar.addfile(tarinfo, wrapper)

        md5 = hash_md5.hexdigest()
    else:
        tar.addfile(tarinfo)

    size = tarinfo.size
    mtime = datetime.utcfromtimestamp(tarinfo.mtime)
    return offset, size, mtime, md5

This wrapper approach:

  • ✅ Uses the proper Python 3.13-compatible API
  • ✅ Maintains single-pass streaming
  • ✅ Computes hash during the tar write operation
  • ✅ Clean, maintainable code

3. Version-specific implementations

Use the old code for Python <3.13 and new code for 3.13+, but this adds maintenance burden.

My Recommendation

Try the hybrid solution first (Option 2). It should give you Python 3.13 compatibility while preserving the streaming performance that Reviewer 2 correctly identified as important. If tar.addfile() properly uses the file-like object's read() method, this should work perfectly.

If the wrapper approach doesn't work for some reason, then you'll need to accept the performance regression as the cost of Python 3.13 support, but at least you'll have tried to preserve the optimization.

Contributor:

@golaz, I will veto any solution that does not allow us to advance to the latest stable python (python 3.13), since this is something I have been fighting to have happen in Unified for years and it has finally become possible because of dropping CDAT.

It seems like zstash has not been getting the TLC it has needed over the years to remain compatible with modern python and that's pretty disappointing. I think @forsyth2 is spread too thin and no one else has stepped up. Goodness knows I'm spread too thin.

Hopefully, we can get things there for Unified, since testing should start on Monday. Hopefully, we can also find a longer term solution for zstash to have a "champion".

Collaborator (@chengzhuzhang, Oct 2, 2025):

Sorry I haven’t followed this thread too closely. Which Python version is causing the incompatibility with zstash? I think it’s a solid achievement that we could update from 3.10 (now end-of-life) to 3.11 (or higher, if our tools are ready) for Unified with all the work we put in to migrate from CDAT. In this case, it seems zstash isn’t ready yet, and we are time-constrained to update it with thorough testing.

Can we target Python 3.11 or 3.12 (depending on zstash readiness) for this upcoming unified? Both are still widely used, and 3.11 will continue receiving security updates until October 2027. Again, for a tool like E3SM-unified, I believe we should continue prioritizing maximum compatibility.

Contributor:

AAAAAAaargh!!

Okay, I think it would be important to find out if zstash works with python 3.12 without requiring these changes. If we only support python 3.11 in zstash and Unified, that's not good. That's the state we just got out of with dropping CDAT.

I also never want this to happen again with a future Unified. All our software must, must, must support the 3 latest stable versions of python at all times. This is just necessary for our ecosystem to function well. Otherwise, the neglected tools really drag down the tools that are trying to take advantage of more modern capabilities.

Collaborator Author:

I could get behind the concept of option 2.
> Hopefully, we can get things there for Unified, since testing should start on Monday.

I've implemented this in the latest commit: cc37237. It required an additional change to handle empty files. It now passes the GitHub Actions for 3.11, 3.12, and 3.13.

I believe this means this PR should be good to merge. @xylar @golaz Can you please re-confirm based on this latest commit?
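A sketch of the shape of that fix, for readers following along — illustration only, since the actual change lives in commit cc37237 and is not shown in this diff; the empty-file guard here is an assumed shape, combined with the option-2 HashingFileWrapper from above:

import hashlib
import os
import tarfile
from datetime import datetime


class HashingFileWrapper:
    # Pass-through reader: every block tarfile reads also updates the hash.
    def __init__(self, fileobj, hasher):
        self.fileobj = fileobj
        self.hasher = hasher

    def read(self, size=-1):
        data = self.fileobj.read(size)
        if data:
            self.hasher.update(data)
        return data


def add_file(tar: tarfile.TarFile, file_name: str):
    offset = tar.offset
    tarinfo = tar.gettarinfo(file_name)
    if tarinfo.islnk():
        # Hardlinks report size 0; record the real size of the target.
        tarinfo.size = os.path.getsize(file_name)

    md5 = None
    if tarinfo.isfile() or tarinfo.islnk():
        hash_md5 = hashlib.md5()
        if tarinfo.size > 0:
            # Single pass: the archive write and the md5 share one read.
            with open(file_name, "rb") as f:
                tar.addfile(tarinfo, HashingFileWrapper(f, hash_md5))
        else:
            # Empty files have no data blocks; write the header only.
            tar.addfile(tarinfo)
        md5 = hash_md5.hexdigest()
    else:
        tar.addfile(tarinfo)

    mtime = datetime.utcfromtimestamp(tarinfo.mtime)
    return offset, tarinfo.size, mtime, md5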

> zstash has not been getting the TLC it has needed over the years to remain compatible with modern python

> Hopefully, we can also find a longer term solution for zstash to have a "champion".

Yes, I agree with all this. After the Unified testing period, perhaps it would be useful to have a team discussion about this. The problem is certainly not a lack of ideas re: code maintainability (e.g., removing all those global variables or improving the testing).

> Which Python version is causing the incompatibility with zstash?

3.13

> find out if zstash works with python 3.12 and without requiring these changes.

It does. I only ran into errors when I started matrix-testing on 3.13.

> All our software must, must, must support the 3 latest stable versions of python at all times.

This is good to keep in mind, thanks.

Collaborator:

> Can we target Python 3.11 or 3.12 (depending on zstash readiness) for this upcoming unified? Both are still widely used, and 3.11 will continue receiving security updates until October 2027. Again, for a tool like E3SM-unified, I believe we should continue prioritizing maximum compatibility.

> I also never want this to happen again with a future Unified. All our software must, must, must support the 3 latest stable versions of python at all times. This is just necessary for our ecosystem to function well. Otherwise, the neglected tools really drag down the tools that are trying to take advantage of more modern capabilities.

I agree with @chengzhuzhang that maximizing compatibility is essential, and with @xylar that we must support the latest stable Python versions (currently 3.11, 3.12, 3.13). Falling behind with Python support (mainly due to CDAT) means we’re not truly maximizing compatibility and leaves us playing catch-up with the broader ecosystem.

Our sub-dependencies (e.g., numpy) already expect newer Python versions, which forces @xylar into hacks and workarounds to get things working. Major scientific packages generally follow this Python support spec: https://scientific-python.org/specs/spec-0000/. We might consider adopting this more formally as a guideline to ensure our packages are always ready moving forward.

data = f.read(BLOCK_SIZE)
if len(data) > 0:
    hash_md5.update(data)
if len(data) < BLOCK_SIZE:
    break
f.close()
md5 = hash_md5.hexdigest()
f.close()
size: int = tarinfo.size
mtime: datetime = datetime.utcfromtimestamp(tarinfo.mtime)
return offset, size, mtime, md5