42 changes: 30 additions & 12 deletions .github/workflows/build_workflow.yml
@@ -19,21 +19,24 @@ jobs:
       - name: Checkout Code Repository
         uses: actions/checkout@v3

-      - name: Set up Python 3.9
+      - name: Set up Python 3.13
         uses: actions/setup-python@v4
         with:
-          python-version: 3.9
+          python-version: "3.13"

       # Run all pre-commit hooks on all the files.
       # Getting only staged files can be tricky in case a new PR is opened
       # since the action is run on a branch in detached head state.
       # This is the equivalent of running "pre-commit run --all-files" locally.
       # If you commit with the `--no-verify` flag, this check may fail.
       - name: Install and Run Pre-commit
-        uses: pre-commit/action@v3.0.0
+        uses: pre-commit/action@v3.0.1

   build:
     runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.11", "3.12", "3.13"]
     defaults:
       run:
         shell: bash -l {0}
@@ -44,11 +47,11 @@ jobs:
       - name: Cache Conda
         uses: actions/cache@v3
         env:
-          CACHE_NUMBER: 0
+          CACHE_NUMBER: 1  # Increment this to invalidate cache
         with:
           path: ~/conda_pkgs_dir
           key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{
-            hashFiles('conda/dev.yml') }}
+            hashFiles('conda/dev.yml') }}-python${{ matrix.python-version }}

       - name: Build Conda Environment
         uses: conda-incubator/setup-miniconda@v3
@@ -57,13 +60,27 @@
           miniforge-variant: Miniforge3
           miniforge-version: latest
           environment-file: conda/dev.yml
-          channel-priority: strict
+          channel-priority: flexible  # Changed from strict to flexible
           auto-update-conda: true
+          python-version: ${{ matrix.python-version }}
+          channels: conda-forge
+          use-only-tar-bz2: true

+      - name: Verify Environment and Fix Dependencies
+        run: |
+          conda info
+          conda list
+          # Ensure we have the right Python version
+          python --version
+          # Fix pip issues for Python 3.12+
+          if [[ "${{ matrix.python-version }}" == "3.12" ]] || [[ "${{ matrix.python-version }}" == "3.13" ]]; then
+            python -m ensurepip --upgrade || true
+            python -m pip install --upgrade --force-reinstall pip setuptools wheel
+          fi
       - name: Install `zstash` Package
         run: |
           python -m pip install --upgrade pip
-          pip install .
+          python -m pip install .

       - name: Run Tests
         run: |
@@ -77,7 +94,7 @@
     defaults:
       run:
         shell: bash -l {0}
-    timeout-minutes: 5
+    timeout-minutes: 10  # Increased timeout for docs
     steps:
       - uses: actions/checkout@v3
         with:
@@ -87,11 +104,11 @@
       - name: Cache Conda
         uses: actions/cache@v3
         env:
-          CACHE_NUMBER: 0
+          CACHE_NUMBER: 1  # Match the build job cache number
         with:
           path: ~/conda_pkgs_dir
           key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{
-            hashFiles('conda/dev.yml') }}
+            hashFiles('conda/dev.yml') }}-docs

       - name: Build Conda Environment
         uses: conda-incubator/setup-miniconda@v3
@@ -100,8 +117,9 @@
           miniforge-variant: Miniforge3
           miniforge-version: latest
           environment-file: conda/dev.yml
-          channel-priority: strict
+          channel-priority: flexible  # Changed from strict to flexible
           auto-update-conda: true
+          python-version: "3.13"  # Use stable Python version for docs

       # sphinx-multiversion allows for version docs.
       - name: Build Sphinx Docs
33 changes: 16 additions & 17 deletions conda/dev.yml
@@ -1,31 +1,30 @@
 name: zstash_dev
 channels:
   - conda-forge
-  - defaults
 dependencies:
   # Base
   # =================
-  - pip=22.2.2
-  - python=3.9.13
-  - six=1.16.0
-  - globus-sdk=3.15.0
+  - pip
+  - python >=3.11,<3.14
+  - sqlite
Collaborator Author:

@tomvothecoder How do we determine the constraints (=, >=, etc.)?

Collaborator:

This is based on the conda-forge recipe and compatibility with other packages. If things break with new package versions, that suggests we should update the constraints accordingly.
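As an aside (illustrative only, with pins taken from this diff), these are the conda match-spec forms in play:

  - black=24.10.0         # exact pin; must match the 'rev' in .pre-commit-config.yaml
  - python >=3.11,<3.14   # bounded range; the supported Python window
  - mypy >=1.11.0         # floor only; any newer release is acceptable
  - jinja2 <3.1           # ceiling only; avoid known-incompatible releases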

+  - six >=1.16.0
+  - globus-sdk >=3.15.0
   # Developer Tools
   # =================
   # If versions are updated, also update 'rev' in `.pre-commit.config.yaml`
-  - black=24.10.0
-  - flake8=7.1.1
-  - flake8-isort=6.1.1
-  - mypy=1.11.2
-  - pre-commit=4.0.1
-  - tbump=6.9.0
+  - black >=24.0.0
+  - flake8 >=7.0.0
+  - flake8-isort >=6.0.0
+  - mypy >=1.11.0
Contributor:

@tomvothecoder, is it your philosophy that these need to match .pre-commit-config.yaml exactly?

@forsyth2, presumably .pre-commit-config.yaml should also be updated with its autoupdate feature.

Collaborator:

> @tomvothecoder, is it your philosophy that these need to match .pre-commit-config.yaml exactly?

Correct, these should align exactly with .pre-commit-config.yaml, with exact versions pinned as you mentioned before.
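For instance (a hypothetical pairing of excerpts from the two files), an exact conda pin would correspond to the hook revision like so:

# conda/dev.yml
  - black=24.10.0

# .pre-commit-config.yaml
  - repo: https://github.com/psf/black
    rev: 24.10.0
    hooks:
      - id: black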

+  - pre-commit >=4.0.0
+  - tbump >=6.9.0
   # Documentation
   # =================
   # If versions are updated, also update in `.github/workflows/workflow.yml`
-  - jinja2<3.1
-  - sphinx=5.2.3
-  - sphinx-multiversion=0.2.4
-  - sphinx_rtd_theme=1.0.0
+  - jinja2 <3.1
+  - sphinx >=5.2.0
+  - sphinx-multiversion >=0.2.4
+  - sphinx_rtd_theme >=1.0.0
   # Need to pin docutils because 0.17 has a bug with unordered lists
   # https://github.com/readthedocs/sphinx_rtd_theme/issues/1115
-  - docutils=0.16
-prefix: /opt/miniconda3/envs/zstash_dev
+  - docutils >=0.16,<0.17
10 changes: 5 additions & 5 deletions conda/meta.yaml
@@ -16,14 +16,14 @@ build:

 requirements:
   host:
-    - python >=3.9
+    - python >=3.11,<3.14
     - pip

   run:
-    - python >=3.9
-    - fair-research-login >=0.2.6,<0.3.0
-    - globus-sdk >=3.0.0,<4.0.0
-    - six
+    - python >=3.11,<3.14
+    - globus-sdk=3.15.0
+    - six=1.16.0
+    - sqlite
Collaborator:

@forsyth2 Did you mean to constrain exact versions of globus-sdk and six here?

On a side note, I don't think this file is used anymore. I'm pretty sure this file was used to host zstash on the e3sm channel, which we no longer do since adopting conda-forge.

I think you can just delete this file (meta.yaml) if that's the case.


 test:
   imports:
2 changes: 1 addition & 1 deletion setup.cfg
@@ -46,7 +46,7 @@ exclude =
     venv

 [mypy]
-python_version = 3.9
+python_version = 3.13
Collaborator Author:

How do we choose the specific version for this?

Collaborator:

Latest stable version of Python.

 check_untyped_defs = True
 ignore_missing_imports = True
 warn_unused_ignores = True
2 changes: 1 addition & 1 deletion setup.py
@@ -7,6 +7,6 @@
     author_email="[email protected], [email protected], [email protected]",
     description="Long term HPSS archiving software for E3SM",
     packages=find_packages(include=["zstash", "zstash.*"]),
-    python_requires=">=3.9",
+    python_requires=">=3.11,<3.14",
     entry_points={"console_scripts": ["zstash=zstash.main:main"]},
 )
2 changes: 1 addition & 1 deletion zstash/globus.py
@@ -211,7 +211,7 @@ def globus_block_wait(
         try:
             # Wait for the task to complete
             logger.info(
-                f"{ts_utc()}: on task_wait try {retry_count+1} out of {max_retries}"
+                f"{ts_utc()}: on task_wait try {retry_count + 1} out of {max_retries}"
             )
             transfer_client.task_wait(
                 task_id, timeout=wait_timeout, polling_interval=10
38 changes: 12 additions & 26 deletions zstash/hpss_utils.py
@@ -281,43 +281,29 @@ def add_file(
     # Change the size of any hardlinks from 0 to the size of the actual file
     if tarinfo.islnk():
         tarinfo.size = os.path.getsize(file_name)

     # Add the file to the tar
-    tar.addfile(tarinfo)
+    if (tarinfo.isfile() or tarinfo.islnk()) and tarinfo.size > 0:
+        with open(file_name, "rb") as fileobj:
+            tar.addfile(tarinfo, fileobj)
+    else:
+        tar.addfile(tarinfo)
Collaborator Author:

In the newly supported versions of Python, addfile() takes the file data from fileobj.
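For example (a minimal sketch, not code from this PR), on Python 3.13 a regular file member with non-zero size must have its data supplied through the fileobj argument:

import io
import tarfile

data = b"hello world"
tarinfo = tarfile.TarInfo(name="example.txt")
tarinfo.size = len(data)
with tarfile.open("demo.tar", "w") as tar:
    # addfile() reads tarinfo.size bytes from fileobj and writes them,
    # plus NUL padding, into the archive.
    tar.addfile(tarinfo, io.BytesIO(data))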


     md5: Optional[str] = None
     # Only add files or hardlinks.
     # (So don't add directories or softlinks.)
     if tarinfo.isfile() or tarinfo.islnk():
         f: _io.TextIOWrapper = open(file_name, "rb")
         hash_md5: _hashlib.HASH = hashlib.md5()
-        if tar.fileobj is not None:
-            fileobj: _io.BufferedWriter = tar.fileobj
-        else:
-            raise TypeError("Invalid tar.fileobj={}".format(tar.fileobj))
-
         while True:
-            s: str = f.read(BLOCK_SIZE)
-            if len(s) > 0:
-                # If the block read in is non-empty, write it to fileobj and update the hash
-                fileobj.write(s)
-                hash_md5.update(s)
-            if len(s) < BLOCK_SIZE:
-                # If the block read in is smaller than BLOCK_SIZE,
-                # then we have reached the end of the file.
-                # blocks = how many blocks of tarfile.BLOCKSIZE fit in tarinfo.size
-                # remainder = how much more content is required to reach tarinfo.size
-                blocks: int
-                remainder: int
-                blocks, remainder = divmod(tarinfo.size, tarfile.BLOCKSIZE)
-                if remainder > 0:
-                    null_bytes: bytes = tarfile.NUL
-                    # Write null_bytes to get the last block to tarfile.BLOCKSIZE
-                    fileobj.write(null_bytes * (tarfile.BLOCKSIZE - remainder))
-                    blocks += 1
-                # Increase the offset by the amount already saved to the tar
-                tar.offset += blocks * tarfile.BLOCKSIZE
Comment on lines -304 to -317

Collaborator Author:

It appears we don't actually need to be tracking/updating tar.offset here.

Collaborator Author (@forsyth2), Oct 2, 2025:

Following addfile back through the git blame, it appears to date from the very beginning of zstash: https://github.com/E3SM-Project/zstash/pull/2/files#diff-51246e53255db77c9edad496f074aa1bdbf8dbdc11f89a02040115c9ab4fa7f0 has the following.

# Add file to tar archive while computing its hash
# Return file offset (in tar archive), size and md5 hash
def addfile(tar, file):
    offset = tar.offset
    tarinfo = tar.gettarinfo(file)
    tar.addfile(tarinfo)
    if tarinfo.isfile():
        f = open(file, "rb")
        hash_md5 = hashlib.md5()
        while True:
            s = f.read(BLOCK_SIZE)
            if len(s) > 0:
                tar.fileobj.write(s)
                hash_md5.update(s)
            if len(s) < BLOCK_SIZE:
                blocks, remainder = divmod(tarinfo.size, tarfile.BLOCKSIZE)
                if remainder > 0:
                    tar.fileobj.write(tarfile.NUL *
                                      (tarfile.BLOCKSIZE - remainder))
                    blocks += 1
                tar.offset += blocks * tarfile.BLOCKSIZE
                break
        f.close()
        md5 = hash_md5.hexdigest()
    else:
        md5 = None
    size = tarinfo.size
    mtime = datetime.utcfromtimestamp(tarinfo.mtime)
    return offset, size, mtime, md5

So, I just want to make sure the new tar.addfile takes care of the extra bookkeeping in this function.

Contributor:

My reading of the documentation is that yes, the tar file is already updated by the tar.addfile() call when a fileobj is passed:
https://docs.python.org/3/library/tarfile.html

The advantage of the original implementation was probably that each file got opened only once and was added to the archive at the same time as its md5 hash was computed. The new code reopens the file to create the md5 hash, but I think that's fine (and seemingly unavoidable with the new API).
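One way to sanity-check this (an illustrative snippet, not part of the PR): addfile() already advances tar.offset by one header block plus the NUL-padded data blocks, which is exactly the bookkeeping the removed code did by hand.

import io
import tarfile

data = b"x" * 700  # deliberately not a multiple of tarfile.BLOCKSIZE (512)
tarinfo = tarfile.TarInfo(name="pad_check.bin")
tarinfo.size = len(data)
with tarfile.open("check.tar", "w") as tar:
    before = tar.offset
    tar.addfile(tarinfo, io.BytesIO(data))
    # 1 header block (512 bytes) + 700 data bytes padded up to 2 blocks (1024)
    assert tar.offset - before == 512 + 1024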

Collaborator:

I'm sorry, but I cannot approve this change. The ability to stream data to tar files and compute hashes at the same time was a key functionality of zstash. Computation of md5 hashes is expensive; removing this functionality will have a significant performance impact.

Collaborator Author:

Claude's feedback:

If the old implementation breaks on Python 3.13 (likely due to changes in how tarfile handles internal file objects), then you're facing a compatibility vs. performance trade-off.

Options to Consider

1. Accept the performance hit (pragmatic choice)

  • The new implementation works on Python 3.13
  • Performance regression may be acceptable depending on:
    • Typical file sizes in your use case
    • Available I/O bandwidth (SSDs make double-reads less painful)
    • Whether this is a bottleneck in practice

2. Implement a hybrid solution (best of both worlds)

def add_file(tar, file_name, follow_symlinks):
    offset = tar.offset
    tarinfo = tar.gettarinfo(file_name)
    
    if tarinfo.islnk():
        tarinfo.size = os.path.getsize(file_name)
    
    md5 = None
    
    # For files/hardlinks, stream data while computing hash
    if (tarinfo.isfile() or tarinfo.islnk()) and tarinfo.size > 0:
        hash_md5 = hashlib.md5()
        
        # Create a wrapper that computes hash while data passes through
        class HashingFileWrapper:
            def __init__(self, fileobj, hasher):
                self.fileobj = fileobj
                self.hasher = hasher
            
            def read(self, size=-1):
                data = self.fileobj.read(size)
                if data:
                    self.hasher.update(data)
                return data
        
        with open(file_name, "rb") as f:
            wrapper = HashingFileWrapper(f, hash_md5)
            tar.addfile(tarinfo, wrapper)
        
        md5 = hash_md5.hexdigest()
    else:
        tar.addfile(tarinfo)
    
    size = tarinfo.size
    mtime = datetime.utcfromtimestamp(tarinfo.mtime)
    return offset, size, mtime, md5

This wrapper approach:

  • ✅ Uses the proper Python 3.13-compatible API
  • ✅ Maintains single-pass streaming
  • ✅ Computes hash during the tar write operation
  • ✅ Clean, maintainable code

3. Version-specific implementations

Use the old code for Python <3.13 and new code for 3.13+, but this adds maintenance burden.

My Recommendation

Try the hybrid solution first (Option 2). It should give you Python 3.13 compatibility while preserving the streaming performance that Reviewer 2 correctly identified as important. If tar.addfile() properly uses the file-like object's read() method, this should work perfectly.

If the wrapper approach doesn't work for some reason, then you'll need to accept the performance regression as the cost of Python 3.13 support, but at least you'll have tried to preserve the optimization.

Contributor:

@golaz, I will veto any solution that does not allow us to advance to the latest stable Python (Python 3.13), since this is something I have been fighting to make happen in Unified for years, and it has finally become possible because of dropping CDAT.

It seems like zstash has not been getting the TLC it has needed over the years to remain compatible with modern Python, and that's pretty disappointing. I think @forsyth2 is spread too thin and no one else has stepped up. Goodness knows I'm spread too thin.

Hopefully, we can get things there for Unified, since testing should start on Monday. Hopefully, we can also find a longer-term solution for zstash to have a "champion".

Collaborator (@chengzhuzhang), Oct 2, 2025:

Sorry I haven't followed this thread too closely. Which Python version is causing the incompatibility with zstash? I think it's a solid achievement that we could update from 3.10 (now end-of-life) to 3.11 (or higher, if our tools are ready) for Unified, with all the work we put in to migrate from CDAT. In this case, it seems zstash isn't ready yet, and we are time-constrained to update it with thorough testing.

Can we target Python 3.11 or 3.12 (depending on zstash readiness) for this upcoming Unified? Both are still widely used, and 3.11 will continue receiving security updates until October 2027. Again, for a tool like E3SM-Unified, I believe we should continue prioritizing maximum compatibility.

Contributor:

AAAAAAaargh!!

Okay, I think it would be important to find out if zstash works with Python 3.12 without requiring these changes. If we only support Python 3.11 in zstash and Unified, that's not good. That's the state we just got out of by dropping CDAT.

I also never want this to happen again with a future Unified. All our software must, must, must support the 3 latest stable versions of Python at all times. This is just necessary for our ecosystem to function well. Otherwise, the neglected tools really drag down the tools that are trying to take advantage of more modern capabilities.

Collaborator Author:

I could get behind the concept of option 2.

> Hopefully, we can get things there for Unified, since testing should start on Monday.

I've implemented this in the latest commit: cc37237. It required an additional change to handle empty files. It now passes the GitHub Actions for 3.11, 3.12, and 3.13.

I believe this means this PR should be good to merge. @xylar @golaz Can you please re-confirm based on this latest commit?

> zstash has not been getting the TLC it has needed over the years to remain compatible with modern python
> Hopefully, we can also find a longer term solution for zstash to have a "champion".

Yes, I agree with all this. After the Unified testing period, perhaps it would be useful to have a team discussion about this. The problem is certainly not a lack of ideas re: code maintainability (e.g., removing all those global variables or improving the testing).

> Which Python version is causing the incompatibility with zstash?

3.13

> find out if zstash works with python 3.12 and without requiring these changes.

It does. I only ran into errors when I started matrix-testing on 3.13.

> All our software must, must, must support the 3 latest stable versions of python at all times.

This is good to keep in mind, thanks.
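For reference, a sketch of how the wrapper and the empty-file guard fit together (illustrative only, not the literal contents of cc37237):

import hashlib
import os


class HashingFileWrapper:
    """Feed data to tar.addfile() while updating a hash in the same pass."""

    def __init__(self, fileobj, hasher):
        self.fileobj = fileobj
        self.hasher = hasher

    def read(self, size=-1):
        data = self.fileobj.read(size)
        if data:
            self.hasher.update(data)
        return data


def add_file(tar, file_name):
    offset = tar.offset
    tarinfo = tar.gettarinfo(file_name)
    # Hardlinks report size 0; record the real size so the data is archived
    if tarinfo.islnk():
        tarinfo.size = os.path.getsize(file_name)
    md5 = None
    if (tarinfo.isfile() or tarinfo.islnk()) and tarinfo.size > 0:
        hash_md5 = hashlib.md5()
        with open(file_name, "rb") as f:
            # Single pass: tarfile pulls blocks through the wrapper,
            # which hashes them on their way into the archive.
            tar.addfile(tarinfo, HashingFileWrapper(f, hash_md5))
        md5 = hash_md5.hexdigest()
    else:
        # Empty files, directories, and symlinks carry no data payload,
        # so addfile() is called without a fileobj.
        tar.addfile(tarinfo)
    return offset, tarinfo.size, md5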

Collaborator:

> Can we target Python 3.11 or 3.12 (depending on zstash readiness) for this upcoming Unified? Both are still widely used, and 3.11 will continue receiving security updates until October 2027. Again, for a tool like E3SM-Unified, I believe we should continue prioritizing maximum compatibility.

> I also never want this to happen again with a future Unified. All our software must, must, must support the 3 latest stable versions of python at all times. This is just necessary for our ecosystem to function well. Otherwise, the neglected tools really drag down the tools that are trying to take advantage of more modern capabilities.

I agree with @chengzhuzhang that maximizing compatibility is essential, and with @xylar that we must support the latest stable Python versions (currently 3.11, 3.12, 3.13). Falling behind with Python support (mainly due to CDAT) means we're not truly maximizing compatibility and leaves us playing catch-up with the broader ecosystem.

Our sub-dependencies (e.g., numpy) already expect newer Python versions, which forces @xylar into hacks and workarounds to get things working. Major scientific packages generally follow this Python support spec: https://scientific-python.org/specs/spec-0000/. We might consider adopting this more formally as a guideline to ensure our packages are always ready moving forward.

+            data: bytes = f.read(BLOCK_SIZE)
+            if len(data) > 0:
+                hash_md5.update(data)
+            if len(data) < BLOCK_SIZE:
+                break
-        f.close()
         md5 = hash_md5.hexdigest()
+        f.close()
     size: int = tarinfo.size
     mtime: datetime = datetime.utcfromtimestamp(tarinfo.mtime)
     return offset, size, mtime, md5