
Conversation

@forsyth2
Collaborator

@forsyth2 forsyth2 commented Mar 7, 2025

Extremely early, experimental draft of what a zstash refactor would look like. Specifically, the refactor would store as much state as possible in an object rather than passing around many variables (especially global variables).
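A rough sketch of the intended shape (field and function names here are illustrative, not the final API):

from dataclasses import dataclass
from typing import Optional

@dataclass
class CommandInfo:
    # Illustrative: one object carries what used to be globals, so any
    # function can receive (and snapshot) the full state of a command.
    cache_dir: Optional[str] = None
    keep: bool = False
    hpss: Optional[str] = None

def setup_command(args) -> CommandInfo:
    # Hypothetical setup: populate state once, then pass the object around
    # instead of mutating module-level globals.
    return CommandInfo(cache_dir=args.cache, keep=args.keep, hpss=args.hpss)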

@forsyth2 forsyth2 self-assigned this Mar 7, 2025
@forsyth2
Collaborator Author

forsyth2 commented Mar 7, 2025

Advantages:

  • State could be easily "snapshotted". In particular, it would be easier to track the state of Globus transfers (as in Non block testing fix #363).
  • Easier handling of HPSS (e.g., the HPSSType Enum of this PR allows us to immediately see which HPSS scenario we're in rather than repeatedly parsing the HPSS path). See the sketch after the Obstacles list below.

Obstacles:

  • Unclear at the moment what the Config class is for, and whether the CommandInfo of this PR would be better off combined with it.
  • Extensive changes to parameter passing.
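A minimal sketch of the HPSSType idea (the GLOBUS member shows up later in this PR's debug output; the other member names here are illustrative):

from enum import Enum
from urllib.parse import urlparse

class HPSSType(Enum):
    NO_HPSS = "none"  # --hpss=none: no archive, local cache only
    SAME_MACHINE_HPSS = "hsi"  # plain hsi-accessible HPSS path
    GLOBUS = "globus"  # globus:// URL

def classify_hpss(hpss: str) -> HPSSType:
    # Parse the HPSS path once, up front, instead of re-parsing it
    # at every call site.
    if hpss.lower() == "none":
        return HPSSType.NO_HPSS
    if urlparse(hpss).scheme == "globus":
        return HPSSType.GLOBUS
    return HPSSType.SAME_MACHINE_HPSS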

@forsyth2 forsyth2 mentioned this pull request Mar 7, 2025
@forsyth2
Collaborator Author

Other suggestions, from comments on #363:

  • Make unit tests actually be unit tests. Can we test individual Python functions in any way, when so much of zstash relies on writing to a directory rather than computing output?

Comments from @TonyB9000:

I would like to refactor/merge both "globus_wait()" and "globus_block_wait()"
I would like (eventually) to have (input) path be an added (optional) parameter for "update"

@forsyth2
Collaborator Author

forsyth2 commented Apr 1, 2025

I'm going through the commands systematically, moving state data into a CommandInfo object; currently on create.

This removes the need for global variables and makes it easier to get a "snapshot" of the parameters at any given point.

@TonyB9000
Collaborator

@forsyth2 Excellent. This will certainly help keep things straight.

@TonyB9000
Collaborator

@forsyth2 (Aside: Why are #349 and #370 not listed under the "issues" tab? That is where I was looking ...)

@forsyth2
Collaborator Author

forsyth2 commented Apr 1, 2025

@TonyB9000 They started as prototyping attempts, meaning I already had some code. So, they went under pull requests instead of issues.

@forsyth2
Collaborator Author

forsyth2 commented Apr 9, 2025

The first commit (bd08c60) is my refactor to use objects. The second commit (b04221a) is my debugging of the double-authentication issue.

Using these commits, if I run the following on Perlmutter:

mkdir zstash_globus_setup
cd zstash_globus_setup
mkdir zstash_demo; echo 'file0 stuff' > zstash_demo/file0.txt
rm ~/.globus-native-apps.cfg
# Check I'm logged into "NERSC Perlmutter" and "Globus Tutorial Collection 1" on globus.org
zstash create --verbose --hpss=globus://6c54cade-bde5-45c1-bdea-f4bd71dba2cc/~/manual_run zstash_demo

then I get:

DEBUG: local endpoint=NERSC Perlmutter
DEBUG: remote endpoint=Globus Tutorial Collection 1
DEBUG: globus_activate. Calling login, which may print 'Please Paste your Auth Code Below:'
INFO: NoSavedTokens: No tokens were loaded
INFO: Starting Native App Grant Flow
...
Please Paste your Auth Code Below:
...
...
...
DEBUG: local endpoint=NERSC Perlmutter
DEBUG: remote endpoint=Globus Tutorial Collection 1
DEBUG: submit_transfer_with_checks. Calling login, which may print 'Please Paste your Auth Code Below:'
INFO: ScopesMismatch: Requested scopes not found: {'*https://auth.globus.org/scopes/6c54cade-bde5-45c1-bdea-f4bd71dba2cc/data_access', '*https://auth.globus.org/scopes/6bdc7956-fc0f-4ad2-989c-7aa5ee643a79/data_access', ']', 'urn:globus:auth:scope:transfer.api.globus.org:all['}. A login is required.
...
Please Paste your Auth Code Below:

@forsyth2
Collaborator Author

forsyth2 commented Apr 9, 2025

To remove consents, to start fresh:

https://app.globus.org/settings/consents
Manage Your Consents
Globus Endpoint Performance Monitoring
rescind all

zstash/globus.py Outdated
Comment on lines 88 to 92
scopes = "urn:globus:auth:scope:transfer.api.globus.org:all["
for ep_id in [remote_endpoint, local_endpoint]:
if check_endpoint_version_5(ep_id):
for ep_id in [globus_info.remote_endpoint, globus_info.local_endpoint]:
if check_endpoint_version_5(globus_info, ep_id):
scopes += f" *https://auth.globus.org/scopes/{ep_id}/data_access"
scopes += " ]"
Collaborator Author

We can't do this scope expansion until after transfer_client gets set. It appears the two authentications are 1) initiating the native client and 2) adding the scopes of the local and remote endpoints to it.

Collaborator

@forsyth2 Although I'd never looked very hard, I had assumed the first involved Globus recognizing the account as a valid Globus user, and then conducting ep-authentication/scoping. Surprised there are not 3 of these (globus, ep-1, ep-2).

I have gotten different results (even in the globus web UI) when authenticating to endpoints in a different order.

Comment on lines 169 to 139
 native_client = NativeClient(
-    client_id="6c1629cf-446c-49e7-af95-323c6412397f",
+    client_id=ZSTASH_CLIENT_ID,
     app_name="Zstash",
     default_scopes="openid urn:globus:auth:scope:transfer.api.globus.org:all",
 )
 log_current_endpoints(globus_info)
 logger.debug(
     "globus_activate. Calling login, which may print 'Please Paste your Auth Code Below:'"
 )
 native_client.login(no_local_server=True, refresh_tokens=True)
 transfer_authorizer = native_client.get_authorizers().get("transfer.api.globus.org")
 transfer_client = TransferClient(authorizer=transfer_authorizer)
Collaborator Author

#349 has an initial implementation of how this code block might be replaced.

@forsyth2
Collaborator Author

forsyth2 commented Apr 9, 2025

Reviewing the list of steps in #339:

  1. Login to Globus web interface and activate endpoints -- I don't think there's going to be any way around this. People will need to be logged into activated endpoints.
  2. Delete existing globus cfg file -- it seems feasible we could just have zstash auto-delete ~/.globus-native-apps.cfg at the beginning of each run.
  3. Start interactive zstash test transfer [...] will ask to copy and paste authorization code twice. zstash runs but files are not transferred. -- these are the code blocks identified above: Refactor zstash #370 (comment), Refactor zstash #370 (comment). In theory, these can be improved.
  4. Start a second interactive zstash transfer [...] This one should complete without issue or any prompt. -- this appears to just be a check that zstash is working before running a longer transfer.
  5. Start long transfer [...] Note that this transfer is limited to 48 hours due to Globus token expiration. -- Is this something we change in zstash, or do we need to talk to the people managing the endpoints?

Per #339, in the past, steps 2-4 weren't required at all.

@forsyth2
Collaborator Author

forsyth2 commented Apr 9, 2025

@TonyB9000 It looks like I have a working refactor; the first commit, bd08c60, passes the unit tests:

python -m unittest tests/test_*.py
python -m unittest tests2/test_*.py

In the second commit, b04221a, I try to debug the authentication issue.

My current plan is to merge the refactor and the globus fixes as separate PRs (in any case, certainly as separate commits) -- but the refactor is needed as a new baseline for the globus fixes.

We should meet to discuss the globus fixes further, but in the meantime you can take a look at that second commit and the comments I've made here.

@TonyB9000
Collaborator

@forsyth2 Regarding "Reviewing the list of steps in #339:". If this was a "once-per-month" thing, it would be tolerable for automation. I can understand different sites not "trusting one another" in terms of authentication (like, "I see you qualified as USER-at-site-1, so I will accept you as USER-at-site-2"), so as not to propagate a compromised account. (I never tried this, but in Globus Web, I suppose you could cycle through 5 different collections hosted at 5 different sites, satisfy the authentication at each one, and thereafter move from one collection to another at will.) But the authentications need to last longer.

The obstacle to automation, as I see it, is that you need to authenticate to at least 3 parties (globus, party-1, party2) and you get knocked out by whoever has the shortest expiration.

I agree with auto-deleting the globus config file. I would prefer the user never need to know such a file exists.

As far as zstash is concerned, I thought the first exercised transfer was to fetch (or remote-create) the index.db file, which can occur very fast (isn't that what "zstash ls" performs?).

We can meet at your convenience. I am rebuilding my conda environment for my dsm-testing. Major rewrite of the slurm/srun job-launching system to avoid hangs. (I did confirm that there are random hangs of jobs in a sequence, and would love to know why...)

zstash/globus.py Outdated
"The {} endpoint is not activated or the current activation expires soon. Please go to https://app.globus.org/file-manager/collections/{} and (re)activate the endpoint.".format(
ep_id, ep_id
)
f"The {ep_id} endpoint is not activated or the current activation expires soon. Please go to https://app.globus.org/file-manager/collections/{ep_id} and (re)activate the endpoint."
Collaborator Author

@forsyth2 forsyth2 Apr 9, 2025

Maybe we can infer what the endpoints are going to be later on in the code, based on the current machine and the hpss path (using Mache maybe?).

Then, we could check endpoint activation earlier, but even then, looking at this code block, we'd need transfer_client set as well.

Collaborator Author

we'd need transfer_client set as well.

It turns out we don't! check_endpoint_version_5 needs it to be set, but we actually don't need to call that function if the consents are working fine in the first place.

zstash/globus.py Outdated
"submit_transfer_with_checks. Calling login, which may print 'Please Paste your Auth Code Below:'"
)
native_client.login(requested_scopes=scopes)
# Quit here and tell user to re-try
Collaborator Author

This "Consents added, please re-run the previous command to start transfer" is a real obstacle to users. It basically means they have to run a toy version of zstash first, as specified in #329 and #339. We need to at least try to immediately alert the user that consents aren't set up rather than waiting until we try to transfer data.

Collaborator

@TonyB9000 TonyB9000 Apr 9, 2025

Agreed. There should be a globus function specific to this issue.

def has_consent(endpoint, scopes):
    . . .
    return True  # or False

Collaborator

@forsyth2 Without doing "damage", is there a way we could code a module (or function) that emulates "has_consent()" above, going as far as obtaining a transfer client (for a "toy" transfer, like "ls") and hides the details from the user?

(I don't really understand what is gained by registering for a "client_id". Is this intended to shorten the authentication steps?) If I write an application that has no registered client_id, can I still use the globus_sdk to code up functionality?
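A minimal sketch of such a probe, assuming a transfer_client is already in hand (has_consent is hypothetical; operation_ls plays the role of the cheap "toy" operation):

from globus_sdk import TransferAPIError, TransferClient

def has_consent(transfer_client: TransferClient, endpoint_id: str) -> bool:
    # Probe the endpoint with a cheap "ls"; a consent_required error
    # means the data_access consent is missing.
    try:
        transfer_client.operation_ls(endpoint_id, path="/~/")
        return True
    except TransferAPIError as err:
        if err.info.consent_required:
            return False
        raise  # some other failure (auth, network, ...)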

Collaborator Author

Yeah, I'm trying to do some prototyping of what that would look like.

Collaborator Author

@forsyth2 forsyth2 left a comment

@TonyB9000 I added another commit, 96677fa, that mostly pulls out some code into helper functions.

Comment on lines 71 to 83
def set_clients(globus_info: GlobusInfo):
    native_client = NativeClient(
        client_id=ZSTASH_CLIENT_ID,
        app_name="Zstash",
        default_scopes="openid urn:globus:auth:scope:transfer.api.globus.org:all",
    )
    log_current_endpoints(globus_info)
    logger.debug(
        "set_clients. Calling login, which may print 'Please Paste your Auth Code Below:'"
    )
    native_client.login(no_local_server=True, refresh_tokens=True)
    transfer_authorizer = native_client.get_authorizers().get("transfer.api.globus.org")
    globus_info.transfer_client = TransferClient(authorizer=transfer_authorizer)
Collaborator Author

I initially tried to swap out this logic with that in #349, but it didn't seem to change terribly much.

It actually also complicated things, because it uses a different object type than the native_client instantiation that's now found in check_consents().

zstash/globus.py Outdated
Comment on lines 310 to 313
set_clients(globus_info)
# Causes globus_sdk.services.auth.errors.AuthAPIError:
# ('POST', 'https://auth.globus.org/v2/oauth2/token', None, 400, 'Error', 'invalid_grant')
# check_consents(globus_info)
Collaborator Author

Trying to check consents as early as possible (i.e., as soon as we have a transfer_client) just results in an invalid_grant error, so that's no good.

Collaborator

@forsyth2 At some point, we should rope in some of those helpful globus service folk, and ask why we cannot obtain this information independent of issuing a transfer. This seems like a "useful" feature to me.

zstash/utils.py Outdated
Comment on lines 95 to 100
globus_cfg: str = os.path.expanduser("~/.globus-native-apps.cfg")
logger.info(f"Checking if {globus_cfg} exists")
if os.path.exists(globus_cfg):
    logger.info(f"Removing {globus_cfg}")
    # Otherwise, may cause "Token is not active" TransferAPIError
    os.remove(globus_cfg)
Collaborator Author

I was able to auto-remove that cfg that was causing problems, so that at least removes step 2 ("Delete existing globus cfg file") from the process.

@forsyth2
Collaborator Author

Hmm, the test_globus.py unit test is hanging with the latest commit. The hang appears to happen because the process that the Python test creates is waiting for user input (the auth codes), but nothing from the process gets printed to the terminal window.

It looks like test_globus.py duplicates a lot of code in globus.py, which is a little confusing. That's going to take some debugging.
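For reference, the hang is roughly this pattern (a minimal sketch, assuming the test shells out via subprocess; the actual test code differs):

import subprocess

# The child's "Please Paste your Auth Code Below:" prompt goes into the
# capture pipe instead of the terminal, so the user never sees it --
# meanwhile the child blocks waiting for the auth code on stdin.
proc = subprocess.run(
    [
        "zstash", "create", "--cache=zstash",
        "--hpss=globus://6c54cade-bde5-45c1-bdea-f4bd71dba2cc/~/zstash_test/",
        "zstash_test",
    ],
    capture_output=True,  # hides the prompt
    text=True,
)
print(proc.stdout)  # never reached while the child waits for input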

@forsyth2
Collaborator Author

Even if I run the exact command the test is running (zstash create --cache=zstash --hpss=globus://6c54cade-bde5-45c1-bdea-f4bd71dba2cc/~/zstash_test/ zstash_test), beforehand and authenticate twice, the test still hangs.

@TonyB9000
Collaborator

@forsyth2 I've never looked at "test_globus.py", but (in principle) a test-module should be a driver that imports and exercises the functions of the module being tested, and not "re-code" the same functions (otherwise you are only testing the test-module). (That may be easier said than done, of course.)

@forsyth2
Collaborator Author

not "re-code" the same functions

@TonyB9000 Yeah, I'm not exactly sure why it was done this way; I would guess to make sure Globus is set up / torn down correctly.

Overall, zstash testing is pretty annoying because basically all of the functionality is in bash, but the unit tests wrap all that in Python. That can cause problems, e.g. from my earlier comment:

"The hang appears to happen because the process that the Python test creates is waiting for user input (the auth codes), but nothing for the process gets printed to the terminal window."

I'm almost wondering if the best way to test zstash is to just write an assert function in bash and run a variety of bash scripts representing different zstash workflows.

@forsyth2
Collaborator Author

forsyth2 commented Apr 10, 2025

all of the functionality is in bash

By that I mean, while zstash is coded in Python, it is called from bash with various command line options and often in sequence with other zstash calls.

run a variety of bash scripts representing different zstash workflows.

Perhaps these would be integration tests (#369) and true unit tests would remain in Python. The issue with true unit tests is that so much of zstash touches disk, database, or HPSS, so there is very little that is truly I/O-independent.

@TonyB9000
Collaborator

@forsyth2 Could you name the folder and script for me? I have the branch "refactor-zstash" but don't know where to look. I'll borrow from it when I get back this afternoon.

@forsyth2
Collaborator Author

@TonyB9000 I simplified the script further, see 40a3d58 (now fewer than 200 lines). It's examples/simple_globus.py. I can also copy it here:

Minimal Example Script
import configparser
import os
import re
import shutil
from typing import Optional
from urllib.parse import ParseResult, urlparse

from fair_research_login.client import NativeClient
from globus_sdk import TransferAPIError, TransferClient, TransferData
from globus_sdk.response import GlobusHTTPResponse

# Minimal example of how Globus is used in zstash
# 1. Log into endpoints at globus.org
# 2. To start fresh, with no consents:
# https://app.globus.org/settings/consents > Manage Your Consents > Globus Endpoint Performance Monitoring > rescind all

HSI_DIR = "zstash_debugging_20250415_v2"

# Globus-specific settings ####################################################
GLOBUS_CFG: str = os.path.expanduser("~/.globus-native-apps.cfg")
INI_PATH: str = os.path.expanduser("~/.zstash.ini")
ZSTASH_CLIENT_ID: str = "6c1629cf-446c-49e7-af95-323c6412397f"
NAME_TO_ENDPOINT_MAP = {
    # "Globus Tutorial Collection 1": "6c54cade-bde5-45c1-bdea-f4bd71dba2cc",  # The Unit test endpoint
    "NERSC HPSS": "9cd89cfd-6d04-11e5-ba46-22000b92c6ec",
    "NERSC Perlmutter": "6bdc7956-fc0f-4ad2-989c-7aa5ee643a79",
}


# Functions ###################################################################
def main():
    base_dir = os.getcwd()
    print(f"Starting in {base_dir}")
    if os.path.exists(INI_PATH):
        os.remove(INI_PATH)
    if os.path.exists(GLOBUS_CFG):
        os.remove(GLOBUS_CFG)
    try:
        simple_transfer("toy_run")
    except RuntimeError:
        print("Now that we have the authentications, let's re-run.")
    # /global/homes/f/forsyth/.globus-native-apps.cfg does not exist. zstash will need to prompt for authentications twice, and then you will need to re-run.
    #
    # Might ask for 1st authentication prompt:
    # Please paste the following URL in a browser:
    # Authenticated for the 1st time!
    #
    # Might ask for 2nd authentication prompt:
    # Please paste the following URL in a browser:
    # Authenticated for the 2nd time!
    # Consents added, please re-run the previous command to start transfer
    # Now that we have the authentications, let's re-run.
    os.chdir(base_dir)
    print(f"Now in {os.getcwd()}")
    assert os.path.exists(INI_PATH)
    assert os.path.exists(GLOBUS_CFG)
    simple_transfer("real_run")
    # /global/homes/f/forsyth/.globus-native-apps.cfg exists. If this file does not have the proper settings, it may cause a TransferAPIError (e.g., 'Token is not active', 'No credentials supplied')
    #
    # Might ask for 1st authentication prompt:
    # Authenticated for the 1st time!
    #
    # Bypassed 2nd authentication.
    #
    # Wait for task to complete, wait_timeout=300
    print(f"To see transferred files, run: hsi ls {HSI_DIR}")
    # To see transferred files, run: hsi ls zstash_debugging_20250415_v2
    # Shows file0.txt


def simple_transfer(run_dir: str):
    hpss_path = f"globus://{NAME_TO_ENDPOINT_MAP['NERSC HPSS']}/~/{HSI_DIR}"
    if os.path.exists(run_dir):
        shutil.rmtree(run_dir)
    os.mkdir(run_dir)
    os.chdir(run_dir)
    print(f"Now in {os.getcwd()}")
    dir_to_archive: str = "dir_to_archive"
    txt_file: str = "file0.txt"
    os.mkdir(dir_to_archive)
    with open(f"{dir_to_archive}/{txt_file}", "w") as f:
        f.write("file contents")
    url: ParseResult = urlparse(hpss_path)
    assert url.scheme == "globus"
    if os.path.exists(GLOBUS_CFG):
        print(
            f"{GLOBUS_CFG} exists. If this file does not have the proper settings, it may cause a TransferAPIError (e.g., 'Token is not active', 'No credentials supplied')"
        )
    else:
        print(
            f"{GLOBUS_CFG} does not exist. zstash will need to prompt for authentications twice, and then you will need to re-run."
        )
    config_path: str = os.path.abspath(dir_to_archive)
    assert os.path.isdir(config_path)
    remote_endpoint: str = url.netloc
    # Simulate globus_activate > set_local_endpoint
    ini = configparser.ConfigParser()
    local_endpoint: Optional[str] = None
    if ini.read(INI_PATH):
        if "local" in ini.sections():
            local_endpoint = ini["local"].get("globus_endpoint_uuid")
    else:
        ini["local"] = {"globus_endpoint_uuid": ""}
        with open(INI_PATH, "w") as f:
            ini.write(f)
    if not local_endpoint:
        nersc_hostname = os.environ.get("NERSC_HOST")
        assert nersc_hostname == "perlmutter"
        local_endpoint = NAME_TO_ENDPOINT_MAP["NERSC Perlmutter"]
    native_client = NativeClient(
        client_id=ZSTASH_CLIENT_ID,
        app_name="Zstash",
        default_scopes="openid urn:globus:auth:scope:transfer.api.globus.org:all",
    )
    # May print 'Please Paste your Auth Code Below:'
    # This is the 1st authentication prompt!
    print("Might ask for 1st authentication prompt:")
    native_client.login(no_local_server=True, refresh_tokens=True)
    print("Authenticated for the 1st time!")
    transfer_authorizer = native_client.get_authorizers().get("transfer.api.globus.org")
    transfer_client: TransferClient = TransferClient(authorizer=transfer_authorizer)
    for ep_id in [
        local_endpoint,
        remote_endpoint,
    ]:
        r = transfer_client.endpoint_autoactivate(ep_id, if_expires_in=600)
        assert r.get("code") != "AutoActivationFailed"
    os.chdir(config_path)
    print(f"Now in {os.getcwd()}")
    url_path: str = str(url.path)
    assert local_endpoint is not None
    src_path: str = os.path.join(os.getcwd(), txt_file)
    dst_path: str = os.path.join(url_path, txt_file)
    subdir = os.path.basename(os.path.normpath(url_path))
    subdir_label = re.sub("[^A-Za-z0-9_ -]", "", subdir)
    filename = txt_file.split(".")[0]
    label = subdir_label + " " + filename
    transfer_data: TransferData = TransferData(
        transfer_client,
        local_endpoint,  # src_ep
        remote_endpoint,  # dst_ep
        label=label,
        verify_checksum=True,
        preserve_timestamp=True,
        fail_on_quota_errors=True,
    )
    transfer_data.add_item(src_path, dst_path)
    transfer_data["label"] = label
    task: GlobusHTTPResponse
    try:
        task = transfer_client.submit_transfer(transfer_data)
        print("Bypassed 2nd authentication.")
    except TransferAPIError as err:
        if err.info.consent_required:
            scopes = "urn:globus:auth:scope:transfer.api.globus.org:all["
            for ep_id in [remote_endpoint, local_endpoint]:
                scopes += f" *https://auth.globus.org/scopes/{ep_id}/data_access"
            scopes += " ]"
            native_client = NativeClient(client_id=ZSTASH_CLIENT_ID, app_name="Zstash")
            # May print 'Please Paste your Auth Code Below:'
            # This is the 2nd authentication prompt!
            print("Might ask for 2nd authentication prompt:")
            native_client.login(requested_scopes=scopes)
            print("Authenticated for the 2nd time!")
            print(
                "Consents added, please re-run the previous command to start transfer"
            )
            raise RuntimeError("Re-run now that authentications are set up!")
        else:
            if err.info.authorization_parameters:
                print("Error is in authorization parameters")
            raise err
    task_id = task.get("task_id")
    wait_timeout = 300  # 300 sec = 5 min
    print(f"Wait for task to complete, wait_timeout={wait_timeout}")
    transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=10)
    curr_task: GlobusHTTPResponse = transfer_client.get_task(task_id)
    task_status = curr_task["status"]
    assert task_status == "SUCCEEDED"


# Run #########################################################################
if __name__ == "__main__":
    main()
How to get latest code locally
git status
# Make sure there are no changes that could be wiped or cause merge conflicts when we switch branches
git remote -v
# Look for the one that is associated with "[email protected]:E3SM-Project/zstash.git"
# For me, that's "upstream"
git fetch upstream refactor-zstash
# Now do one of the following:
# git rebase upstream/refactor-zstash # Applies latest commits from the branch on GitHub.
# git reset --hard upstream/refactor-zstash # Resets local branch to exactly match the branch on GitHub.

@TonyB9000
Collaborator

@forsyth2 I would have thought this would work:

(dsm_test_e2c) [ac.bartoletti1@chrlogin2 zstash]$ git fetch origin refactor-zstash
From github.com:E3SM-Project/zstash
 * branch            refactor-zstash -> FETCH_HEAD
(dsm_test_e2c) [ac.bartoletti1@chrlogin2 zstash]$ git checkout refactor-zstash
branch 'refactor-zstash' set up to track 'origin/refactor-zstash'.
Switched to a new branch 'refactor-zstash'

But then "examples" only shows "zstash_create_globus.py", So I will try your method...

@forsyth2
Collaborator Author

forsyth2 commented Apr 15, 2025

@TonyB9000 Not quite. You need an additional command.

git fetch origin refactor-zstash
# This updates your local git's "knowledge" of what's on the branch "refactor-zstash" on your remote "origin" (i.e., what's on GitHub)
git checkout refactor-zstash
# This tells your local git to switch you to your existing local branch "refactor-zstash"

Then either:

git reset --hard origin/refactor-zstash
# This is how to tell git "hey, actually make my local branch match what's on GitHub"

or

git rebase origin/refactor-zstash
# This is how to tell git "hey, keep anything extra I added, but put any of my changes on top of the latest from GitHub"

@forsyth2
Collaborator Author

Alternatively, create a new branch based on the GitHub branch with:

git fetch origin refactor-zstash
git checkout -b my-own-refactor-zstash origin/refactor-zstash

@TonyB9000
Collaborator

TonyB9000 commented Apr 15, 2025

@forsyth2 Thanks Ryan. When I used "git fetch origin refactor-zstash" and then "git checkout refactor-zstash", I used no "-b", so I assumed it knew I was not creating my own new branch, but rather copying the existing branch from remote.

However, I have since applied git reset --hard origin/refactor-zstash and I still don't have the correct "examples".

Also git rebase origin/refactor-zstash replies with Current branch refactor-zstash is up to date.

Does it matter that I began all of this with a fresh "git clone" of zstash?

@TonyB9000
Collaborator

@forsyth2 Worse:

(dsm_test_e2c) [ac.bartoletti1@chrlogin2 zstash]$ git remote -v
origin  [email protected]:E3SM-Project/zstash.git (fetch)
origin  [email protected]:E3SM-Project/zstash.git (push)
(dsm_test_e2c) [ac.bartoletti1@chrlogin2 zstash]$ git fetch upstream refactor-zstash
fatal: 'upstream' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@forsyth2
Collaborator Author

@TonyB9000 Just use origin; I set my remote name to be upstream rather than origin.

@forsyth2
Collaborator Author

forsyth2 commented Jul 23, 2025

@TonyB9000 Remarkably, this script works! (However, I should note it still uses the fair_research_login.client NativeClient).

Script
import configparser
import os
import re
import shutil
from typing import List, Optional
from urllib.parse import ParseResult, urlparse

from fair_research_login.client import NativeClient
from globus_sdk import TransferAPIError, TransferClient, TransferData
from globus_sdk.response import GlobusHTTPResponse

# Minimal example of how Globus is used in zstash
# 1. Log into endpoints at globus.org
# File Manager > Add the endpoints in the "Collection" fields
# 2. To start fresh, with no consents:
# https://auth.globus.org/v2/web/consents > Manage Your Consents > Globus Endpoint Performance Monitoring > rescind all

HSI_DIR: str = "zstash_test_370_20250723"
ENDPOINT_NAME: str = "LCRC Improv DTN"  # Change this to the name of the endpoint you want to use
REQUEST_SCOPES_EARLY: bool = True  # False will emulate zstash behavior

# Globus-specific settings ####################################################
GLOBUS_CFG: str = os.path.expanduser("~/.globus-native-apps.cfg")
INI_PATH: str = os.path.expanduser("~/.zstash.ini")
ZSTASH_CLIENT_ID: str = "6c1629cf-446c-49e7-af95-323c6412397f"
NAME_TO_ENDPOINT_MAP = {
    # "Globus Tutorial Collection 1": "6c54cade-bde5-45c1-bdea-f4bd71dba2cc",  # The Unit test endpoint
    "NERSC HPSS": "9cd89cfd-6d04-11e5-ba46-22000b92c6ec",
    "NERSC Perlmutter": "6bdc7956-fc0f-4ad2-989c-7aa5ee643a79",
    "LCRC Improv DTN": "15288284-7006-4041-ba1a-6b52501e49f1"
}


# Functions ###################################################################
def main():
    base_dir = os.getcwd()
    print(f"Starting in {base_dir}")
    if os.path.exists(INI_PATH):
        os.remove(INI_PATH)
    if os.path.exists(GLOBUS_CFG):
        os.remove(GLOBUS_CFG)
    skipped_second_auth: bool = False
    try:
        skipped_second_auth = simple_transfer("toy_run")
    except RuntimeError:
        print("Now that we have the authentications, let's re-run.")
    print(f"For toy_run, skipped_second_auth={skipped_second_auth}")
    if skipped_second_auth:
        # We want to enter this block!
        print("We didn't need to authenticate a second time! That means we don't have to re-run the previous command to start the transfer!")
    else:
        # Previously, we ended up in this block!
        #
        # /global/homes/f/forsyth/.globus-native-apps.cfg does not exist. zstash will need to prompt for authentications twice, and then you will need to re-run.
        #
        # Might ask for 1st authentication prompt:
        # Please paste the following URL in a browser:
        # Authenticated for the 1st time!
        #
        # Might ask for 2nd authentication prompt:
        # Please paste the following URL in a browser:
        # Authenticated for the 2nd time!
        # Consents added, please re-run the previous command to start transfer
        # Now that we have the authentications, let's re-run.
        os.chdir(base_dir)
        print(f"Now in {os.getcwd()}")
        assert os.path.exists(INI_PATH)
        assert os.path.exists(GLOBUS_CFG)
        skipped_second_auth = simple_transfer("real_run")
        print(f"For real_run, skipped_second_auth={skipped_second_auth}")
        # /global/homes/f/forsyth/.globus-native-apps.cfg exists. If this file does not have the proper settings, it may cause a TransferAPIError (e.g., 'Token is not active', 'No credentials supplied')
        #
        # Might ask for 1st authentication prompt:
        # Authenticated for the 1st time!
        #
        # Bypassed 2nd authentication.
        #
        # Wait for task to complete, wait_timeout=300
        print(f"To see transferred files, run: hsi ls {HSI_DIR}")
        # To see transferred files, run: hsi ls zstash_debugging_20250415_v2
        # Shows file0.txt
    assert skipped_second_auth


def simple_transfer(run_dir: str) -> bool:
    hpss_path = f"globus://{NAME_TO_ENDPOINT_MAP['NERSC HPSS']}/~/{HSI_DIR}"
    if os.path.exists(run_dir):
        shutil.rmtree(run_dir)
    os.mkdir(run_dir)
    os.chdir(run_dir)
    print(f"Now in {os.getcwd()}")
    dir_to_archive: str = "dir_to_archive"
    txt_file: str = "file0.txt"
    os.mkdir(dir_to_archive)
    with open(f"{dir_to_archive}/{txt_file}", "w") as f:
        f.write("file contents")
    url: ParseResult = urlparse(hpss_path)
    assert url.scheme == "globus"
    if os.path.exists(GLOBUS_CFG):
        print(
            f"{GLOBUS_CFG} exists. If this file does not have the proper settings, it may cause a TransferAPIError (e.g., 'Token is not active', 'No credentials supplied')"
        )
    else:
        print(
            f"{GLOBUS_CFG} does not exist. zstash will need to prompt for authentications twice, and then you will need to re-run."
        )
    config_path: str = os.path.abspath(dir_to_archive)
    assert os.path.isdir(config_path)
    remote_endpoint: str = url.netloc
    # Simulate globus_activate > set_local_endpoint
    ini = configparser.ConfigParser()
    local_endpoint: Optional[str] = None
    if ini.read(INI_PATH):
        if "local" in ini.sections():
            local_endpoint = ini["local"].get("globus_endpoint_uuid")
    else:
        ini["local"] = {"globus_endpoint_uuid": ""}
        with open(INI_PATH, "w") as f:
            ini.write(f)
    if not local_endpoint:
        # nersc_hostname = os.environ.get("NERSC_HOST")
        # assert nersc_hostname == "perlmutter"
        local_endpoint = NAME_TO_ENDPOINT_MAP[ENDPOINT_NAME]
    native_client = NativeClient(
        client_id=ZSTASH_CLIENT_ID,
        app_name="Zstash",
        default_scopes="openid urn:globus:auth:scope:transfer.api.globus.org:all",
    )
    # May print 'Please Paste your Auth Code Below:'
    # This is the 1st authentication prompt!
    print("Might ask for 1st authentication prompt:")
    if REQUEST_SCOPES_EARLY:
        all_scopes: str = get_all_endpoint_scopes(NAME_TO_ENDPOINT_MAP.values())
        native_client.login(requested_scopes=all_scopes, no_local_server=True, refresh_tokens=True)
    else:
        native_client.login(no_local_server=True, refresh_tokens=True)
    print("Authenticated for the 1st time!")
    transfer_authorizer = native_client.get_authorizers().get("transfer.api.globus.org")
    transfer_client: TransferClient = TransferClient(authorizer=transfer_authorizer)
    for ep_id in [
        local_endpoint,
        remote_endpoint,
    ]:
        r = transfer_client.endpoint_autoactivate(ep_id, if_expires_in=600)
        assert r.get("code") != "AutoActivationFailed"
    os.chdir(config_path)
    print(f"Now in {os.getcwd()}")
    url_path: str = str(url.path)
    assert local_endpoint is not None
    src_path: str = os.path.join(os.getcwd(), txt_file)
    dst_path: str = os.path.join(url_path, txt_file)
    subdir = os.path.basename(os.path.normpath(url_path))
    subdir_label = re.sub("[^A-Za-z0-9_ -]", "", subdir)
    filename = txt_file.split(".")[0]
    label = subdir_label + " " + filename
    transfer_data: TransferData = TransferData(
        transfer_client,
        local_endpoint,  # src_ep
        remote_endpoint,  # dst_ep
        label=label,
        verify_checksum=True,
        preserve_timestamp=True,
        fail_on_quota_errors=True,
    )
    transfer_data.add_item(src_path, dst_path)
    transfer_data["label"] = label
    task: GlobusHTTPResponse
    skipped_second_auth: bool = False
    try:
        task = transfer_client.submit_transfer(transfer_data)
        print("Bypassed 2nd authentication.")
        skipped_second_auth = True
    except TransferAPIError as err:
        if err.info.consent_required:
            scopes = "urn:globus:auth:scope:transfer.api.globus.org:all["
            for ep_id in [remote_endpoint, local_endpoint]:
                scopes += f" *https://auth.globus.org/scopes/{ep_id}/data_access"
            scopes += " ]"
            native_client = NativeClient(client_id=ZSTASH_CLIENT_ID, app_name="Zstash")
            # May print 'Please Paste your Auth Code Below:'
            # This is the 2nd authentication prompt!
            print("Might ask for 2nd authentication prompt:")
            native_client.login(requested_scopes=scopes)
            print("Authenticated for the 2nd time!")
            print(
                "Consents added, please re-run the previous command to start transfer"
            )
            raise RuntimeError("Re-run now that authentications are set up!")
        else:
            if err.info.authorization_parameters:
                print("Error is in authorization parameters")
            raise err
    task_id = task.get("task_id")
    wait_timeout = 300  # 300 sec = 5 min
    print(f"Wait for task to complete, wait_timeout={wait_timeout}")
    transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=10)
    curr_task: GlobusHTTPResponse = transfer_client.get_task(task_id)
    task_status = curr_task["status"]
    assert task_status == "SUCCEEDED"
    return skipped_second_auth


def get_all_endpoint_scopes(endpoints: List[str]) -> str:
    inner = " ".join([f"*https://auth.globus.org/scopes/{ep}/data_access" for ep in endpoints])
    return f"urn:globus:auth:scope:transfer.api.globus.org:all[{inner}]"

# Run #########################################################################
if __name__ == "__main__":
    main()
Thoughts from the AI

Even with the bracketed syntax, Globus sometimes still requires two authentications the first time, especially if the endpoints have never been used before.
This is a known quirk of the Globus Auth flow and is not something you can fully control from the client side.

This double-authentication on first use is an expected behavior of Globus, not a bug in your code.
Using the bracketed scope syntax is the best practice, but it does not always guarantee a single prompt due to how Globus handles consent screens and endpoint activation.

After changing to use bracketed scope syntax:

Best Practice Summary

  • Always use bracketed scope syntax when requesting access to multiple endpoints up front.
  • If endpoints are dynamic or user-supplied, consider prompting for all known endpoints at once, or document that a second prompt may occur for new endpoints.

Collaborator Author

@forsyth2 forsyth2 left a comment

@TonyB9000 (also @golaz, if you're interested) This commit (9edaf6f) gets rid of the second authentication, but with two important caveats:

  1. It's still using the fair_research_login client; I don't know if that's a huge deal.
  2. The far bigger deal is that it now requires knowing all the scopes up front. That didn't sound too bad on paper, but:
  • Globus just fails to produce an auth code at all if any scopes are unknown.
  • To get the auth code to paste, you have to be authenticated into all machines, even ones not involved in the transfer. For example, for an LCRC Improv DTN -> NERSC HPSS transfer, I had to be logged in not only to LCRC and NERSC, but also PNNL! That's obviously an annoyance and a complete blocker for anyone who doesn't have all 3 accounts. That leaves us with two options: a) just accept that people are going to have to do a second auth paste on a toy run before doing a real run, or b) add some sort of parameter for users to pass in their local endpoint too (we know the remote from the hpss path), or else deduce it somehow, perhaps using Mache.
My run
cd ~/ez/zstash
lcrc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-370-20250723
conda activate zstash-370-20250723
pre-commit run --all-files
python -m pip install .

cd ../
mkdir zstash_test370_v2
rm ~/.globus-native-apps.cfg
# globus.org > File Manager > select "LCRC Improv DTN", "NERSC HPSS"
# https://auth.globus.org/v2/web/consents > Manage Your Consents > Globus Endpoint Performance Monitoring > rescind all
mkdir zstash_demo; echo 'file0 stuff' > zstash_demo/file0.txt
zstash create --hpss=globus://NERSC/~/manual_run_2025_07_23 zstash_demo
# Please paste the following URL in a browser:
# => UNKNOWN_SCOPE_ERROR
# client_id=6c1629cf-446c-49e7-af95-323c6412397f requested unknown scopes: ['https://auth.globus.org/scopes/08925f04-569f-11e7-bef8-22000b9a448b/data_access', 'https://auth.globus.org/scopes/de463ec4-6d04-11e5-ba46-22000b92c6ec/data_access']
# 
# Comment out:
#r"theta.*\.alcf\.anl\.gov": "08925f04-569f-11e7-bef8-22000b9a448b",
#"ALCF": "de463ec4-6d04-11e5-ba46-22000b92c6ec",
rm -rf zstash_demo
rm ~/.globus-native-apps.cfg
# rm: cannot remove '/home/ac.forsyth2/.globus-native-apps.cfg': No such file or directory
# 
# https://auth.globus.org/v2/web/consents > Manage Your Consents > Globus Endpoint Performance Monitoring > rescind all
# No consents
cd ../zstash
pre-commit run --all-files
python -m pip install .
cd ../zstash_test370_v2
mkdir zstash_demo; echo 'file0 stuff' > zstash_demo/file0.txt
zstash create --hpss=globus://NERSC/~/manual_run_2025_07_23 zstash_demo
# Only need to paste one authentication, but you have to log into LCRC, NERSC, and PNNL to get it!

@forsyth2
Collaborator Author

forsyth2 commented Jul 24, 2025

@TonyB9000

add some sort of parameter for users to pass in their local endpoint too (we know the remote from the hpss path) or else deduce it somehow, perhaps using Mache.

That's actually not necessary. The latest commit (747feb2) solves this issue. We're now able to authenticate just once and continue on, with no "toy run" needed!

Also relevant: #370 (comment) on the topic of needing transfer_client set -- it's not actually necessary either.

@forsyth2
Collaborator Author

The Globus epic lists these 3 points:

Reducing the now 7-step Globus process back down to ~2 steps.

These steps are outlined in #339. I believe today's commits resolve this issue.

  1. Login to Globus web interface and activate end points -- This was always needed.
  2. Delete existing globus cfg file -- We found in the distilled scripts that if it doesn't exist, zstash will need to prompt for authentications twice. That's no longer an issue with today's commits, so maybe we can just always auto-delete it? We also found that if it did exist but had improper settings, it would cause a TransferAPIError.
  3. Start interactive zstash test transfer -- Now skippable
  4. Start a second interactive zstash transfer -- Now skippable
  5. Start long transfer -- we can jump straight here, but if we don't have the Globus consents, there is simply no way to get around pasting at least one auth code at the beginning. Crucially though, we don't need to start over after doing that.
  6. Using Globus web interface, manually transfer zstash files that were not transferred due to token expiration. -- This relates to the third point below
  7. Repeat steps (2) to (4) above. Restart zstash archiving that stopped: -- Also relates to the third point below

Simplify the Globus logic.

That's the bulk of this refactor PR: organizing information into objects and refactoring functions to be easier to comprehend. Before merging, of course, I want to do some final code cleanup.

Extend the life of a Globus transfer, with refresh tokens.

This is the only one I'm uncertain about. We do have refresh_tokens=True though. We'd need to test on a long simulation. @TonyB9000 Perhaps you could try that using the latest code?

@forsyth2
Collaborator Author

We do have refresh_tokens=True though

Actually, that was true on main too, so that can't be it.

@TonyB9000
Collaborator

@forsyth2 Ryan, you've done outstanding work here. I read through the thread, and I'm eager to test this. I have plenty of actual transfers I need to perform (from NERSC_HPSS to LCRC_IMPROV_DTN).

Code-wise, I don't have anything I need to save, so I intend to do a fresh git-clone of zstash and conduct "zstash --check" as a means of pulling over an archive (and not continue with any "zstash extract" at that time).

ASIDE: I was pulling two archives over using the globus web, but 25% in, a NERSC admin suspended the job because it continued to fail on a file presenting permission issues. (FILE=zstash_check_20240818.log.gz. How are these used?)

Just this moment, Wuyin Lin got back to me and reset the permissions (global read), so I will see if I can get the admin to resume that transfer.

(QUESTION: If I already have a long-running transfer ongoing via globus-web, would that compromise a "zstash --check" test?)

@forsyth2
Collaborator Author

you've done outstanding work here. I read through the thread, and I'm eager to test this. I have plenty of actual transfers I need to perform (from NERSC_HPSS to LCRC_IMPROV_DTN.)

Thanks, great! I'm concerned we haven't solved the "tokens expire too soon" problem, so it would be good to check on that with a long transfer. And debug from there, if needed.

Let me know if you have trouble using the code from this branch.

(FILE=zstash_check_20240818.log.gz, How are these used?

I don't think anything in zstash itself creates logs like that, so it would depend on what program is saving the zstash output to gzipped logs.

If I already have a long-running transfer ongoing via globus-web, would that compromise a "zstash --check" test?

  • On the same data? Wouldn't check just fail then because the transfer's not complete?
  • On different data? I wouldn't think so, but I'm not 100% sure.

@TonyB9000
Collaborator

@forsyth2 I have a long-running transfer of "1pctCO2" and "abrupt-4xCO2" archives from NERSC to LCRC. But I have another ~27 archives to transfer (not all at once, but as needed). These would be different transfers, but involve the same endpoints (NERSC_HPSS to LCRC_IMPROV_DTN). So, if already authenticated to the endpoints, I was just thinking that it would not be a thorough test to use "zstash --check" to transfer another archive. It would be "a test", but we (more generally) want to test when credentials have "timed out" (not "expired", but new authentication is required).

I suppose both tests would be useful, so I should set up a run.

I'll git-clone the zstash repo, checkout the branch ("refactor-zstash"), and attempt a transfer.

@forsyth2
Collaborator Author

we (more generally) want to test when credentials have "timed-out"

Oh yes, that's true. We wouldn't be able to properly test the endpoint authentication if you have another transfer going on.

@TonyB9000
Collaborator

@forsyth2 NVMD - I left off the "--" on help. (Feels weird that "zstash --version" fails (must use "zstash version") but "zstash help" and "zstash command help" are not accepted...)

@forsyth2 forsyth2 mentioned this pull request Jul 30, 2025
@forsyth2
Collaborator Author

forsyth2 commented Aug 4, 2025

@TonyB9000, I'm responding to email with subject "Progress testing zstash-refactor" here because it is easier to write code in Markdown.

ISSUING zstash check command: zstash check -v --keep --cache /lcrc/group/e3sm2/DSM/Ops/Transfers/zstash/test_cache --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/w/wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201S
[POINT 1: Is this the correct way to issue a "zstash check" on the remote NERSC archive?]

Yes. You've specified a cache and an HPSS archive. The assumption is that you're running this command from a directory without an existing index.db in it.

DEBUG: Updated config using db. Now, maxsize=137438953472, path=/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201, hpss=globus://nersc/~/E3SMv3/v3.LR.hist-xGHG-xaer_0201, hpss_type=HPSSType.GLOBUS
But the line with path=/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201 makes no sense. Where did that come from?

Debugging
# On branch refactor-zstash

git grep -n "Updated config using db. Now,"
# zstash/utils.py:129:            f"Updated config using db. Now, maxsize={self.config.maxsize}, path={self.config.path}, hpss={self.config.hpss}, hpss_type={self.hpss_type}"

git grep -n "self.config.path = "
examples/zstash_create_globus.py:55:        self.config.path = os.path.abspath(dir_to_archive)
# zstash/utils.py:72:            self.config.path = abs_path

In zstash/utils.py, we have:

    def set_dir_to_archive(self, path: str):
        abs_path = os.path.abspath(path)
        if abs_path is not None:
            self.config.path = abs_path
            self.dir_to_archive_relative = path
        else:
            raise ValueError(f"Invalid path={path}")
git grep -n "\.set_dir_to_archive("
# zstash/create.py:176:    command_info.set_dir_to_archive(args.path)
# zstash/extract.py:107:    command_info.set_dir_to_archive(os.getcwd())
# zstash/ls.py:75:    command_info.set_dir_to_archive(os.getcwd())
# zstash/update.py:110:    command_info.set_dir_to_archive(os.getcwd())

We're on zstash check, so let's follow zstash extract:

def setup_extract(command_info: CommandInfo, arg_list: List[str]) -> argparse.Namespace:
    # [...]
    if args.cache:
        command_info.cache_dir = args.cache
    command_info.keep = args.keep
    command_info.set_dir_to_archive(os.getcwd())
    command_info.set_hpss_parameters(args.hpss, null_hpss_allowed=True)

Let's look at what this was on the main branch.

In zstash/settings.py, we have:

# Class to hold configuration
class Config(object):
    path: Optional[str] = None
    hpss: Optional[str] = None
    maxsize: Optional[int] = None
git grep -n "config\.path" zstash
# zstash/create.py:31:    if config.path is not None:
# zstash/create.py:32:        path: str = config.path
# zstash/create.py:34:        raise TypeError("Invalid config.path={}".format(config.path))
# zstash/create.py:179:    config.path = os.path.abspath(args.path)
# zstash/extract.py:207:    logger.debug("Local path : {}".format(config.path))
# zstash/update.py:123:    # config.path = os.path.abspath(args.path)
# zstash/update.py:189:    logger.debug("Local path : {}".format(config.path))

Again, we follow the extract:

def extract_database(
    args: argparse.Namespace, cache: str, keep_files: bool
) -> List[FilesRow]:
    # [...]
    logger.debug("Running zstash " + cmd)
    logger.debug("Local path : {}".format(config.path))
    logger.debug("HPSS path  : {}".format(config.hpss))
    logger.debug("Max size  : {}".format(config.maxsize))
    logger.debug("Keep local tar files : {}".format(keep))

It looks to me like path is more-or-less just for debugging, when used on the check/extract side anyway.

You have:

cache: /lcrc/group/e3sm2/DSM/Ops/Transfers/zstash/test_cache
HPSS: globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec/ ==> home/w/wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201S
path: /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201

Was /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201 where you were running from?

DEBUG: Local path : /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201
[Point 2: Is zstash confused? The DEBUG line above has a path from Wuyin Lin’s local (/LCRC) directory]

Debugging
# Back on branch refactor-zstash
git grep -n "Local path :" zstash
# zstash/extract.py:148:    logger.debug(f"Local path : {command_info.config.path}")
# zstash/update.py:153:    logger.debug(f"Local path : {command_info.config.path}")

Following extract:

def extract_database(
    command_info: CommandInfo, args: argparse.Namespace, do_extract_files: bool
) -> List[FilesRow]:
    # [...]
    logger.debug("Running zstash " + cmd)
    logger.debug(f"Local path : {command_info.config.path}")
    logger.debug(f"HPSS path  : {command_info.config.hpss}")
    logger.debug(f"Max size  : {command_info.config.maxsize}")
    logger.debug(f"Keep local tar files : {command_info.keep}")

So, same point/question as above: Was /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201 where you were running from?

INFO: Transferring file from HPSS: /lcrc/group/e3sm2/DSM/Ops/Transfers/zstash/test_cache/000000.tar
The 6th line above says “from HPSS: /lcrc/group/e3sm2/DSM/Ops/Transfers/zstash/test_cache/000000.tar”, but it should say “to”.
No wonder it is hanging – if it is now going in the wrong direction. Or is the messaging messed up?

Debugging
git grep -n "Transferring file " zstash
# zstash/hpss.py:68:        logger.info(f"Transferring file {transfer_word} HPSS: {file_path}")
# zstash/hpss.py:123:            error_str: str = f"Transferring file {transfer_word} HPSS: {name}"

It's an INFO line, so let's follow the first.

def hpss_transfer(
    command_info: CommandInfo,
    file_path: str,
    transfer_type: str,
    non_blocking: bool = False,
):
    # [...]
    transfer_word: str
    transfer_command: str
    if transfer_type == "put":
        transfer_word = "to"
        transfer_command = "put"
    elif transfer_type == "get":
        transfer_word = "from"
        transfer_command = "get"
    else:
        raise ValueError("Invalid transfer_type={}".format(transfer_type))
    logger.info(f"Transferring file {transfer_word} HPSS: {file_path}")

So, we must be about to do a "get", which makes sense because check runs hpss_get to get the file.

Let's go up a level.

git grep -n "hpss_transfer("
# zstash/hpss.py:14:def hpss_transfer(
# zstash/hpss.py:157:    hpss_transfer(command_info, file_path, "put", non_blocking)
# zstash/hpss.py:164:    hpss_transfer(command_info, file_path, "get")

Let's follow the get path.

def hpss_get(command_info: CommandInfo, file_path: str):
    """
    Get a file from the HPSS archive.
    """
    hpss_transfer(command_info, file_path, "get")
git grep -n "hpss_get("
# zstash/extract.py:122:            hpss_get(command_info, command_info.get_db_name())
# zstash/extract.py:518:                hpss_get(command_info, tfname)
# zstash/hpss.py:160:def hpss_get(command_info: CommandInfo, file_path: str):
# zstash/ls.py:89:                hpss_get(command_info, command_info.get_db_name())
# zstash/update.py:128:            hpss_get(command_info, command_info.get_db_name())

So, for extract, we can see we're either getting the database file or a specific tarfile.

In extract.py:

def open_tar_with_retries(
    command_info: CommandInfo,
    files_row: FilesRow,
    args: argparse.Namespace,
    cur: sqlite3.Cursor,
    multiprocess_worker: Optional[parallel.ExtractWorker] = None,
) -> Tuple[str, tarfile.TarFile]:
    # [...]
    tfname: str = os.path.join(command_info.cache_dir, files_row.tar)
    # [...]
    if do_retrieve:
        hpss_get(command_info, tfname)
        if not check_sizes_match(cur, tfname):
            raise RuntimeError(f"{tfname} size does not match expected size.")

Let's compare to the main branch:

def extractFiles(  # noqa: C901
    files: List[FilesRow],
    keep_files: bool,
    keep_tars: Optional[bool],
    cache: str,
    cur: sqlite3.Cursor,
    args: argparse.Namespace,
    multiprocess_worker: Optional[parallel.ExtractWorker] = None,
) -> List[FilesRow]:
    # [...]
    tfname = os.path.join(cache, files_row.tar)
    # [...]
    if do_retrieve:
        hpss_get(hpss, tfname, cache)
        if not check_sizes_match(cur, tfname):
            raise RuntimeError(
                f"{tfname} size does not match expected size."
            )

Let's dive into hpss.hpss_get on main:

def hpss_get(hpss: str, file_path: str, cache: str):
    """
    Get a file from the HPSS archive.
    """
    hpss_transfer(hpss, file_path, "get", cache, False)

And hpss.hpss_transfer:

def hpss_transfer(
    hpss: str,
    file_path: str,
    transfer_type: str,
    cache: str,
    keep: bool = False,
    non_blocking: bool = False,
    is_index: bool = False,
):
    # [...]
    url = urlparse(hpss)
    # [...]
    url_path = url.path
    # [...]
    path, name = os.path.split(file_path)
    # [...]
    globus_status = globus_transfer(
        endpoint, url_path, name, transfer_type, non_blocking
    )

Ok, so on main we're transferring from the HPSS path to a file with the name from that file path passed in that includes the cache.

Let's go back to refactor-zstash.

def hpss_transfer(
    command_info: CommandInfo,
    file_path: str,
    transfer_type: str,
    non_blocking: bool = False,
):
    # [...]
    url = urlparse(command_info.config.hpss)
    # [...]
    url_path: str = str(url.path)
    # [...]
    path, name = os.path.split(file_path)
    # [...]
    globus_status = globus_transfer(
        command_info.globus_info,
        endpoint,
        url_path,
        name,
        transfer_type,
        non_blocking,
    )
    # [...]

So, we can see it's basically doing the same thing here. It seems like the confusion is the naming of tfname: str = os.path.join(command_info.cache_dir, files_row.tar) (or equivalent) on both branches. zstash parses what it needs from there, but I guess that raises the question of why we join the cache directory in the first place.

Following git blame for tfname = on zstash/extract.py back far enough, we get to c17378c#diff-9e3fd3b1ceb051849f7bab0e0f243bccb68f375e45c343021b09d1072fdc6f2cR82, from 2017.

So, it seems like a debugging line based on code debt at this point. The actual functionality seems ok based on the above analysis. That is, "is the messaging messed up?" seems to be the case: the message is misleading, but the transfer itself goes in the right direction.
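To make the messaging point concrete: only the basename of tfname ever reaches globus_transfer, so the cache prefix in the log line is cosmetic. A small illustration, using the cache path from this run:

import os

# hpss_transfer() receives the cache-joined path ...
tfname = os.path.join(
    "/lcrc/group/e3sm2/DSM/Ops/Transfers/zstash/test_cache", "000000.tar"
)
# ... but splits it and hands only the basename to globus_transfer();
# the source directory comes from the HPSS URL instead.
path, name = os.path.split(tfname)
print(name)  # 000000.tar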

Globus seems to have the right path (/~/E3SMv3/v3.LR.hist-xGHG-xaer_0201/000000.tar), and the right endpoint, but says it is “not found”.
And yet I see via the Globus web:
The files are where they should be.  So somewhere the full paths (or directionality) is wrong.

Your first Globus screenshot has "path": "/~/E3SMv3/v3.LR.hist-xGHG-xaer_0201/000000.tar". Your second has /home/w/wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201/.

That is, unless your home directory is /home/w/wlin, those aren't the same paths...

@TonyB9000
Collaborator

TonyB9000 commented Aug 4, 2025

@forsyth2 In every case, my directory is /lcrc/group/e3sm2/DSM/Ops/Transfers/zstash. There, I run

run_zstash_check.sh

#!/bin/bash

# mapline="LR:RFMIP,v3.LR.piClim-histaer_0201,1,na,na,/home/k/kaizhang/E3SM/E3SMv3/v3.LR.piClim-histaer/v3.LR.piClim-histaer_0201"
mapline="LR:DAMIP,v3.LR.hist-xGHG-xaer_0201,1,na,na,/home/w/wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201"

src_path=`echo $mapline | cut -f6 -d,`

NERSC_HPSS="9cd89cfd-6d04-11e5-ba46-22000b92c6ec"
local_cache="/lcrc/group/e3sm2/DSM/Ops/Transfers/zstash/test_cache"

cmd="zstash check -v --keep --cache $local_cache --hpss=globus://$NERSC_HPSS/$src_path"

echo "ISSUING zstash check command: $cmd"

$cmd

(NOTE: I switched from v3.LR.piClim-histaer_0201 to v3.LR.hist-xGHG-xaer_0201 because Kai Zhang's archive has the tarfiles in a "zstash" subdirectory, while Wuyin Lin's does not.)

The zstash command the script produced was:

zstash check -v --keep --cache /lcrc/group/e3sm2/DSM/Ops/Transfers/zstash/test_cache --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/w/wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201

@TonyB9000
Collaborator

@forsyth2 I want to reiterate: the "zstash check" command successfully pulled over the NERSC "index.db" file to the designated local cache directory (PWD)/test_cache/, but thereafter failed to transfer the tar files from the same NERSC path.

The image in my email was a snapshot I took to demonstrate that the files are available (although I cannot tell what read permissions they have). The first image was the Globus complaint about a path I don't quite understand.

The sequence "/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.hist-xGHG-xaer_0201" has no place in this operation, unless something in the tar archives has a symlink to such a path.

@TonyB9000
Collaborator

@forsyth2 FWIW, although "tar tvf" will reveal if a file is actually a symlink, the sqlite3 "files" table will not. That should not matter in this case, as "zstash check" was not given any specific files to consider.

@forsyth2 forsyth2 mentioned this pull request Aug 5, 2025