Skip to content

feat: add watch mode for continuous checkpoint evaluation#857

Merged
marta-sd merged 47 commits intomainfrom
martas/watcher
Mar 25, 2026
Merged

feat: add watch mode for continuous checkpoint evaluation#857
marta-sd merged 47 commits intomainfrom
martas/watcher

Conversation

@marta-sd
Copy link
Copy Markdown
Contributor

No description provided.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Mar 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added documentation Improvements or additions to documentation nemo-evaluator-launcher tests labels Mar 17, 2026
@marta-sd marta-sd changed the title feat: add watch mode for continuous checkpoint evaluatiob feat: add watch mode for continuous checkpoint evaluation Mar 18, 2026
Comment thread docs/references/api/nemo-evaluator-launcher/api.md
@marta-sd marta-sd force-pushed the martas/watcher branch 2 times, most recently from 25a57eb to b311bd6 Compare March 19, 2026 15:07
Copy link
Copy Markdown
Collaborator

@agronskiy agronskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of small things, generally LGTM after those fixed -- unblocking with one-shot approve


WATCH_STATE_FILE = os.getenv(
"NEMO_EVALUATOR_WATCH_STATE_FILE",
Path.home() / ".nemo-evaluator" / "watch-state" / "watch-state.v1.jsonl",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug: when the env var is set, the getenv would get str, not Path, and down at the calling site it at line 38 I think it will crash because str does not have those methods.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in a9855d6

assert local_source.is_dir()
remote_destination_str = f"{username}@{hostname}:{remote_target}"
local_sources_str = " ".join(map(str, local_sources))
rsync_upload_command = f"rsync -qcaz {local_sources_str} {remote_destination_str}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: on mac, oftentimes the local paths have paths w/ spaces (yeah we linux users should think about those mac ppl), would not that break? I suggest hacvin command as a list with sub-parts - this way the .Run below will pass them to bash properly tokenized.

Copy link
Copy Markdown
Contributor Author

@marta-sd marta-sd Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was just moved from one place to another, so I assume it's functional, but better be on the safe side 👍 Done in 59dba76

"""
if socket is None:
return
ssh_command = f"ssh -O exit -S {socket} {username}@{hostname}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I would recommand to check on all the commands to be passed as lists -- shlex.split later will break if socket path on Mac contains spaces

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and it's inconsistent with how we open it 👍 Fixed in 6fda968

if socket is None:
return
ssh_command = f"ssh -O exit -S {socket} {username}@{hostname}"
completed_process = subprocess.run(args=shlex.split(ssh_command))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug(?): the default of stderr param is None, I think this would lead to double-raise later when stderr.decode is called

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's no reason for us to throw an error here. I replaced with similar log.error message to the one we have when opening failed (6fda968)

gchlebus and others added 17 commits March 24, 2026 12:05
Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
- Add nel-watch script entry in pyproject.toml
- Remove watch subcommand from nel CLI (cli/main.py)
- Simplify cli/watch.py: remove --tmux, --env-file, _build_watch_command,
  _launch_in_tmux; add main() entrypoint; make CLI a thin 1:1 wrapper
  over the Python API
- Remove tmux/env-file/build_watch_command tests from test_watch.py
- Remove unused MagicMock import

Addresses team feedback: Unix philosophy (do one thing well),
thin CLI wrapper, separate entrypoint for watch functionality.

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
…n API)

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
…e execution.sbatch_dependency

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
@marta-sd
Copy link
Copy Markdown
Contributor Author

/ok to test bfed76c

@marta-sd marta-sd marked this pull request as ready for review March 24, 2026 14:28
@marta-sd marta-sd requested review from a team as code owners March 24, 2026 14:28
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
@marta-sd
Copy link
Copy Markdown
Contributor Author

/ok to test 9df4cf9

Signed-off-by: Marta Stepniewska-Dziubinska <martas@nvidia.com>
@marta-sd
Copy link
Copy Markdown
Contributor Author

/ok to test 25c436f

@marta-sd marta-sd merged commit 9e3f445 into main Mar 25, 2026
48 checks passed
@marta-sd marta-sd deleted the martas/watcher branch March 25, 2026 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation nemo-evaluator-launcher tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants