
feat(submit): display cluv submit first commands and results in rich tables#57

Open
Copilot wants to merge 4 commits into master from copilot/improve-cluv-submit-output

Contributor

Copilot AI commented May 14, 2026

cluv submit first printed submission results and the watch-progress hint as flat console.print lines, making it hard to scan across multiple clusters. The sbatch commands also appeared as an interleaved blob of text.

Changes

  • _build_commands_table() — new helper that renders a rich.Table (ROUNDED box) with Cluster and Command columns (overflow="fold" so long commands wrap cleanly within the cell); displayed before submission so the user can see exactly what will run on each cluster
  • _build_submission_table() — new helper that renders a rich.Table (ROUNDED box) with Cluster and Job ID / Status columns; job IDs in green, errors (both non-zero sbatch exit and remote exceptions) in red; sbatch's own error prefix is preserved without duplication
  • sbatch() — gains a display keyword argument (default True) so submit_first can pass display=False, preventing the per-cluster command log from appearing as an interleaved blob during the parallel gather
  • submit_first() — pre-computes all sbatch commands before the gather, prints the commands table, then runs the gather with display=False; replaces the manual result loop with the submission table; moves the sacct watch-progress hint to after cancel_all_jobs() so it always appears last
  • Tests: TestBuildSubmissionTable and TestBuildCommandsTable cover success, sbatch failure, remote exception, and rendered output content (via rich.Console → StringIO)
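
As a rough illustration of the table helpers and the Console-to-StringIO test approach listed above, here is a minimal sketch. It is not the PR's actual code: the function names `build_commands_table` and `render`, the `show_lines=True` setting, and the 80-column width are assumptions made for the example; only the ROUNDED box, the Cluster and Command columns, and `overflow="fold"` come from the description.

```python
from io import StringIO

from rich import box
from rich.console import Console
from rich.table import Table


def build_commands_table(commands: dict[str, str]) -> Table:
    """Hypothetical stand-in for _build_commands_table()."""
    # show_lines=True draws a separator between rows (an assumption here).
    table = Table(title="sbatch Commands", box=box.ROUNDED, show_lines=True)
    table.add_column("Cluster", no_wrap=True)
    # overflow="fold" wraps long commands inside the cell instead of cropping.
    table.add_column("Command", overflow="fold")
    for cluster, command in commands.items():
        table.add_row(cluster, command)
    return table


def render(table: Table, width: int = 80) -> str:
    """Render a table to plain text so tests can assert on what users see."""
    buffer = StringIO()
    Console(file=buffer, width=width, color_system=None).print(table)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render(build_commands_table({"mila": "sbatch --parsable job.sh"})))
```

Asserting on the rendered text rather than on the table's internal attributes keeps the tests tied to the public output.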

Example output after the change:

                           sbatch Commands
╭──────────────┬───────────────────────────────────────────────────────────────╮
│ Cluster      │ Command                                                       │
├──────────────┼───────────────────────────────────────────────────────────────┤
│ mila         │ bash --login -c 'SBATCH_TIME=3:00:00 GIT_COMMIT=abc123        │
│              │ sbatch --parsable --chdir=repos/cluv ~/repos/cluv/job.sh'     │
├──────────────┼───────────────────────────────────────────────────────────────┤
│ narval       │ bash --login -c 'SBATCH_TIME=3:00:00 GIT_COMMIT=abc123        │
│              │ sbatch --parsable --chdir=repos/cluv ~/repos/cluv/job.sh'     │
╰──────────────┴───────────────────────────────────────────────────────────────╯

                         Job Submission Results
╭──────────────┬───────────────────────────────────────────────────────────────╮
│ Cluster      │ Job ID / Status                                               │
├──────────────┼───────────────────────────────────────────────────────────────┤
│ mila         │ 12345                                                         │
├──────────────┼───────────────────────────────────────────────────────────────┤
│ narval       │ sbatch: error: Batch job submission failed: Unspecified error │
├──────────────┼───────────────────────────────────────────────────────────────┤
│ rorqual      │ 67890                                                         │
╰──────────────┴───────────────────────────────────────────────────────────────╯

To watch the job: ssh mila sacct -j 12345

- Add _build_submission_table() helper that renders a ROUNDED rich.Table
  with Cluster / Job ID columns (green job IDs, red error messages)
- Replace the plain console.print loop in submit_first() with the table
- Move the sacct watch hint to after the table (and after cancel_all_jobs)
- Use zip(..., strict=True) to catch length mismatches
- Add TestBuildSubmissionTable unit tests that use rich Console rendering
  to avoid brittle internal-API access

Agent-Logs-Url: https://github.com/mila-iqia/cluv/sessions/bcbbed1d-5595-4ef3-9164-37d2628488d7

Co-authored-by: lebrice <13387299+lebrice@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Improve cluv submit output with tables" to "feat(submit): display cluv submit first results in a rich table" May 14, 2026
Copilot AI requested a review from lebrice May 14, 2026 17:43
Contributor

lebrice commented May 14, 2026

@copilot Here's what the end of the output looks like now:

$ cluv submit first scripts/job.sh -- python --version
(...)
✓ Syncing project               36/36 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:01:04 0:00:00
✓ local -> mila - Done.         4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:08 0:00:00
✓ local -> tamia - Done.        4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:28 0:00:00
✓ local -> killarney - Done.    4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:07 0:00:00
✓ local -> vulcan - Done.       4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:05 0:00:00
✓ local -> rorqual - Done.      4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:11 0:00:00
✓ local -> fir - Done.          4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:56 0:00:00
✓ local -> trillium - Done.     4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:09 0:00:00
✓ local -> trillium-gpu - Done. 4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:09 0:00:00
✓ local -> narval - Done.       4/4   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:13 0:00:00
[16:12:06] (mila) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=0 WANDB_MODE=online SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable --chdir=repos/cluv                 submit.py:300
           ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                                                
           (tamia) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=1 WANDB_MODE=offline SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable --chdir=repos/cluv               submit.py:300
           ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                                                
           (killarney) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=1 WANDB_MODE=offline SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable --chdir=repos/cluv           submit.py:300
           ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                                                
           (vulcan) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=1 WANDB_MODE=offline SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable --chdir=repos/cluv              submit.py:300
           ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                                                
           (rorqual) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=1 WANDB_MODE=offline SBATCH_ACCOUNT=rrg-bengioy-ad SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable  submit.py:300
           --chdir=repos/cluv  ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                            
           (fir) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=0 WANDB_MODE=online SBATCH_ACCOUNT=rrg-bengioy-ad SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable       submit.py:300
           --chdir=repos/cluv  ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                            
           (trillium) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=1 WANDB_MODE=offline SBATCH_ACCOUNT=rrg-bengioy-ad SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable submit.py:300
           --chdir=repos/cluv  ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                            
           (trillium-gpu) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=1 WANDB_MODE=offline SBATCH_ACCOUNT=rrg-bengioy-ad SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch        submit.py:300
           --parsable --chdir=repos/cluv  ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                 
           (narval) $ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_REQUEUE=1 UV_OFFLINE=1 WANDB_MODE=offline SBATCH_ACCOUNT=def-bengioy SBATCH_JOB_NAME=cluv-job GIT_COMMIT=a5208d1635e7fc3e34eb77019425260f3fb20b07 sbatch --parsable      submit.py:300
           --chdir=repos/cluv  ~/repos/cluv/scripts/job.sh python --version'                                                                                                                                                                            
                                                        Job Submission Results                                                        
╭──────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Cluster      │ Job ID / Status                                                                                                     │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ mila         │ 9553020                                                                                                             │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ tamia        │ 295604                                                                                                              │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ killarney    │ error: sbatch: error: ------------------------------------------------------                             │
│              │ sbatch: error: Submitting jobs from directories residing in /home is                                     │
│              │ sbatch: error: not permitted. Transfer your files to a directory in                                      │
│              │ sbatch: error: /scratch or /project and submit the job from there.                                       │
│              │ sbatch: error: ------------------------------------------------------                                    │
│              │ sbatch: error: Batch job submission failed: Unspecified error                                                       │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ vulcan       │ 4965164                                                                                                             │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ rorqual      │ 12333400                                                                                                            │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ fir          │ error: sbatch: error: --------------------------------------------------------------------------------       │
│              │ sbatch: error: You are not a member of the specified account rrg-bengioy-ad.                                        │
│              │ sbatch: error: You cannot use this account to submit a job.                                                         │
│              │ sbatch: error: Please use one of the following accounts:                                                            │
│              │ sbatch: error:   RAS default accounts: def-bengioy, def-normandf,                                                   │
│              │ sbatch: error:           RAC accounts:                                                                              │
│              │ sbatch: error: Compute-Burst accounts:                                                                              │
│              │ sbatch: error:         Other accounts: cc-debug,                                                                    │
│              │ sbatch: error: Use the parameter --account=desired_account when submitting your job.                                │
│              │ sbatch: error: If you think that you should be a member of rrg-bengioy-ad and the PI or account coordinator agrees, │
│              │ sbatch: error: the instructions to add your account by the PI/account coordinator can be found here:                │
│              │ sbatch: error: https://docs.computecanada.ca/wiki/Using_a_resource_allocation#Information_on_each_resource          │
│              │ sbatch: error: Please note than any changes made to account membership in CCDB                                      │
│              │ sbatch: error: may take up to 30 minutes to apply on a cluster.                                                     │
│              │ sbatch: error: --------------------------------------------------------------------------------                 │
│              │ sbatch: error: Batch job submission failed: Unspecified error                                                       │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trillium     │ error: SBATCH WARNING:                                                                                              │
│              │  Following the job script name (/home/normandf/repos/cluv/scripts/job.sh),                                          │
│              │  there are one or more sbatch-looking arguments (--version), but arguments for                                      │
│              │  sbatch must be given before the job script, or else they will be ignored and                                       │
│              │  passed as arguments to the job script.                                                                             │
│              │ SBATCH WARNING:                                                                                                     │
│              │  Scheduling is by whole 192-core node, but --nodes is not set. This job will                                        │
│              │  take 1 node(s), i.e. 192 physical cores.                                                                           │
│              │ SBATCH ERROR:                                                                                                       │
│              │  The --mem-per-cpu=... request is not allowed nor necessary on Trillium; all                                        │
│              │  nodes have the same amount of available memory, and the available memory per                                       │
│              │  thread is the total available memory (745 GiB) divided by the number of                                            │
│              │  threads on the node.                                                                                               │
│              │ SBATCH ERROR:                                                                                                       │
│              │  Walltime must be at least 15 minutes (except on the debug partition)                                               │
│              │  (/home/normandf/repos/cluv/scripts/job.sh, line #5)                                                                │
│              │ SBATCH: 2 warnings were found.                                                                                      │
│              │ SBATCH: 2 errors were found.                                                                                        │
│              │ SBATCH: Job not submitted because of these errors.                                                                  │
│              │ SBATCH: For more info see https://docs.alliancecan.ca/wiki/Trillium_Quickstart                                      │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trillium-gpu │ error: SBATCH WARNING:                                                                                              │
│              │  Following the job script name (/home/normandf/repos/cluv/scripts/job.sh),                                          │
│              │  there are one or more sbatch-looking arguments (--version), but arguments for                                      │
│              │  sbatch must be given before the job script, or else they will be ignored and                                       │
│              │  passed as arguments to the job script.                                                                             │
│              │ SBATCH ERROR:                                                                                                       │
│              │  No --gpu-per-nodes option given                                                                                    │
│              │ SBATCH ERROR:                                                                                                       │
│              │  The --mem... options are not allowed on Trillium; per gpu, there is always 186                                     │
│              │  GiB of host memory available, while whole-node jobs get the full memory of the                                     │
│              │  node (745 GiB)                                                                                                     │
│              │ SBATCH: 1 warning was found.                                                                                        │
│              │ SBATCH: 2 errors were found.                                                                                        │
│              │ SBATCH: Job not submitted because of these errors.                                                                  │
│              │ SBATCH: For more info see https://docs.alliancecan.ca/wiki/Trillium_Quickstart                                      │
├──────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ narval       │ 60967036                                                                                                            │
╰──────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[16:12:12] Job 295604 on cluster tamia is running. Cancelling the other jobs...                                                                                                                                                            submit.py:201
Cancelled job 4965164 on cluster vulcan.
Cancelled job 9553020 on cluster mila.
Cancelled job 12333400 on cluster rorqual.
Cancelled job 60967036 on cluster narval.

To watch the job: ssh tamia sacct -j 295604

The alignment of the tables is slightly wonky when there are error messages coming from sbatch. Perhaps that could be fixed with textwrap? I'm not sure.

Also, the commands used for each cluster still show up as one big blob of text. Can we use a table for that as well?

- Add _build_commands_table() helper: pre-compute all sbatch commands
  before the parallel gather and display them in a ROUNDED rich.Table
  with overflow="fold" so long commands wrap cleanly within the column.
  The gather now passes display=False to avoid the interleaved command
  blob in the console output.
- sbatch() gains a display kwarg (default True) so the single-cluster
  path is unchanged.
- _build_submission_table: drop the redundant "error: " prefix on
  sbatch stderr rows — sbatch already includes "sbatch: error:" /
  "SBATCH ERROR:" itself, producing an ugly double-error line.
- Move StringIO / Console imports to module level in test_submit.py.

Agent-Logs-Url: https://github.com/mila-iqia/cluv/sessions/d363751b-2ea8-4638-a12c-77d466042183

Co-authored-by: lebrice <13387299+lebrice@users.noreply.github.com>
Contributor Author

Copilot AI commented May 14, 2026

@copilot Here's what the end of the output looks like now:

$ cluv submit first scripts/job.sh -- python --version

...

Both issues addressed in d9b0db5:

Commands table — sbatch commands are now pre-computed before the parallel gather and shown in a _build_commands_table() table with overflow="fold" so long lines fold neatly within the column. The gather runs with display=False, eliminating the interleaved blob entirely.

Error alignment — the doubled error: sbatch: error: came from us prepending "error: " when result.stderr already starts with sbatch: error: / SBATCH ERROR:. Dropped our redundant prefix; errors stay red.
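
To make the prefix issue concrete, here is a small sketch. Note that it is a variant, not the PR's change: the PR simply drops the prefix, whereas this hypothetical `format_status` helper keeps an "error: " label only when the message does not already carry one of sbatch's own prefixes. The function name, its parameters, and the prefix list are assumptions for the example.

```python
# Prefixes that sbatch (or a site wrapper) already puts on its own messages,
# as seen in the output pasted earlier in this thread.
KNOWN_PREFIXES = ("sbatch: error:", "SBATCH ERROR:", "SBATCH WARNING:")


def format_status(returncode: int, stdout: str, stderr: str) -> str:
    """Return the text for the "Job ID / Status" cell of one cluster."""
    if returncode == 0:
        # With --parsable, sbatch prints the job ID on stdout.
        return stdout.strip()
    message = stderr.strip()
    if message.startswith(KNOWN_PREFIXES):
        # Already labelled: adding "error: " would produce a doubled prefix.
        return message
    return f"error: {message}"


if __name__ == "__main__":
    print(format_status(0, "12345\n", ""))
    print(format_status(1, "", "sbatch: error: Batch job submission failed"))
    print(format_status(1, "", "Connection refused"))
```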

Here's what the output now looks like:

                           sbatch Commands
╭──────────────┬───────────────────────────────────────────────────────────────────────╮
│ Cluster      │ Command                                                               │
├──────────────┼───────────────────────────────────────────────────────────────────────┤
│ mila         │ bash --login -c 'SBATCH_TIME=3:00:00 GIT_COMMIT=abc123 sbatch         │
│              │ --parsable --chdir=repos/cluv  ~/repos/cluv/scripts/job.sh python     │
│              │ --version'                                                            │
├──────────────┼───────────────────────────────────────────────────────────────────────┤
│ narval       │ bash --login -c 'SBATCH_TIME=3:00:00 SBATCH_ACCOUNT=def-bengioy       │
│              │ GIT_COMMIT=abc123 sbatch --parsable --chdir=repos/cluv                │
│              │ ~/repos/cluv/scripts/job.sh python --version'                         │
╰──────────────┴───────────────────────────────────────────────────────────────────────╯

                         Job Submission Results
╭──────────────┬───────────────────────────────────────────────────────────────╮
│ Cluster      │ Job ID / Status                                               │
├──────────────┼───────────────────────────────────────────────────────────────┤
│ mila         │ 12345                                                         │
├──────────────┼───────────────────────────────────────────────────────────────┤
│ narval       │ sbatch: error: Batch job submission failed: Unspecified error │
╰──────────────┴───────────────────────────────────────────────────────────────╯

To watch the job: ssh mila sacct -j 12345

Copilot AI changed the title from "feat(submit): display cluv submit first results in a rich table" to "feat(submit): display cluv submit first commands and results in rich tables" May 14, 2026
lebrice marked this pull request as ready for review May 14, 2026 21:24
Contributor

lebrice commented May 14, 2026

@copilot Fix the pre-commit issues, please and thank you

Contributor Author

Copilot AI commented May 14, 2026

@copilot Fix the pre-commit issues, please and thank you

Fixed in fea55e2 — ruff flagged an unused from rich.text import Text import in tests/test_submit.py (left over after moving to module-level imports). Removed it.



Development

Successfully merging this pull request may close these issues.

Improvement: cluv submit first output could be made prettier using tables

2 participants