
Add total memory to job info 878 #879


Open · wants to merge 16 commits into base: master

Conversation


@Oglopf (Contributor) commented Apr 18, 2025

Addresses #878.

  • Adds a top-level method to OodCore::Job::Info so that memory is passed back consistently, in bytes
  • Attempts to prevent issues with schedulers other than Slurm if any fields are missing or nil
  • Handles jobs submitted with either --mem or --mem-per-cpu
  • Computes the per-node or per-cpu total memory in use (a rough sketch follows this list)
  • Still need to write tests in slurm_spec.rb as well, to ensure the adapter is correct.
  • I had trouble getting these tests to run today; I need to work more with what I have in info_spec.rb to ensure correctness.
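
For discussion, a rough sketch of the computation. This is not the exact code in this PR; the helper name and keyword arguments are hypothetical:

# Convert a Slurm-style MIN_MEMORY value given in MB to bytes, then scale it
# by CPUs or by nodes depending on whether --mem-per-cpu or --mem was used.
def total_memory_bytes(min_memory_mb:, memory_per:, cpus: nil, nodes: nil)
  return nil if min_memory_mb.nil?

  bytes = min_memory_mb.to_i * 1024 * 1024
  case memory_per
  when :cpu  then cpus  && bytes * cpus.to_i
  when :node then nodes && bytes * nodes.to_i
  else bytes
  end
end

total_memory_bytes(min_memory_mb: 4000, memory_per: :cpu, cpus: 8)
# => 33554432000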


@Oglopf (Contributor, Author) commented May 5, 2025

This is so strange. Locally I have to add OodCore::Job::Adapters::Slurm::Batch::Error to clear the last failure. It's strange that it fails locally yet passes in CI. Strange isn't the right word, though; annoying is better.

And yet, in CI I get different failures. So we have divergence in the tests between local and remote, which is horrible from a code-velocity perspective.

Here's the other issue: rspec is radically over-engineered. The mocking we do here is outdated, with various ways of mocking state and all these intricate objects we have to mock and call into, and it isn't even consistent across tests. It's bad from a developer-ergonomics perspective, really bad. Why do we, the devs, allow this to continue? I see ticket #844 about the migration, where a lot of bad info is being suggested as to why we do this. I for one would love to see the tests ported. We destroy this project by burning devs out with these kinds of bizarre, outdated techniques and silly arguments littered with false premises, except maybe one valid point about the e2e tests.

Anyway, when I get the tests to somehow work, I'll finish this extremely simple ticket, which has now taken weeks because of testing.

@Oglopf Oglopf marked this pull request as ready for review May 6, 2025 17:34
Comment on lines +911 to +915
# Slurm uses per CPU memory if --mem-per-cpu with 'Mc' output
# or uses per node if --mem with 'M' output
if v[:min_memory].end_with?('c')
  # memory per CPU
  memory_per = :cpu
Contributor


Where did you get this from? I'm unable to replicate this or see it in any job at OSC.
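
A minimal sketch of how that branch could read in full, assuming min_memory comes back as a string such as '4000Mc' or '4000M'; the per-node else branch is my assumption, not necessarily what this PR does:

if v[:min_memory].to_s.end_with?('c')
  # trailing 'c': memory was requested per CPU (--mem-per-cpu)
  memory_per = :cpu
  min_memory = v[:min_memory].to_s.chomp('c')
else
  # otherwise treat it as memory per node (--mem)
  memory_per = :node
  min_memory = v[:min_memory].to_s
end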

@@ -116,6 +121,7 @@ def initialize(id:, status:, allocated_nodes: [], submit_host: nil,
@status = job_array_aggregate_status unless @tasks.empty?

@native = native
@total_memory = total_memory
Contributor


This should likely keep the other behavior where we check for nil and cast with to_i when it's not nil, like gpus below it.
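
Something along these lines would mirror the gpus handling, i.e. only cast when the value is present (a sketch, assuming the initializer keyword is named total_memory):

# cast to an integer only when a value was provided; otherwise stay nil
@total_memory = total_memory && total_memory.to_i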
