- 
                Notifications
    
You must be signed in to change notification settings  - Fork 2k
 
scheduler: perform feasibility checks for system canaries before computing placements #26953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
          
     Open
      
      
            pkazmierczak
  wants to merge
  21
  commits into
  main
  
    
      
        
          
  
    
      Choose a base branch
      
     
    
      
        
      
      
        
          
          
        
        
          
            
              
              
              
  
           
        
        
          
            
              
              
           
        
       
     
  
        
          
            
          
            
          
        
       
    
      
from
f-system-deployments-canaries-evict-refactor
  
      
      
   
  
    
  
  
  
 
  
      
    base: main
Could not load branches
            
              
  
    Branch not found: {{ refName }}
  
            
                
      Loading
              
            Could not load tags
            
            
              Nothing to show
            
              
  
            
                
      Loading
              
            Are you sure you want to change the base?
            Some commits from the old base branch may be removed from the timeline,
            and old review comments may become outdated.
          
          
      
        
          +1,566
        
        
          −905
        
        
          
        
      
    
  
Conversation
  
    
      This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
      Learn more about bidirectional Unicode characters
    
  
  
    
    
              
                    tgross
  
              
              reviewed
              
                  
                    Oct 17, 2025 
                  
              
              
            
            
b057f84    to
    88f1fb4      
    Compare
  
    
              
                    tgross
  
              
              reviewed
              
                  
                    Oct 23, 2025 
                  
              
              
            
            
              
                    tgross
  
              
              reviewed
              
                  
                    Oct 23, 2025 
                  
              
              
            
            
838bcd8    to
    e1234c1      
    Compare
  
    e1234c1    to
    a6cd581      
    Compare
  
    
    
  tgross 
      added a commit
      that referenced
      this pull request
    
      Oct 28, 2025 
    
    
      
  
    
      
    
  
Two groups on the same job cannot both have a static port assignment, but this ends up getting configured in the update block test for system deployments. This test setup bug has complicated landing the fix in #26953.
    
  tgross 
      added a commit
      that referenced
      this pull request
    
      Oct 28, 2025 
    
    
      
  
    
      
    
  
Two groups on the same job cannot both have a static port assignment, but this ends up getting configured in the update block test for system deployments. This test setup bug has complicated landing the fix in #26953.
697d0ff    to
    0c70ac8      
    Compare
  
    54b7110    to
    e51cce3      
    Compare
  
    
              
                    jrasell
  
              
              reviewed
              
                  
                    Oct 31, 2025 
                  
              
              
            
            
… state Previously we copied the behavior found in the generic scheduler, where we rely on reconciler results to decide if there's enough placements made. In the system scheduler we always know exactly how many placements there should be based on the DesiredTotal field of the deployment state, so a better way to check completeness of the deployment is to simplify it and base on dstate alone.
…iler In contrast to the cluster reconciler, in the node reconciler we go node-by-node as opposed to alloc-by-alloc, and thus the state of the reconciler has to be managed differently. If we override old deployments on every run of `cancelUnneededDeployments`, we end up with unnecessarily created deployments for job version that already had them.
  
    Sign up for free
    to join this conversation on GitHub.
    Already have an account?
    Sign in to comment
  
      
  Add this suggestion to a batch that can be applied as a single commit.
  This suggestion is invalid because no changes were made to the code.
  Suggestions cannot be applied while the pull request is closed.
  Suggestions cannot be applied while viewing a subset of changes.
  Only one suggestion per line can be applied in a batch.
  Add this suggestion to a batch that can be applied as a single commit.
  Applying suggestions on deleted lines is not supported.
  You must change the existing code in this line in order to create a valid suggestion.
  Outdated suggestions cannot be applied.
  This suggestion has been applied or marked resolved.
  Suggestions cannot be applied from pending reviews.
  Suggestions cannot be applied on multi-line comments.
  Suggestions cannot be applied while the pull request is queued to merge.
  Suggestion cannot be applied right now. Please check back later.
  
    
  
    
Canaries for system jobs are placed on a
tg.update.canarypercent of eligible nodes. Some of these nodes may not be feasible, and until now we removed infeasible nodes during placement computation. However, if it happens to be that the first eligible node we picked to place a canary on is infeasible, this will lead to the scheduler halting deployment.The solution presented here simplifies canary deployments: initially, system jobs that use canary updates get allocations placed on all eligible nodes, but before we start computing actual placements, a method called
evictUnneededCanariesis called (much likeevictAndPlaceis for honoring MaxParallel) which removes those canary placements that are not needed. We also change the behavior ofcomputePlacementswhich no longer performs node feasibility checks, as these are performed earlier for every allocation and node. This way we get accurate counts of all feasible nodes that let us correctly set deployment state fields.Fixes: #26885
Fixes: #26886