Add chaos to S3BulkDumping test. #12522
Conversation
Force-pushed from 101e6d1 to 23e94ac.
Force-pushed from 2bb7374 to 06ee8f2.
I ran the new BulkLoadingS3WithChaos test 100k times on joshua. And here is the general 100k joshua run.

Looking at the k8s fail, it seems like all the tests claim to have passed.
In general, it looks good to me! Would you add more details about the deadlock issue in the comments? Thanks!
@kakaiu The hang was like this (not sure where to put it in code...): FetchKeys operations got stuck during bulk load with DataDistributionQueueSize at 1, and ConsistencyCheck timed out with DataDistributionQueueSize="1" repeatedly. It happened during medium and heavy chaos. bulkLoadDownloadTaskFileSets() skipped ranges without data files entirely, so no local fileSet entry was created. In the code before the change (quoted below), we just moved on to the next item, leaving no accounting for the empty range.
```cpp
if (!range->value().isValid()) {
	continue;
}
// Skip empty ranges (no data file to load)
```
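For illustration, a hypothetical sketch of the accounting the fix adds: instead of silently skipping a noDataFile range, record an empty placeholder entry so downstream bookkeeping still covers the range. The accessor and container names (hasDataFile(), localFileSets, BulkLoadFileSet) are assumptions for the sketch, not necessarily the PR's actual code:

```cpp
if (!range->value().isValid()) {
	continue;
}
// Skip empty ranges (no data file to load), but leave an empty fileSet
// entry behind so the fetchKeys accounting still sees the range.
if (!range->value().hasDataFile()) { // assumed accessor
	localFileSets[range->range()] = BulkLoadFileSet(); // assumed container; empty entry
	continue;
}
```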
Can you help check whether DD skips noDataFile ranges when creating the bulkload tasks from manifests? Thanks! Ideally, DD should not schedule a noDataFile range as a bulkload task. My impression is that it does not, but the DD code may change over time and break this invariant. Maybe supporting customized range loading caused the problem. Can you help confirm? Thanks!
Nevertheless, this protection is great! Thanks!
I took a look at where this code was introduced:

commit 0e736c6
Author: Zhe Wang [email protected]
Date:   Mon Mar 17 11:45:15 2025 -0700

    Allow One BulkloadTask Do Multiple Manifests (#12036)

As best as I can tell, adding an empty range was possible before that change, so I guess even before customized range loading. What do you think should be done here? Should we add it as an invariant -- no entry unless there is data? Would that break us elsewhere? Could we do it as a follow-on?
Let's make it an invariant. Do a Sev30 at the storage server side if the invariant is broken, and do some checking to enforce the invariant at the DD side. We can do this in the next PR. Thanks!
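A minimal sketch of the storage-server-side check, assuming a fileSet handle and range are in scope; the event name and accessor are made up for the sketch. Sev30 corresponds to SevWarnAlways in FoundationDB's trace-severity scheme:

```cpp
// Hypothetical invariant check: a bulkload task should never cover a range
// with no data file, since DD should not have scheduled one.
// Sev30 == SevWarnAlways.
if (!fileSet.hasDataFile()) { // assumed accessor
	TraceEvent(SevWarnAlways, "SSBulkLoadTaskForEmptyRange", data->thisServerID) // assumed event name
	    .detail("Range", range);
}
```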
        
          
fdbserver/QuietDatabase.actor.cpp (Outdated)
```diff
-                              int64_t maxVersionOffset = 1e6) {
+                              int64_t maxVersionOffset = 1e6,
+                              double quiescentWaitTimeout = 0) {
 	state QuietDatabaseChecker checker(isGeneralBuggifyEnabled() ? 4000.0 : 1500.0);
```
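For context, a hedged sketch of how the new parameter could feed the checker; the zero default is visible in the diff, but the exact threading below is an assumption:

```cpp
// Hypothetical: a positive quiescentWaitTimeout overrides the hardcoded
// bound; 0 keeps the existing buggify-dependent default.
double bound = quiescentWaitTimeout > 0 ? quiescentWaitTimeout
                                        : (isGeneralBuggifyEnabled() ? 4000.0 : 1500.0);
state QuietDatabaseChecker checker(bound);
```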
Why do we want to add this quiescentWaitTimeout? Thanks!
It was hardcoded. I wanted to make it configurable per test. We were timing out in the consistency check: the s3 tests with chaos -- especially medium and heavy -- take longer. The s3 stuff was subsequently tamed some with fewer retries and less time between them. I might be able to make it work inside the hardcoded values? (Follow-on?)
I see. The definition of QuietDatabaseChecker indicates the input is maxDDRunTime. If DD times out, there is a ddGotStuck assertion failure. So it is probably better to rename quiescentWaitTimeout to maxDDRunTime.
The maxDDRunTime is different from the quiescentWaitTimeout option of the consistency check workload.
Hmm... you are right. Let me change the parameter name.
        
          
fdbserver/storageserver.actor.cpp (Outdated)
```cpp
// This allows adding->start() to be called inline with CSK.
try {
	if (conductBulkLoad) {
		TraceEvent(SevDebug, "BulkLoadFetchKeysBeforeCoreStarted", data->thisServerID)
```
Probably add an "SS" prefix, which helps you quickly find which role triggered the event.
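E.g., the event quoted above would become (sketch):

```cpp
TraceEvent(SevDebug, "SSBulkLoadFetchKeysBeforeCoreStarted", data->thisServerID)
```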
Yes. Thank you (I had it on a few but not all)
The parameter name 'quiescentWaitTimeout' was misleading. It actually controls the maximum time the Data Distributor can run before being considered stuck (triggering the ddGotStuck assertion), not a general quiescence timeout. Renamed it throughout the codebase to make clear that the parameter is specifically for DD timeout checking. Also added the 'SS' prefix to the log tags in the storageserver.actor.cpp trace additions.
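In diff form, the rename described above, sketched against the hunk quoted earlier:

```diff
-                              double quiescentWaitTimeout = 0) {
+                              double maxDDRunTime = 0) {
```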
Based on #12515