Write Bandwidth Project Meeting Notes
- Work continuing on Persistent ART
- Another ART implementation in HOPE if it's better/faster/more tested
 - One could maintain an ART of N second windows to re-use the existing Arena-based one.
 - One could maintain an arena for every N seconds, and then concurrently walk through and remove references into older arenas.
 - If one maintains version history in values rather than keys, then a single-version structure would work; atomic-op-heavy workloads would be the worst case for doing this, though (see the sketch after this list).
 - Entirely different strategy: throw more memory at MVCC buffer?
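
A minimal sketch of the "version history in values" idea above, using hypothetical standard-library types rather than any actual FDB structure: each key maps to its own (version, value) history, so the outer index only needs one entry per key and could be a non-MVCC structure such as an ART.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Version = int64_t;

// Hypothetical types for illustration only: the version history lives inside
// the value, so the outer map stays single-version and could be swapped for a
// non-MVCC structure such as an ART.
struct VersionedValue {
    std::vector<std::pair<Version, std::string>> history; // sorted by version

    // Newest value visible at readVersion, if any.
    std::optional<std::string> at(Version readVersion) const {
        std::optional<std::string> result;
        for (const auto& [v, val] : history) {
            if (v > readVersion)
                break;
            result = val;
        }
        return result;
    }
};

using SingleVersionIndex = std::map<std::string, VersionedValue>;

void write(SingleVersionIndex& index, const std::string& key, Version v, std::string value) {
    index[key].history.emplace_back(v, std::move(value));
}
```

As noted above, atomic-op-heavy workloads are the worst case here, since every operation appends another entry to the same key's history.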
 
 - Rusty's tree changes likely to go in soon
 - Rusty comparing memcpy implementations, to hopefully get an overall speedup
- It might be worth dusting off the "Align arena memory" PR
 
 - Trevor has another PR for reducing copies with Futures/Promises
- And has more changes planned for reducing copies in the proxy
 
 
- Allow moving out of futures and actor compiler changes to reduce copies
- Avoiding unnecessary copies from movable futures (see the sketch after this list)
 - R-value references are complicated
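
A toy illustration of why moving out of futures avoids copies; this ToyFuture is hypothetical and is not flow's Future<T> API. The r-value-qualified accessor lets a caller that no longer needs the future steal the stored value instead of copying it.

```cpp
#include <string>
#include <utility>

// Toy future-like holder, NOT flow's Future<T>; it only illustrates the
// copy-vs-move distinction discussed above.
template <class T>
class ToyFuture {
    T value; // assume the value is already ready, for simplicity
public:
    explicit ToyFuture(T v) : value(std::move(v)) {}

    // Called on an l-value future: the caller may still need the future,
    // so the stored value has to be copied out.
    const T& get() const& { return value; }

    // Called on an r-value future: nobody can observe the future afterwards,
    // so the stored value can be moved out without a copy.
    T&& get() && { return std::move(value); }
};

int main() {
    ToyFuture<std::string> f(std::string(1 << 20, 'x'));
    std::string copied = f.get();           // copies the 1 MiB payload
    std::string moved = std::move(f).get(); // steals the buffer instead
    (void)copied;
    (void)moved;
}
```

This is also where r-value references get complicated: once the value has been moved out, the future must not be read again.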
 
 - Daniel ran Rusty's ART benchmark
- The benchmark used random strings, which biased the results towards making ART look good
 - Result was 7x faster, which looks promising
 - ART code itself relied on the arena for memory safety
 - Struggled to get refcounting on children working right
 
 - Daniel re-ran the benchmarks and found that his target bandwidth is lower than expected.
- Will look into RocksDB PR after spending a bit of time on ART
 - Better storage server write throughput is needed, something closer to Memory than Redwood
 - We should check with Markus, who mentioned that someone on the RocksDB side might have already looked into this
 - There are still concerns about RocksDB correctness, and FDB simulation testing wouldn't help find such issues, since RocksDB would be running on its own threads outside of FDB's control
 
 
- "Avoid unnecessary copies in PromiseStream" (#2915) landed, and mostly resulted in write bandwidth improvements
 - Investigating doing more similar changes
- Modifying actor compiler to get perfect forwarding in more places #2928 (see the forwarding sketch after this list)
 - Potentially creating a MovableFuture class, so that one can move out of futures.
 - A longer discussion occurred on design choices around futures, actor compiler changes, and resumable functions
 - Eliding copies from reply generation could yield similar improvements.
 - Using a better memcpy implementation resulted in a 5%-10% perf increase.
- Rusty to look into this after current work is done. Ask Kao for brain dump.
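
A hedged sketch of the perfect-forwarding idea behind #2928, using a hypothetical queue rather than the real PromiseStream or actor compiler output: a single templated send() copies l-value arguments and moves r-value arguments, with no intermediate copies either way.

```cpp
#include <deque>
#include <string>
#include <utility>

// Hypothetical queue standing in for a PromiseStream-like sender.
template <class T>
class ToyStream {
    std::deque<T> queue;
public:
    // Forwarding reference: send(lvalue) copies, send(rvalue) moves,
    // with no extra intermediate copy either way.
    template <class U>
    void send(U&& value) {
        queue.emplace_back(std::forward<U>(value));
    }
};

int main() {
    ToyStream<std::string> s;
    std::string msg(1 << 20, 'x');
    s.send(msg);            // l-value: one copy into the queue
    s.send(std::move(msg)); // r-value: moved, no payload copy
}
```

The MovableFuture idea mentioned above would cover the receiving side; this sketch only covers sending.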
 
 - make-linux-fast-again.com lists kernel flags to disable security mitigations when comparing performance across versions
 
 
- Storage server profiling and optimizations
- Neither Daniel nor Rusty has found low-hanging fruit for further optimization
 - Daniel to run and profile a test with Rusty's three way comparison merged
 - 5% of CPU time spent allocating std::vector, but unsure if there's a good way to reduce that
 
 - Daniel seeing stalls during performance tests
- Are there recoveries? OOMs? Seems not.
 - Likely Ratekeeper?
 - Alex to connect with Segment folk about getting FoundationDB metrics into Datadog. (Sent!)
 
 - P-Tree improvements/rewrites
- Daniel is chatting with Steve about whether there's a better data structure to use than the existing P-Tree
 - Neelam's previous investigation suggested that there are no low-hanging optimizations to make on the P-Tree itself
 - Persistent Adaptive Radix Tree might be a good candidate
 - (Persistent in the versioned sense, not persistent as in on-disk; see the path-copying sketch after this list.)
 - HOPE would maybe be useful for reducing the in-memory size of data?
- Rusty to bug Pavlo about releasing the code
 - Update: HOPE source was released
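
A purely illustrative sketch of "persistent in the versioned sense": path copying in a toy binary tree (the real P-Tree and ART node layouts differ). An insert copies only the root-to-leaf path and shares every other subtree, so older roots keep serving reads at older versions.

```cpp
#include <memory>
#include <string>

// Illustrative path-copying BST node; real P-Tree/ART nodes differ.
struct Node {
    std::string key;
    std::shared_ptr<const Node> left, right;
};

using NodePtr = std::shared_ptr<const Node>;

// Returns a new root for the new version; the old root remains valid and
// shares all untouched subtrees with the new one.
NodePtr insert(const NodePtr& root, const std::string& key) {
    if (!root)
        return std::make_shared<const Node>(Node{key, nullptr, nullptr});
    if (key < root->key)
        return std::make_shared<const Node>(Node{root->key, insert(root->left, key), root->right});
    if (root->key < key)
        return std::make_shared<const Node>(Node{root->key, root->left, insert(root->right, key)});
    return root; // key already present
}

int main() {
    NodePtr v1 = insert(nullptr, "b");
    NodePtr v2 = insert(v1, "a"); // v1 is still a complete, readable tree
    (void)v2;
}
```

Each retained version costs only its copied path, which is why this style of structure is a candidate for an MVCC buffer that must answer reads at several recent versions.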
 
 
 - A new Slack channel, #write-throughput-discuss, was made for this project
 - Proxy optimizations
- 550 microseconds spent in batching
 - Via tweaking knobs, this was reduced to 200 microseconds
 - In the process of this work, found "Avoid unnecessary copies in PromiseStream" (#2915), which might help in more heavily loaded cases.
 
 
- Storage server profiles show 25% CPU time in operator<
- Rusty's approach: reduce the number of calls to operator< made by lower_bound and upper_bound
 - Daniel's approach: pass StringRef by value in comparison operators (see the sketch after this list)
 - Rusty has a larger change in the works to move from operator< to operator<=> for more gains
 - Taking an ssd profile again would be good?
 - Try tweaking knobs to lower the number of versions a storage server keeps in memory to see what effect that has
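
A hedged sketch of the pass-by-value idea, using a simplified stand-in rather than the real StringRef from flow: the type is just a pointer plus a length, so passing it by value keeps both fields in registers inside the hot lower_bound/upper_bound loops, and a single three-way compare() can back operator< (the follow-up change mentioned above targets operator<=>).

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Simplified stand-in for flow's StringRef: a non-owning pointer + length.
struct StringRefLite {
    const uint8_t* data;
    int length;

    // Three-way compare usable by both operator< and operator==, in the
    // spirit of consolidating comparisons behind one call.
    int compare(StringRefLite other) const {
        int c = std::memcmp(data, other.data, std::min(length, other.length));
        if (c != 0)
            return c;
        return length - other.length;
    }
};

// Pass-by-value: two machine words in registers, no pointer chasing to reach
// the fields on every comparison inside lower_bound/upper_bound.
inline bool operator<(StringRefLite a, StringRefLite b) {
    return a.compare(b) < 0;
}

int main() {
    const uint8_t x[] = {'a', 'b', 'c'};
    const uint8_t y[] = {'a', 'b', 'd'};
    StringRefLite a{x, 3}, b{y, 3};
    return (a < b) ? 0 : 1;
}
```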
 
 - Proxy CPU
- Used a debug transaction to build up stats on where time goes during commits
 - It looked like more time is spent batching transactions on the proxy than in the TLogs
 - Exploring ways to cheat the commit path, and drop strict serializability if desired
 - Policy engine optimizations might have a large impact on proxy cpu time
 
 - Will continue adding more debug transactions to get better pipeline visualization
 
- Rusty's updates
- Initial network loop optimizations yielded a 1%-2% speedup
 - Next focus will be on storage server optimizations
 
 - Daniel's benchmarking run results
- Tracing of DebugCommit doesn't seem to show large pipelining waits
 - AWS latencies seem to be ~0.1ms within an AZ, and ~0.3ms and ~0.7ms across different AZs
 - Adding more clients now shows storage server saturation
 - Will continue to run more benchmarks, take and forward profiles to others to examine more closely
 
 
- Pipelining
- With Chrome Tracing, we now have some tools we can use to investigate commit pipelining.
 - How do we identify issues? What do we think could be issues? What tests should we be running to find issues?
 - Setting \xff\x02/timeKeeper/disable will disable timekeeper, which is the every-10-seconds debug transaction (see the sketch after this list)
 - Trevor is doing work in this area also. Will invite him to future meetings.
- Investigating instances where more time is being spent in the proxy than waiting for the tlog fsync.
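
A hedged, flow-style sketch of writing that key from a client, assuming the NativeAPI Transaction interface of the time (setOption/ACCESS_SYSTEM_KEYS, set, commit, onError); treat it as pseudocode rather than a verified snippet.

```cpp
// Assumed includes/interfaces; compile as a .actor.cpp inside the FDB tree.
#include "fdbclient/NativeAPI.actor.h"
#include "flow/actorcompiler.h" // must be the last include

ACTOR Future<Void> disableTimeKeeper(Database cx) {
	state Transaction tr(cx);
	loop {
		try {
			tr.setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS);
			// Writing this system key turns off the periodic (~10s)
			// timekeeper debug transaction mentioned above.
			tr.set(LiteralStringRef("\xff\x02/timeKeeper/disable"), LiteralStringRef(""));
			wait(tr.commit());
			return Void();
		} catch (Error& e) {
			wait(tr.onError(e));
		}
	}
}
```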
 
 - Resolvers cannot just be turned off for conflict-free workloads due to their involvement in the \xff system keyspace
 - Other improvements
- Rusty's event loop 2.0?
 
 
To do for next week:
- The suspicion is that there should be one proxy that is slower than the others.
 - Confirm from the trace that resolvers are waiting on pipelining, and preferably that one proxy is indeed slower
 - Look into or add proxy batching to traces
 - If there is indeed one slow proxy, then work on single proxy CPU usage and profiling
 - Sync with Trevor on his current state of proxy profiling