Symbol list and storage lock improvements #2359
Conversation
def compact_symbol_list_worker(real_s3_storage_factory, lib_name, run_time):
    # Decrease the lock wait times to make lock failures more likely
What do you think about attaching a FailureSimulator to the real_s3_storage and introducing slowdowns instead of having to decrease the timeout? I'm 80% sure that with such a low WaitMs almost none of the storage locks will succeed, because they'll take more than 1.5ms.
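To make the suggestion concrete, here is a minimal sketch of injecting write slowdowns from the Python test side. It deliberately does not use ArcticDB's StorageFailureSimulator bindings (which may expose this more directly); it just wraps the library's write call with a random delay, and with_random_write_delay and its parameters are hypothetical test helpers.

import random
import time

def with_random_write_delay(lib, min_s=0.001, max_s=0.05):
    # Hypothetical helper: wrap lib.write with a random sleep so storage writes
    # look slow, instead of shrinking the lock's WaitMs to force lock failures.
    original_write = lib.write

    def slow_write(*args, **kwargs):
        time.sleep(random.uniform(min_s, max_s))  # simulated storage latency
        return original_write(*args, **kwargs)

    lib.write = slow_write
    return lib

Attaching the slowdown at the storage layer via the FailureSimulator would have the same effect but would also slow down the lock's own ref-key writes, which is what makes lock contention more likely without touching the timeout.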
    while time.time() - start_time < run_time:
        id = cnt * step_symbol_id + first_symbol_id
        lib.write(f"sym_{id}", df)
        cnt += 1
Should we introduce some (maybe optional) sleeps? Doing writes and compactions in a very hot loop feels like a very high-stress test. I think that without sleeps the different threads might be less interleaved than we expect.
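For illustration, the writer loop above could take an optional jitter parameter. The sleep_range parameter and its values are made up here; only the loop body comes from the diff.

import random
import time

def write_symbols(lib, df, first_symbol_id, step_symbol_id, start_time, run_time, sleep_range=None):
    # Same hot loop as in the diff, with an optional randomized pause between
    # writes so writer and compactor processes interleave more realistically.
    cnt = 0
    while time.time() - start_time < run_time:
        id = cnt * step_symbol_id + first_symbol_id
        lib.write(f"sym_{id}", df)
        cnt += 1
        if sleep_range is not None:
            time.sleep(random.uniform(*sleep_range))  # e.g. sleep_range=(0.001, 0.01)
    return cnt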
I'll just remove the loop and use the FailureSimulator with some probability, so each process writes/compacts only once; that is more deterministic.
    while not results_queue.empty():
        first_id, cnt = results_queue.get()
        expected_symbol_list.update([f"sym_{first_id + i*num_writers}" for i in range(cnt)])
It would be nice to verify the state of the symbol list entries after all threads are finished, e.g. that there is just one compacted symbol list entry. That would mean no racing compactions happened and at least one compaction succeeded (which I think is not the case right now, because of the previous comment about a too low WaitMs).
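A sketch of what that post-run check could look like, assuming the internal library tool is reachable via lib._nvs.library_tool(), that KeyType is importable from arcticdb_ext.storage, and that compacted symbol list entries use the "__symbols__" stream id. All of these are assumptions about ArcticDB internals rather than a documented API.

from arcticdb_ext.storage import KeyType  # assumption: KeyType enum lives here

def assert_single_compacted_symbol_list_entry(lib):
    # Assumption: library_tool() and find_keys() exist on the underlying version
    # store, and the compacted entry is the SYMBOL_LIST key with id "__symbols__".
    lib_tool = lib._nvs.library_tool()
    symbol_list_keys = lib_tool.find_keys(KeyType.SYMBOL_LIST)
    compacted = [k for k in symbol_list_keys if str(k.id) == "__symbols__"]
    assert len(compacted) == 1, f"Expected exactly one compacted entry, found {len(compacted)}"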
-do_wait:
-    while (ref_key_exists(store)) {
+    while (!try_acquire_lock(store)) {
We have introduced a new failure mode which can be caused by slow writes. We've seen writes getting very slow for prolonged intervals of time on VAST. I'm worried that the default timeout_ms = std::nullopt can cause us to loop indefinitely without any indication as to what's going on.
I think we should do one of the following:
- Not allow a nullopt timeout
- Introduce logging inside the loop saying "Failed to acquire lock, will attempt again in X ms"
- Promote the debug logging due to a long write to at least an info log
ARCTICDB_DEBUG(log::lock(), "Waited for {} ms, thread id: {}", lock_sleep_ms, std::this_thread::get_id());
auto read_ts = read_timestamp(store);
auto duration = ClockType::coarse_nanos_since_epoch() - start;
auto duration_in_ms = duration / ONE_MILLISECOND;
Nit: I think you'll need a [[maybe_unused]] here, because the debug log macro can be compiled out by the preprocessor, which hides the variable's usage.
@@ -184,6 +194,7 @@ class StorageLock {
    timestamp create_ref_key(const std::shared_ptr<Store>& store) {
        auto ts = ClockType::nanos_since_epoch();
        StorageFailureSimulator::instance()->go(FailureType::WRITE_LOCK);
I don't think that having a separate failure type for the lock is useful. I think that:
- Allowing slowdowns for other write operations is useful
- Allowing regular write exception simulations could also be useful for the storage lock
- Calling FailureSimulator->go on a higher level than the other calls can be confusing
I don't feel too strongly about this, so I'm happy to leave it as is if it is a lot of work to refactor.