
Commit fc9514f

analyze failures for storage locks (#2379)
#### Reference Issues/PRs

#### What does this implement or fix?

Reproduced problem: https://github.com/man-group/ArcticDB/actions/runs/15417513896/job/43384819172 shows that not all processes can be spawned because of lack of memory.

To help the analysis of flaky tests, the workflow is also extended so that fully custom pytest commands can be run, like this one, which runs a single test on all VMs until an error, a timeout (6 hrs), or 100 successful repetitions:

```
pytest -n auto -v --count=100 -x python/tests/integration/arcticdb/test_storage_lock.py
```

Try 1: determine the number of actually running processes using the `is_alive()` method (a minimal sketch of this idea follows this message).

Fix: ce50eb2
Log: https://github.com/man-group/ArcticDB/actions/runs/15419424266/job/43391336706

Outcome: there are many errors on Windows and a few on Linux, so perhaps this is not the optimal approach. Note that on Windows this approach is a huge disaster:

```
DataFrame.iloc[:, 0] (column name="col") values are different (100.0 %)
[index]: [0]
[left]: [63]
[right]: [100]
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 2 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!
```

Try 2: attempt to fix this by creating a unique symbol after the lock-protected common counter is incremented and before the lock is released. The fix relies on the assumption that as many unique symbols should be created as there are actually running, live processes; thus in the perfect case the common counter should always be equal to, and never lower than, the number of created symbols.

Fix: c85adf6
Log: https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439581916

Analysis: we see many errors on Windows like:

```
https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439415186
The hosted runner lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439450923
Process completed with exit code -1073741571.
The exit code -1073741571 in pytest usually indicates a stack overflow or memory exhaustion issue, particularly on Windows systems. It corresponds to the Windows error code 0xC00000FD, which means the process ran out of stack space.

https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439404387
Error: Process completed with exit code 127.

https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439551901
Out of memory
```

This means 100 processes is too much for Windows; we need to cut the number down so that no host problems are experienced.

Try 3: on Windows, cap the number of processes at 30 instead of 100.

Log: https://github.com/man-group/ArcticDB/actions/runs/15444657461/job/43471434052

Analysis: no errors caused by lack of memory are seen. However, there are still failures on Windows that cannot be explained: see https://github.com/man-group/ArcticDB/actions/runs/15444657461/job/43472314667. Note that all failures are off by 1 (expected vs. actual counter). This means the logic is working, but it suggests there could be some other Windows-related issue, possibly a bug:

```
2025-06-04 14:49:28,647 - tests.integration.arcticdb.test_storage_lock - INFO - Process 3380: start read
2025-06-04 14:49:31,226 - tests.integration.arcticdb.test_storage_lock - INFO - Process 3380: previous value 2
20250604 14:49:31.368214 8236 E arcticdb | Unexpectedly lost the lock in heartbeating thread. Maybe lock timeout is too small.
E20250604 14:49:31.368520 8236 FunctionScheduler.cpp:507] Error running the scheduled function <Extend lock>: struct arcticdb::lock::LostReliableLock: Unknown exception
2025-06-04 14:49:31,445 - tests.integration.arcticdb.test_storage_lock - INFO - Process 7544: start read
2025-06-04 14:49:31,773 - tests.integration.arcticdb.test_storage_lock - INFO - Process 7544: previous value 2
2025-06-04 14:49:36,074 - tests.integration.arcticdb.test_storage_lock - INFO - Process 3380: incrementing and saving value 3
2025-06-04 14:49:36,105 - tests.integration.arcticdb.test_storage_lock - INFO - Process 7544: incrementing and saving value 3
```

Try 4: after discussion with Ivo, we determined this is not a bug, but we will increase the default storage lock timeout to 20 seconds (even that is too low for practical usage scenarios, but it is OK for tests).

Log: https://github.com/man-group/ArcticDB/actions/runs/15464091500

Analysis: all tests successful; we have a winner!

#### Any other comments?

#### Checklist

<details>
  <summary>Checklist for code changes...</summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

</details>

---------

Co-authored-by: Georgi Rusev <Georgi Rusev>
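As an illustration of the "Try 1" idea above, here is a minimal, self-contained sketch of counting how many worker processes actually came alive via `multiprocessing.Process.is_alive()`; the worker function and the counts are hypothetical, not code from this commit:

```python
import time
from multiprocessing import Process

def worker():
    # Stand-in for the real test task; only here to keep the process alive briefly.
    time.sleep(1)

if __name__ == "__main__":
    processes = [Process(target=worker) for _ in range(100)]
    for p in processes:
        p.start()
    # Processes the host failed to spawn (e.g. out of memory) never report as
    # alive. Note that very fast workers may already have exited by now too,
    # which is one reason this counting approach proved unreliable in practice.
    alive = sum(1 for p in processes if p.is_alive())
    for p in processes:
        p.join()
    print(f"{alive} of {len(processes)} processes were observed alive")
```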
1 parent 7e76422 commit fc9514f

File tree

- .github/workflows/build.yml
- .github/workflows/build_steps.yml
- python/tests/integration/arcticdb/test_storage_lock.py

3 files changed: +41 additions, -9 deletions

.github/workflows/build.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -35,7 +35,7 @@ on:
         type: string
         default: arcticdb-dev-clang:latest
       pytest_args:
-        description: Rewrite what tests will run
+        description: Rewrite what tests will run or do your own pytest line if string starts with pytest ... (Example -- pytest -n auto -v --count=50 -x python/tests/compat)
         type: string
         default: ""
 run-name: Building ${{github.ref_name}} on ${{github.event_name}} by ${{github.actor}}
```

.github/workflows/build_steps.yml

Lines changed: 10 additions & 1 deletion

```diff
@@ -361,7 +361,16 @@ jobs:

       - name: Run test
         run: |
-          build_tooling/parallel_test.sh tests/${{matrix.type}}
+          if [[ "$(echo "$ARCTICDB_PYTEST_ARGS" | xargs)" == pytest* ]]; then
+            echo "Run custom pytest command"
+            python -m pip install pytest-repeat
+            python -m pip install setuptools
+            python -m pip install wheel
+            python setup.py protoc --build-lib python
+            eval "$ARCTICDB_PYTEST_ARGS"
+          else
+            build_tooling/parallel_test.sh tests/${{matrix.type}}
+          fi
         env:
           TEST_OUTPUT_DIR: ${{runner.temp}}
           # Use the Mongo created in the service container above to test against
```
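The shell snippet above gates on whether the trimmed `ARCTICDB_PYTEST_ARGS` value starts with `pytest`. For readers less familiar with the bash idioms (`xargs` for trimming, `== pytest*` for glob matching), here is the same dispatch rule restated as a Python sketch; the `tests/integration` fallback argument is illustrative and stands in for the real `tests/${{matrix.type}}` value:

```python
import os
import shlex
import subprocess

args = os.environ.get("ARCTICDB_PYTEST_ARGS", "").strip()  # roughly: echo ... | xargs
if args.startswith("pytest"):
    # Custom mode: run the user-supplied pytest command line verbatim.
    subprocess.run(shlex.split(args), check=True)
else:
    # Default mode: delegate to the repository's parallel test runner.
    subprocess.run(["build_tooling/parallel_test.sh", "tests/integration"], check=True)
```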

python/tests/integration/arcticdb/test_storage_lock.py

Lines changed: 30 additions & 7 deletions

```diff
@@ -1,36 +1,54 @@
+import os
 import pandas as pd
 import numpy as np
 import pytest
 import sys

+from arcticdb.util.utils import get_logger
 from arcticdb_ext.tools import ReliableStorageLock, ReliableStorageLockManager
-from tests.util.mark import REAL_S3_TESTS_MARK
+from tests.util.mark import REAL_S3_TESTS_MARK, WINDOWS

 import time

 from arcticdb.util.test import assert_frame_equal
 from multiprocessing import Process

+logger = get_logger()

 one_sec = 1_000_000_000

+symbol_prefix = "process_id_"
+
+max_processes = 30 if WINDOWS else 100  # Too many processes will trigger out of memory on Windows
+storage_lock_timeout_sec = 20 if WINDOWS else 10  # For Windows, choose a longer default storage lock timeout
+

 def slow_increment_task(real_storage_factory, lib_name, symbol, sleep_time):
     # We need to explicitly build the library object in each process, otherwise the s3 library doesn't get copied
     # properly between processes, and we get spurious `XAmzContentSHA256Mismatch` errors.
+    pid = os.getpid()
+    logger.info(f"Process {pid}: initiated")
     fixture = real_storage_factory.create_fixture()
     lib = fixture.create_arctic()[lib_name]
-    lock = ReliableStorageLock("test_lock", lib._nvs._library, 10 * one_sec)
+    lock = ReliableStorageLock("test_lock", lib._nvs._library, storage_lock_timeout_sec * one_sec)
     lock_manager = ReliableStorageLockManager()
     lock_manager.take_lock_guard(lock)
+    logger.info(f"Process {pid}: start read")
     df = lib.read(symbol).data
+    logger.info(f"Process {pid}: previous value {df['col'][0]}")
     df["col"][0] = df["col"][0] + 1
     time.sleep(sleep_time)
     lib.write(symbol, df)
+    logger.info(f"Process {pid}: incrementing and saving value {df['col'][0]}")
+    symbol_name = f"{symbol_prefix}{pid}"
+    lib.write(symbol_name, df)
+    logger.info(f"Process {pid}: wrote unique symbol {symbol_name}")
     lock_manager.free_lock_guard()
+    logger.info(f"Process {pid}: completed")

-
-@pytest.mark.parametrize("num_processes,max_sleep", [(100, 1), (5, 20)])
+# NOTE: If there is not enough memory, the number of actually spawned processes
+# will be lower. The test counts the processes that actually got executed.
+@pytest.mark.parametrize("num_processes,max_sleep", [(max_processes, 1), (5, 2 * storage_lock_timeout_sec)])
 @REAL_S3_TESTS_MARK
 @pytest.mark.storage
 def test_many_increments(real_storage_factory, lib_name, num_processes, max_sleep):
@@ -51,8 +69,13 @@ def test_many_increments(real_storage_factory, lib_name, num_processes, max_sleep):
     for p in processes:
         p.join()

+    symbols = lib.list_symbols(regex=f"{symbol_prefix}.*")
+    num_processes_succeeded = len(symbols)
+    logger.info(f"Total number of live processes: {num_processes_succeeded}")
+    logger.info(f"{symbols}")
+
     vit = lib.read(symbol)
     read_df = vit.data
-    expected_df = pd.DataFrame({"col": [num_processes]})
+    expected_df = pd.DataFrame({"col": [num_processes_succeeded]})
     assert_frame_equal(read_df, expected_df)
-    assert vit.version == num_processes
+    assert vit.version == num_processes_succeeded
```
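The updated test asserts that the shared counter equals the number of `process_id_*` marker symbols. The following self-contained toy model (plain `multiprocessing`, no ArcticDB; all names are illustrative) shows why that invariant holds: each worker increments the counter and leaves its unique marker while still holding the lock, so the two values stay in lockstep even if some requested processes never spawn:

```python
import os
from multiprocessing import Manager, Process

def worker(lock, state, markers):
    with lock:
        state["counter"] += 1                        # the lock-protected increment
        markers.append(f"process_id_{os.getpid()}")  # unique marker, still under the lock

if __name__ == "__main__":
    manager = Manager()
    lock = manager.Lock()
    state = manager.dict({"counter": 0})
    markers = manager.list()
    processes = [Process(target=worker, args=(lock, state, markers)) for _ in range(10)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    # The invariant the test relies on: counter == number of unique markers,
    # regardless of how many of the requested processes actually ran.
    assert state["counter"] == len(markers)
```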
