
Commit fc9514f

analyze failures for storage locks (#2379)
#### Reference Issues/PRs

#### What does this implement or fix?

Reproduced problem: https://github.com/man-group/ArcticDB/actions/runs/15417513896/job/43384819172 shows that not all processes can be spawned because of lack of memory.

To help the analysis of flaky tests, the workflow is also extended so that fully custom pytest commands can be run, like this one, which runs a single test on all VMs until an error, a timeout (6 hrs), or 100 successful repetitions:

```
pytest -n auto -v --count=100 -x python/tests/integration/arcticdb/test_storage_lock.py
```

Try 1: determine the number of actually running processes using the `is_alive()` method (a minimal sketch of this idea follows this message).

Fix: ce50eb2
Log: https://github.com/man-group/ArcticDB/actions/runs/15419424266/job/43391336706

Outcome: there are many errors on Windows and a few on Linux, so perhaps this is not the optimal approach. Note that on Windows this approach is a huge disaster:

```
DataFrame.iloc[:, 0] (column name="col") values are different (100.0 %)
[index]: [0]
[left]: [63]
[right]: [100]
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 2 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!! xdist.dsession.Interrupted: stopping after 1 failures !!!!!!!!!!!!
```

Try 2: attempt to fix this by creating a unique symbol after the lock-protected common counter is incremented and before the lock is released. The fix relies on the assumption that as many unique symbols should be created as there are actually running, live processes; thus in the perfect case the common counter should always be equal to, and never lower than, the number of created symbols.

Fix: c85adf6
Log: https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439581916

Analysis: we see many errors on Windows like:

```
https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439415186
The hosted runner lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439450923
Process completed with exit code -1073741571.
The exit code -1073741571 in pytest usually indicates a stack overflow or memory exhaustion issue, particularly on Windows systems. It corresponds to the Windows error code 0xC00000FD, which means the process ran out of stack space.

https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439404387
Error: Process completed with exit code 127.

https://github.com/man-group/ArcticDB/actions/runs/15434747123/job/43439551901
Out of memory
```

This means 100 processes is too much for Windows; we need to cut the number down so that no host problems are experienced.

Try 3: on Windows, cap the number of processes at 30 instead of 100.

Log: https://github.com/man-group/ArcticDB/actions/runs/15444657461/job/43471434052

Analysis: no errors caused by lack of memory are seen. However, there are still failures on Windows that cannot be explained: see https://github.com/man-group/ArcticDB/actions/runs/15444657461/job/43472314667. Note that all failures are off by 1 (expected vs. actual counter). This means the logic is working, but it suggests there could be some other Windows-related issue, possibly a bug:

```
2025-06-04 14:49:28,647 - tests.integration.arcticdb.test_storage_lock - INFO - Process 3380: start read
2025-06-04 14:49:31,226 - tests.integration.arcticdb.test_storage_lock - INFO - Process 3380: previous value 2
20250604 14:49:31.368214 8236 E arcticdb | Unexpectedly lost the lock in heartbeating thread. Maybe lock timeout is too small.
E20250604 14:49:31.368520 8236 FunctionScheduler.cpp:507] Error running the scheduled function <Extend lock>: struct arcticdb::lock::LostReliableLock: Unknown exception
2025-06-04 14:49:31,445 - tests.integration.arcticdb.test_storage_lock - INFO - Process 7544: start read
2025-06-04 14:49:31,773 - tests.integration.arcticdb.test_storage_lock - INFO - Process 7544: previous value 2
2025-06-04 14:49:36,074 - tests.integration.arcticdb.test_storage_lock - INFO - Process 3380: incrementing and saving value 3
2025-06-04 14:49:36,105 - tests.integration.arcticdb.test_storage_lock - INFO - Process 7544: incrementing and saving value 3
```

Try 4: after discussion with Ivo, we determined this is not a bug, but we will increase the default storage lock timeout to 20 seconds (even that is too low for practical usage scenarios, but it is OK for tests).

Log: https://github.com/man-group/ArcticDB/actions/runs/15464091500

Analysis: all tests successful; we have a winner!

#### Any other comments?

#### Checklist

<details>
  <summary>Checklist for code changes...</summary>

- [ ] Have you updated the relevant docstrings, documentation and copyright notice?
- [ ] Is this contribution tested against [all ArcticDB's features](../docs/mkdocs/docs/technical/contributing.md)?
- [ ] Do all exceptions introduced raise appropriate [error messages](https://docs.arcticdb.io/error_messages/)?
- [ ] Are API changes highlighted in the PR description?
- [ ] Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

</details>

---------

Co-authored-by: Georgi Rusev <Georgi Rusev>
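As an illustration of the "Try 1" idea above, here is a minimal, self-contained sketch of counting how many worker processes actually came alive via `multiprocessing.Process.is_alive()`; the worker function and the counts are hypothetical, not code from this commit:

```python
import time
from multiprocessing import Process

def worker():
    # Stand-in for the real test task; only here to keep the process alive briefly.
    time.sleep(1)

if __name__ == "__main__":
    processes = [Process(target=worker) for _ in range(100)]
    for p in processes:
        p.start()
    # Processes the host failed to spawn (e.g. out of memory) never report as
    # alive. Note that very fast workers may already have exited by now too,
    # which is one reason this counting approach proved unreliable in practice.
    alive = sum(1 for p in processes if p.is_alive())
    for p in processes:
        p.join()
    print(f"{alive} of {len(processes)} processes were observed alive")
```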
1 parent 7e76422 commit fc9514f

File tree

- .github/workflows/build.yml
- .github/workflows/build_steps.yml
- python/tests/integration/arcticdb/test_storage_lock.py

3 files changed: +41 additions, -9 deletions

.github/workflows/build.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -35,7 +35,7 @@ on:
         type: string
         default: arcticdb-dev-clang:latest
       pytest_args:
-        description: Rewrite what tests will run
+        description: Rewrite what tests will run or do your own pytest line if string starts with pytest ... (Example -- pytest -n auto -v --count=50 -x python/tests/compat)
         type: string
         default: ""
 run-name: Building ${{github.ref_name}} on ${{github.event_name}} by ${{github.actor}}
```

.github/workflows/build_steps.yml

Lines changed: 10 additions & 1 deletion

```diff
@@ -361,7 +361,16 @@ jobs:

       - name: Run test
         run: |
-          build_tooling/parallel_test.sh tests/${{matrix.type}}
+          if [[ "$(echo "$ARCTICDB_PYTEST_ARGS" | xargs)" == pytest* ]]; then
+            echo "Run custom pytest command"
+            python -m pip install pytest-repeat
+            python -m pip install setuptools
+            python -m pip install wheel
+            python setup.py protoc --build-lib python
+            eval "$ARCTICDB_PYTEST_ARGS"
+          else
+            build_tooling/parallel_test.sh tests/${{matrix.type}}
+          fi
         env:
           TEST_OUTPUT_DIR: ${{runner.temp}}
           # Use the Mongo created in the service container above to test against
```
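The shell snippet above gates on whether the trimmed `ARCTICDB_PYTEST_ARGS` value starts with `pytest`. For readers less familiar with the bash idioms (`xargs` for trimming, `== pytest*` for glob matching), here is the same dispatch rule restated as a Python sketch; the `tests/integration` fallback argument is illustrative and stands in for the real `tests/${{matrix.type}}` value:

```python
import os
import shlex
import subprocess

args = os.environ.get("ARCTICDB_PYTEST_ARGS", "").strip()  # roughly: echo ... | xargs
if args.startswith("pytest"):
    # Custom mode: run the user-supplied pytest command line verbatim.
    subprocess.run(shlex.split(args), check=True)
else:
    # Default mode: delegate to the repository's parallel test runner.
    subprocess.run(["build_tooling/parallel_test.sh", "tests/integration"], check=True)
```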

python/tests/integration/arcticdb/test_storage_lock.py

Lines changed: 30 additions & 7 deletions

```diff
@@ -1,36 +1,54 @@
+import os
 import pandas as pd
 import numpy as np
 import pytest
 import sys

+from arcticdb.util.utils import get_logger
 from arcticdb_ext.tools import ReliableStorageLock, ReliableStorageLockManager
-from tests.util.mark import REAL_S3_TESTS_MARK
+from tests.util.mark import REAL_S3_TESTS_MARK, WINDOWS

 import time

 from arcticdb.util.test import assert_frame_equal
 from multiprocessing import Process

+logger = get_logger()

 one_sec = 1_000_000_000

+symbol_prefix = "process_id_"
+
+max_processes = 30 if WINDOWS else 100  # Too many processes will trigger out of memory on Windows
+storage_lock_timeout_sec = 20 if WINDOWS else 10  # For Windows, choose a longer default storage lock timeout
+

 def slow_increment_task(real_storage_factory, lib_name, symbol, sleep_time):
     # We need to explicitly build the library object in each process, otherwise the s3 library doesn't get copied
     # properly between processes, and we get spurious `XAmzContentSHA256Mismatch` errors.
+    pid = os.getpid()
+    logger.info(f"Process {pid}: initiated")
     fixture = real_storage_factory.create_fixture()
     lib = fixture.create_arctic()[lib_name]
-    lock = ReliableStorageLock("test_lock", lib._nvs._library, 10 * one_sec)
+    lock = ReliableStorageLock("test_lock", lib._nvs._library, storage_lock_timeout_sec * one_sec)
     lock_manager = ReliableStorageLockManager()
     lock_manager.take_lock_guard(lock)
+    logger.info(f"Process {pid}: start read")
     df = lib.read(symbol).data
+    logger.info(f"Process {pid}: previous value {df['col'][0]}")
     df["col"][0] = df["col"][0] + 1
     time.sleep(sleep_time)
     lib.write(symbol, df)
+    logger.info(f"Process {pid}: incrementing and saving value {df['col'][0]}")
+    symbol_name = f"{symbol_prefix}{pid}"
+    lib.write(symbol_name, df)
+    logger.info(f"Process {pid}: wrote unique symbol {symbol_name}")
     lock_manager.free_lock_guard()
+    logger.info(f"Process {pid}: completed")

-
-@pytest.mark.parametrize("num_processes,max_sleep", [(100, 1), (5, 20)])
+# NOTE: If there is not enough memory, the number of actually spawned processes
+# will be lower. The test counts the processes that actually got executed.
+@pytest.mark.parametrize("num_processes,max_sleep", [(max_processes, 1), (5, 2 * storage_lock_timeout_sec)])
 @REAL_S3_TESTS_MARK
 @pytest.mark.storage
 def test_many_increments(real_storage_factory, lib_name, num_processes, max_sleep):
@@ -51,8 +69,13 @@ def test_many_increments(real_storage_factory, lib_name, num_processes, max_sleep):
     for p in processes:
         p.join()

+    symbols = lib.list_symbols(regex=f"{symbol_prefix}.*")
+    num_processes_succeeded = len(symbols)
+    logger.info(f"Total number of live processes: {num_processes_succeeded}")
+    logger.info(f"{symbols}")
+
     vit = lib.read(symbol)
     read_df = vit.data
-    expected_df = pd.DataFrame({"col": [num_processes]})
+    expected_df = pd.DataFrame({"col": [num_processes_succeeded]})
     assert_frame_equal(read_df, expected_df)
-    assert vit.version == num_processes
+    assert vit.version == num_processes_succeeded
```
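The updated test asserts that the shared counter equals the number of `process_id_*` marker symbols. The following self-contained toy model (plain `multiprocessing`, no ArcticDB; all names are illustrative) shows why that invariant holds: each worker increments the counter and leaves its unique marker while still holding the lock, so the two values stay in lockstep even if some requested processes never spawn:

```python
import os
from multiprocessing import Manager, Process

def worker(lock, state, markers):
    with lock:
        state["counter"] += 1                        # the lock-protected increment
        markers.append(f"process_id_{os.getpid()}")  # unique marker, still under the lock

if __name__ == "__main__":
    manager = Manager()
    lock = manager.Lock()
    state = manager.dict({"counter": 0})
    markers = manager.list()
    processes = [Process(target=worker, args=(lock, state, markers)) for _ in range(10)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    # The invariant the test relies on: counter == number of unique markers,
    # regardless of how many of the requested processes actually ran.
    assert state["counter"] == len(markers)
```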
