Skip to content

An Attempt to Fix the WS Tests with Intermittent Race Condition#1309

Open
genematx wants to merge 11 commits intobluesky:mainfrom
genematx:fix-ws-tests
Open

An Attempt to Fix the WS Tests with Intermittent Race Condition#1309
genematx wants to merge 11 commits intobluesky:mainfrom
genematx:fix-ws-tests

Conversation

@genematx
Copy link
Contributor

@genematx genematx commented Mar 13, 2026

This is an attempt to fix intermittent Timeout errors seen on CI/CD when running WebSockets-related tests, e.g. this.

It is likely that there exists a race condition between when the Redis client subscribes to new updates and receives replayed history. If during the replay, but before the subscription has started, any new updates are received, they would be missed.

Additionally, a check on the server has been added to ensure that the update is not duplicated when sending them to the client.

Client-side, the start_in_thread method now blocks until the connection is established.

It is not clear if this is the root cause of the problem with CI tests, but the behavior can be observed when running the following script:

from pathlib import Path
import tempfile
import threading
import numpy as np
import redis

from tiled.catalog import from_uri
from tiled.client import Context
from tiled.config import Authentication
from tiled.server.app import build_app
from tiled.client import from_context

import uuid

import time

redis_uri = "redis://localhost:6379/0"
redis.Redis.from_url(redis_uri).flushall()  # Clear Redis before running the test
tmpdir = Path(tempfile.gettempdir())

tree = from_uri(
    "sqlite:///:memory:",
    writable_storage=[
        f"file://localhost{str(tmpdir / 'data')}",
        f"duckdb:///{tmpdir / 'data.duckdb'}",
    ],
    readable_storage=[tmpdir.resolve()],
    init_if_not_exists=True,
    cache_config={
        "uri": redis_uri,
        "data_ttl": 600,  # 10 minutes
        "seq_ttl": 600,  # 10 minutes
        "socket_timeout": 600,  # 10 minutes
        "socket_connect_timeout": 10,
    },
)
app = build_app(
    tree,
    authentication=Authentication(
        single_user_api_key="secret",
        allow_anonymous_access=True,
    ),
)

context = Context.from_app(app)
client = from_context(context)

# Or use the Normal Redis client with an external server
# from tiled.client import from_context, from_uri as client_from_uri
# client = client_from_uri("http://localhost:8000", api_key="secret")['test/ws_test']

number_wrong_runs = 0

for i in range(100):
    print(f"\nRunning test iteration {i + 1}...")

    # Create streaming array node using Tiled client
    arr = np.arange(10)
    unique_key = f"test_stream_from_beginning_{uuid.uuid4().hex[:8]}"
    streaming_node = client.write_array(arr, key=unique_key)

    # Write first updates before subscribing
    for i in range(1, 5):
        first_update = arr + i
        streaming_node.write(first_update)

    # Set up subscription using the Subscription class
    received = []
    event = threading.Event()
    def callback(update):
        received.append(update)
        if len(received) >= 5:
            event.set()

    # Create subscription for the streaming node
    sub = streaming_node.subscribe()
    sub.new_data.add_callback(callback)

    # Start the subscription and write more updates
    with sub.start_in_thread(start=3):
        for i in range(5, 7):
            new_arr = arr + i
            streaming_node.write(new_arr)

        # time.sleep(0.02)  # Give the server time to process the updates
        event.wait(timeout=10.0)

        print(f"Total updates received: {len(received)}")
        if len(received) != 5:
            print("!!!Wrong number of updates received!!!")
            number_wrong_runs += 1

print(f"{number_wrong_runs=}")

Related: #1274

Checklist

  • Add a Changelog entry
  • Add the ticket number which this PR closes to the comment section

@genematx genematx changed the title WIP: Fixing WS Tests An Attempt to Fix the WS Tests with Intermittent Race Condition Mar 13, 2026
@genematx genematx marked this pull request as ready for review March 13, 2026 17:58
@danielballan
Copy link
Member

We can see from the CI results that this does not yet solve the problem, but it does seem to address a real issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants