-
Notifications
You must be signed in to change notification settings - Fork 3.9k
feat: add race tolerance to query-tee #20228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a race tolerance feature to the query-tee tool to better measure if Thor (a backend) is "good enough" to replace the chunks backend. The feature introduces a configurable tolerance period (default 100ms) that handicaps races in favor of non-preferred backends. If the preferred backend (v1) wins a race but a non-preferred backend (Thor) finishes within the tolerance window, Thor is declared the winner. This helps determine if performance differences are negligible from a user perspective.
Key Changes:
- Added
RaceToleranceconfiguration field with a default of 100ms - Modified race logic to wait for the tolerance period when the preferred backend wins first, giving non-preferred backends a chance to win if they finish within tolerance
- Refactored
Domethod by extracting helper functions (shouldSample,makeBackendRequests,finishRace) for improved code organization and readability
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/querytee/proxy.go | Adds RaceTolerance configuration field and flag registration with 100ms default |
| tools/querytee/handler_factory.go | Threads the RaceTolerance configuration through the handler factory |
| tools/querytee/fanout_handler.go | Implements race tolerance logic in the Do method and refactors code into helper functions |
| tools/querytee/fanout_handler_test.go | Updates existing race mode test to validate the new tolerance behavior |
| tools/querytee/proxy_endpoint.go | Adds TODO comment about detecting Grafana as an issuer (unrelated to main changes) |
Comments suppressed due to low confidence (2)
tools/querytee/fanout_handler_test.go:257
- Missing test coverage for critical edge case: When race tolerance is enabled and the preferred backend wins, the code waits for the tolerance period to see if the non-preferred backend finishes. However, there's no test for the scenario where the non-preferred backend returns an error or fails within the tolerance period. This is critical because the current code would incorrectly return the failed response (see bug in lines 130-138). Add a test case that validates the behavior when the preferred backend succeeds quickly, but the non-preferred backend fails within the race tolerance window.
func TestFanOutHandler_Do_RaceModeReturnsNonPreferredIfWithinTolerance(t *testing.T) {
backend1 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
time.Sleep(10 * time.Millisecond)
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"1"},"values":[["1000000000","log line 1"]]}]}}`))
require.NoError(t, err)
}))
defer backend1.Close()
backend2 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
time.Sleep(50 * time.Millisecond)
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"2"},"values":[["1000000000","log line 2"]]}]}}`))
require.NoError(t, err)
}))
defer backend2.Close()
backend1URL, _ := url.Parse(backend1.URL)
backend2URL, _ := url.Parse(backend2.URL)
backends := []*ProxyBackend{
NewProxyBackend("backend-1", backend1URL, 5*time.Second, true),
NewProxyBackend("backend-2", backend2URL, 5*time.Second, false),
}
handler := NewFanOutHandler(FanOutHandlerConfig{
Backends: backends,
Codec: queryrange.DefaultCodec,
Logger: log.NewNopLogger(),
Metrics: NewProxyMetrics(prometheus.NewRegistry()),
RouteName: "test_route",
EnableRace: true,
RaceTolerance: 100 * time.Millisecond,
})
req := &queryrange.LokiRequest{
Query: `{app="test"}`,
StartTs: time.Now().Add(-1 * time.Hour),
EndTs: time.Now(),
Limit: 100,
}
ctx := user.InjectOrgID(context.Background(), "test-tenant")
resp, err := handler.Do(ctx, req)
require.NoError(t, err)
require.NotNil(t, resp)
lokiResp, ok := resp.(*queryrange.LokiResponse)
require.True(t, ok)
require.Equal(t, "success", lokiResp.Status)
require.Len(t, lokiResp.Data.Result, 1)
require.Contains(t, lokiResp.Data.Result[0].Labels, `backend="2"`)
time.Sleep(100 * time.Millisecond)
}
tools/querytee/fanout_handler_test.go:257
- Missing test coverage for the race tolerance timeout scenario. The test verifies that a non-preferred backend can win if it finishes within the tolerance period, but there's no test for when the non-preferred backend takes longer than the tolerance period. Add a test case where the preferred backend finishes first and the non-preferred backend takes longer than the race tolerance (e.g., preferred at 10ms, non-preferred at 150ms, tolerance at 100ms), to verify that the preferred backend is correctly declared the winner when the timeout expires.
func TestFanOutHandler_Do_RaceModeReturnsNonPreferredIfWithinTolerance(t *testing.T) {
backend1 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
time.Sleep(10 * time.Millisecond)
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"1"},"values":[["1000000000","log line 1"]]}]}}`))
require.NoError(t, err)
}))
defer backend1.Close()
backend2 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
time.Sleep(50 * time.Millisecond)
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, err := w.Write([]byte(`{"status":"success","data":{"resultType":"streams","result":[{"stream":{"backend":"2"},"values":[["1000000000","log line 2"]]}]}}`))
require.NoError(t, err)
}))
defer backend2.Close()
backend1URL, _ := url.Parse(backend1.URL)
backend2URL, _ := url.Parse(backend2.URL)
backends := []*ProxyBackend{
NewProxyBackend("backend-1", backend1URL, 5*time.Second, true),
NewProxyBackend("backend-2", backend2URL, 5*time.Second, false),
}
handler := NewFanOutHandler(FanOutHandlerConfig{
Backends: backends,
Codec: queryrange.DefaultCodec,
Logger: log.NewNopLogger(),
Metrics: NewProxyMetrics(prometheus.NewRegistry()),
RouteName: "test_route",
EnableRace: true,
RaceTolerance: 100 * time.Millisecond,
})
req := &queryrange.LokiRequest{
Query: `{app="test"}`,
StartTs: time.Now().Add(-1 * time.Hour),
EndTs: time.Now(),
Limit: 100,
}
ctx := user.InjectOrgID(context.Background(), "test-tenant")
resp, err := handler.Do(ctx, req)
require.NoError(t, err)
require.NotNil(t, resp)
lokiResp, ok := resp.(*queryrange.LokiResponse)
require.True(t, ok)
require.Equal(t, "success", lokiResp.Status)
require.Len(t, lokiResp.Data.Result, 1)
require.Contains(t, lokiResp.Data.Result[0].Labels, `backend="2"`)
time.Sleep(100 * time.Millisecond)
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]> Signed-off-by: Trevor Whitney <[email protected]>
What this PR does / why we need it:
Add a race tolerance to the query-tee. This allows us to handicap races in favor or Thor. If the v1 backend wins, but Thor finishes within the race tolerance of v1 (default is 100ms), then we call that a win for Thor. The logic here being that 100ms is undetectable by a user, so it helps us better measure if "thor is good enough" (to replace chunks) rather than just "is thor faster".
Special notes for your reviewer:
Checklist
CONTRIBUTING.mdguide (required)featPRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.docs/sources/setup/upgrade/_index.mddeprecated-config.yamlanddeleted-config.yamlfiles respectively in thetools/deprecated-config-checkerdirectory. Example PR