[test optimization] Add filesystem cache for test optimization API requests#7919

Merged
juan-fernandez merged 16 commits into master from juan-fernandez/known-tests-fs-cache
Apr 6, 2026

Collaborator

@juan-fernandez juan-fernandez commented Apr 2, 2026

What does this PR do?

Adds an opt-in filesystem cache (os.tmpdir()) for three test optimization API endpoints: known tests, skippable suites, and test management tests. When enabled via DD_EXPERIMENTAL_TEST_REQUESTS_FS_CACHE, the first process to request data acquires an exclusive lock, fetches from the API, and writes the result to a shared cache file. Concurrent processes wait for the cache to appear instead of making redundant requests. Cache entries expire after 30 minutes.

A shared fs-cache.js module provides a reusable withCache() wrapper with:

  • Deterministic cache keys (SHA-256 of JSON-serialized request parameters, prefixed per endpoint)
  • Cross-process deduplication via O_CREAT|O_EXCL lock files
  • Lock heartbeat (every 30s, atomic via temp+rename) to prevent false stale detection during slow paginated fetches
  • Atomic writes throughout (temp file + rename) to prevent partial reads on both cache and lock files
  • Stale lock recovery with atomic takeover: waiters re-acquire the lock before fetching, preventing thundering herd on crash recovery
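
The lock-file item in the list above can be illustrated with a minimal sketch. It assumes the lock file stores the acquisition time as text; `tryAcquireLock` and `releaseLock` are hypothetical names, not necessarily those used in fs-cache.js.

```javascript
'use strict'

const fs = require('node:fs')
const os = require('node:os')
const path = require('node:path')

// Hypothetical helper: try to become the single fetcher for a cache key.
// The 'wx' flag creates the file and fails with EEXIST if it already
// exists (O_CREAT | O_EXCL), so only one process can win.
function tryAcquireLock (lockPath) {
  let fd
  try {
    fd = fs.openSync(lockPath, 'wx')
  } catch (err) {
    if (err.code === 'EEXIST') return false
    throw err
  }
  try {
    // Assumption: the lock content is a timestamp that waiters can inspect.
    fs.writeSync(fd, String(Date.now()))
  } finally {
    fs.closeSync(fd)
  }
  return true
}

// Hypothetical helper: drop the lock once the cache file has been written.
function releaseLock (lockPath) {
  fs.rmSync(lockPath, { force: true })
}

// Usage: the second attempt loses because the lock file already exists.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'lock-demo-'))
const demoLock = path.join(demoDir, 'known-tests.lock')
const first = tryAcquireLock(demoLock)
const second = tryAcquireLock(demoLock)
releaseLock(demoLock)
```

The exclusive create is what gives cross-process deduplication: no coordination beyond the filesystem is needed, and the losers know immediately that someone else is fetching.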

Motivation

In monorepo setups using tools like lage with potentially thousands of parallel jest sessions, every session independently fetches the same data from the API.

In monorepo setups with thousands of parallel jest sessions (e.g. lage
with 3000+ packages), every session independently fetches the same known
tests from the API. With 200k+ tests and cursor-based pagination, this
causes massive redundant network traffic and delays session startup.

Add a filesystem cache in os.tmpdir() keyed on (sha, service, env,
repositoryUrl, configurations). The first session acquires an exclusive
lock (O_CREAT|O_EXCL), fetches from the API, and writes the cache
atomically. Concurrent sessions poll for the cache file to appear.
Cache entries expire after 30 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
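
The keying described in this commit might look roughly like the sketch below. The function name, the file-name prefix, and the `.json` suffix are assumptions for illustration; only the key fields, SHA-256, and os.tmpdir() come from the PR text.

```javascript
'use strict'

const crypto = require('node:crypto')
const os = require('node:os')
const path = require('node:path')

// Hypothetical helper: map request parameters to a cache file path.
// Identical parameters always produce the same path, so every session
// working on the same commit looks at the same file.
function getCachePath (prefix, { sha, service, env, repositoryUrl, configurations }) {
  const key = crypto
    .createHash('sha256')
    // Note: JSON.stringify is sensitive to object key order, so
    // `configurations` must be built the same way in every process.
    .update(JSON.stringify([sha, service, env, repositoryUrl, configurations]))
    .digest('hex')
  return path.join(os.tmpdir(), `${prefix}-${key}.json`)
}

// Usage with made-up values.
const params = {
  sha: 'abc123',
  service: 'my-service',
  env: 'ci',
  repositoryUrl: 'https://example.com/org/repo.git',
  configurations: { 'os.platform': 'linux' }
}
const a = getCachePath('dd-known-tests', params)
const b = getCachePath('dd-known-tests', { ...params })
const c = getCachePath('dd-known-tests', { ...params, sha: 'def456' })
```

Here `a` and `b` point at the same file, while `c` (a different commit SHA) gets its own entry.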
Integrate both the filesystem cache (this branch) and cursor-based
pagination (from master #7866) into fetchFromApi. Also fix a bug where
writeToCache referenced the old `knownTests` variable instead of
`aggregateTests`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

github-actions bot commented Apr 2, 2026

Overall package size

Self size: 5.46 MB
Deduped: 6.3 MB
No deduping: 6.3 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| import-in-the-middle | 3.0.0 | 81.15 kB | 815.98 kB |
| dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 87.82609% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.43%. Comparing base (ce653ab) to head (f2c09cc).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...es/dd-trace/src/ci-visibility/requests/fs-cache.js | 85.10% | 14 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7919      +/-   ##
==========================================
+ Coverage   74.26%   74.43%   +0.17%     
==========================================
  Files         765      766       +1     
  Lines       35786    35906     +120     
==========================================
+ Hits        26575    26727     +152     
+ Misses       9211     9179      -32     
Flag Coverage Δ
aiguard-macos 39.41% <ø> (-0.10%) ⬇️
aiguard-ubuntu 39.53% <ø> (-0.10%) ⬇️
aiguard-windows 39.20% <ø> (-0.10%) ⬇️
apm-capabilities-tracing-macos 49.51% <87.82%> (+0.31%) ⬆️
apm-capabilities-tracing-ubuntu 49.42% <87.82%> (+0.31%) ⬆️
apm-capabilities-tracing-windows 49.21% <87.82%> (+0.23%) ⬆️
apm-integrations-child-process 38.74% <ø> (-0.10%) ⬇️
apm-integrations-couchbase-18 37.52% <ø> (-0.13%) ⬇️
apm-integrations-couchbase-eol 38.04% <ø> (-0.10%) ⬇️
apm-integrations-oracledb 37.87% <ø> (-0.23%) ⬇️
appsec-express 55.39% <ø> (-0.07%) ⬇️
appsec-fastify 51.72% <ø> (-0.07%) ⬇️
appsec-graphql 51.88% <ø> (-0.07%) ⬇️
appsec-kafka 44.49% <ø> (-0.08%) ⬇️
appsec-ldapjs 44.11% <ø> (-0.08%) ⬇️
appsec-lodash 43.71% <ø> (-0.08%) ⬇️
appsec-macos 58.15% <ø> (-0.07%) ⬇️
appsec-mongodb-core 48.90% <ø> (-0.08%) ⬇️
appsec-mongoose 49.55% <ø> (-0.08%) ⬇️
appsec-mysql 51.08% <ø> (-0.07%) ⬇️
appsec-node-serialize 43.29% <ø> (-0.08%) ⬇️
appsec-passport 47.76% <ø> (-0.09%) ⬇️
appsec-postgres 50.70% <ø> (-0.07%) ⬇️
appsec-sourcing 42.54% <ø> (-0.08%) ⬇️
appsec-stripe 44.73% <ø> (-0.09%) ⬇️
appsec-template 43.46% <ø> (-0.08%) ⬇️
appsec-ubuntu 58.23% <ø> (-0.07%) ⬇️
appsec-windows 57.95% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-bluebird 32.32% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-body-parser 40.63% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-child_process 38.07% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-cookie-parser 34.35% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-express 34.67% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-express-mongo-sanitize 34.48% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-express-session 40.27% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-fs 32.00% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-generic-pool 29.46% <ø> (ø)
instrumentations-instrumentation-http 39.99% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-knex 32.39% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-mongoose 33.51% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-multer 40.38% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-mysql2 38.40% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-passport 44.16% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-passport-http 43.84% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-passport-local 44.37% <ø> (-0.09%) ⬇️
instrumentations-instrumentation-pg 37.84% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-promise 32.25% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-promise-js 32.26% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-q 32.30% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-url 32.22% <ø> (-0.11%) ⬇️
instrumentations-instrumentation-when 32.27% <ø> (-0.11%) ⬇️
llmobs-ai 41.61% <ø> (-0.10%) ⬇️
llmobs-anthropic 40.84% <ø> (+0.18%) ⬆️
llmobs-bedrock 39.32% <ø> (-0.08%) ⬇️
llmobs-google-genai 39.87% <ø> (-0.09%) ⬇️
llmobs-langchain 39.34% <ø> (-0.09%) ⬇️
llmobs-openai 44.12% <ø> (-0.05%) ⬇️
llmobs-vertex-ai 40.13% <ø> (-0.09%) ⬇️
platform-core 31.47% <ø> (ø)
platform-esbuild 34.42% <ø> (ø)
platform-instrumentations-misc 34.11% <ø> (ø)
platform-shimmer 37.56% <ø> (ø)
platform-unit-guardrails 32.89% <ø> (ø)
platform-webpack 19.96% <ø> (ø)
plugins-azure-durable-functions 25.74% <ø> (ø)
plugins-azure-event-hubs 25.90% <ø> (ø)
plugins-azure-service-bus 25.26% <ø> (ø)
plugins-bullmq 43.60% <ø> (-0.10%) ⬇️
plugins-cassandra 38.02% <ø> (-0.10%) ⬇️
plugins-cookie 26.96% <ø> (ø)
plugins-cookie-parser 26.75% <ø> (ø)
plugins-crypto 26.73% <ø> (ø)
plugins-dd-trace-api 38.43% <ø> (-0.10%) ⬇️
plugins-express-mongo-sanitize 26.89% <ø> (ø)
plugins-express-session 26.70% <ø> (ø)
plugins-fastify 42.36% <ø> (-0.09%) ⬇️
plugins-fetch 38.51% <ø> (-0.12%) ⬇️
plugins-fs 38.75% <ø> (-0.10%) ⬇️
plugins-generic-pool 25.94% <ø> (ø)
plugins-google-cloud-pubsub 45.69% <ø> (-0.09%) ⬇️
plugins-grpc 41.01% <ø> (-0.09%) ⬇️
plugins-handlebars 26.94% <ø> (ø)
plugins-hapi 40.27% <ø> (-0.24%) ⬇️
plugins-hono 40.61% <ø> (-0.24%) ⬇️
plugins-ioredis 38.60% <ø> (-0.10%) ⬇️
plugins-knex 26.57% <ø> (ø)
plugins-langgraph 37.99% <ø> (-0.10%) ⬇️
plugins-ldapjs 24.43% <ø> (ø)
plugins-light-my-request 26.30% <ø> (ø)
plugins-limitd-client 32.60% <ø> (-0.10%) ⬇️
plugins-lodash 26.03% <ø> (ø)
plugins-mariadb 39.61% <ø> (-0.10%) ⬇️
plugins-memcached 38.34% <ø> (-0.10%) ⬇️
plugins-microgateway-core 39.34% <ø> (-0.10%) ⬇️
plugins-moleculer 40.63% <ø> (-0.09%) ⬇️
plugins-mongodb 39.27% <ø> (-0.10%) ⬇️
plugins-mongodb-core 39.12% <ø> (-0.10%) ⬇️
plugins-mongoose 38.92% <ø> (-0.19%) ⬇️
plugins-multer 26.70% <ø> (ø)
plugins-mysql 39.45% <ø> (-0.10%) ⬇️
plugins-mysql2 39.40% <ø> (-0.10%) ⬇️
plugins-node-serialize 27.00% <ø> (ø)
plugins-opensearch 37.74% <ø> (-0.10%) ⬇️
plugins-passport-http 26.76% <ø> (ø)
plugins-postgres 35.59% <ø> (-0.06%) ⬇️
plugins-process 26.73% <ø> (ø)
plugins-pug 26.96% <ø> (ø)
plugins-redis 39.04% <ø> (-0.10%) ⬇️
plugins-router 43.22% <ø> (-0.24%) ⬇️
plugins-sequelize 25.55% <ø> (ø)
plugins-test-and-upstream-amqp10 38.62% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-amqplib 44.37% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-apollo 39.24% <ø> (-0.09%) ⬇️
plugins-test-and-upstream-avsc 38.69% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-bunyan 33.93% <ø> (-0.25%) ⬇️
plugins-test-and-upstream-connect 40.93% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-graphql 40.28% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-koa 40.52% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-protobufjs 38.92% <ø> (-0.10%) ⬇️
plugins-test-and-upstream-rhea 44.40% <ø> (-0.10%) ⬇️
plugins-undici 39.36% <ø> (-0.09%) ⬇️
plugins-url 26.73% <ø> (ø)
plugins-valkey 38.31% <ø> (-0.10%) ⬇️
plugins-vm 26.73% <ø> (ø)
plugins-winston 34.26% <ø> (-0.10%) ⬇️
plugins-ws 42.12% <ø> (-0.10%) ⬇️
profiling-macos 40.65% <ø> (-0.10%) ⬇️
profiling-ubuntu 40.78% <ø> (-0.10%) ⬇️
profiling-windows 42.30% <ø> (-0.10%) ⬇️
serverless-azure-functions-client 25.62% <ø> (ø)
serverless-azure-functions-eventhubs 25.62% <ø> (ø)
serverless-azure-functions-servicebus 25.62% <ø> (ø)

datadog-prod-us1-6 bot commented Apr 2, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 85.22%
Overall Coverage: 68.88% (+0.17%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: f2c09cc

juan-fernandez and others added 4 commits April 2, 2026 20:47
- waitForCache now removes the stale lock file before falling back to a
  direct fetch, so subsequent processes can re-use the deduplication path
- 500-response tests now provide two nock replies to account for the
  request module's built-in 5xx retry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fetches

The lock owner now touches the lock file every 30s so waiters can
distinguish a slow-but-healthy pagination (e.g. 200k tests over many
pages with 20s timeouts + retries) from a crashed owner. Without this,
a fetch exceeding 2 minutes would be misclassified as stale, causing
waiters to break the lock and fetch concurrently — defeating the
deduplication on exactly the large payloads it exists to protect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
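
A heartbeat along the lines of this commit could be sketched as follows. `touchLock` and `startHeartbeat` are hypothetical names; the 30-second interval is the one stated in the commit message.

```javascript
'use strict'

const fs = require('node:fs')
const os = require('node:os')
const path = require('node:path')

const HEARTBEAT_INTERVAL_MS = 30_000 // interval stated in the commit message

// Hypothetical helper: refresh the timestamp stored in the lock file.
// A plain writeFileSync is used here for brevity; a later commit in this
// PR replaces it with temp file + rename to avoid empty reads.
function touchLock (lockPath) {
  fs.writeFileSync(lockPath, String(Date.now()))
}

// Hypothetical helper: keep the lock fresh while a slow fetch runs.
// Returns a function that stops the heartbeat.
function startHeartbeat (lockPath, intervalMs = HEARTBEAT_INTERVAL_MS) {
  const timer = setInterval(() => {
    try {
      touchLock(lockPath)
    } catch {
      // Best effort: a failed heartbeat must not break the fetch itself.
    }
  }, intervalMs)
  // Do not keep the process alive only because of the heartbeat.
  timer.unref()
  return () => clearInterval(timer)
}

// Usage: wrap the paginated fetch between start and stop.
const lockDir = fs.mkdtempSync(path.join(os.tmpdir(), 'heartbeat-demo-'))
const lockPath = path.join(lockDir, 'known-tests.lock')
touchLock(lockPath)
const stopHeartbeat = startHeartbeat(lockPath)
// ... the paginated fetch would run here ...
stopHeartbeat()
```

With a 30-second heartbeat and a 2-minute staleness threshold, a healthy owner would have to miss several consecutive beats before waiters consider the lock abandoned.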
Integration tests reuse the same mock server port within a file with
different known tests data per test case. The cache would return stale
data from a previous test, causing timeouts.

Add DD_CIVISIBILITY_KNOWN_TESTS_CACHE_DISABLED env var (default false)
and set it in getCiVisAgentlessConfig/getCiVisEvpProxyConfig helpers.
When set, getKnownTests bypasses cache entirely and fetches directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…KNOWN_TESTS_CACHE_ENABLED

Rename env var and flip the default: cache is now off unless explicitly
enabled. This avoids interference with integration tests (no env var
needed in test helpers) while letting monorepo users opt in for the
deduplication benefit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pr-commenter bot commented Apr 6, 2026

Benchmarks

Benchmark execution time: 2026-04-06 15:05:16

Comparing candidate commit f2c09cc in PR branch juan-fernandez/known-tests-fs-cache with baseline commit ce653ab in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 234 metrics, 26 unstable metrics.

…uests

Extract cache infrastructure into packages/dd-trace/src/ci-visibility/
requests/fs-cache.js with a reusable withCache() wrapper. Apply caching
to getKnownTests, getSkippableSuites, and getTestManagementTests.

All three are behind a single opt-in flag:
DD_EXPERIMENTAL_TEST_REQUESTS_FS_CACHE (default false).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@juan-fernandez juan-fernandez changed the title from "[test optimization] Add filesystem cache for known tests requests" to "[test optimization] Add filesystem cache for test optimization API requests" on Apr 6, 2026
juan-fernandez and others added 8 commits April 6, 2026 12:21
…cal position

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The env var is a string — '!!value' treats 'false' and '0' as truthy.
Use isTrue() which correctly handles 'true'/'1' vs 'false'/'0'.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
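
The difference between the two checks is easy to reproduce. The `isTrue` below is a simplified stand-in written for this example; the tracer's own helper may accept other inputs.

```javascript
'use strict'

// Simplified stand-in for an isTrue() helper (an assumption, not the
// tracer's actual implementation).
function isTrue (value) {
  if (typeof value !== 'string') return value === true
  const normalized = value.trim().toLowerCase()
  return normalized === 'true' || normalized === '1'
}

// Environment variables are strings (or undefined when unset).
const samples = ['true', '1', 'false', '0', undefined]

// Any non-empty string is truthy, so 'false' and '0' enable the flag.
const viaDoubleBang = samples.map(value => !!value)
// → [true, true, true, true, false]

// Comparing against the accepted spellings gives the intended result.
const viaIsTrue = samples.map(value => isTrue(value))
// → [true, true, false, false, false]
```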
readFromCache now rejects entries where the data field is undefined or
null. This prevents stale cache files written with an older format
(e.g. { timestamp, knownTests } instead of { timestamp, data }) from
being treated as valid cache hits that return undefined data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
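
A read path with this validation might look like the sketch below. It is an illustration under assumptions: the function signature and the choice to return `undefined` on a miss are not taken from the PR, while the `{ timestamp, data }` shape and the 30-minute expiry are.

```javascript
'use strict'

const fs = require('node:fs')
const os = require('node:os')
const path = require('node:path')

const CACHE_TTL_MS = 30 * 60 * 1000 // 30-minute expiry from the PR description

// Hypothetical sketch: return the cached data, or undefined on a miss.
function readFromCache (cachePath, now = Date.now()) {
  let entry
  try {
    entry = JSON.parse(fs.readFileSync(cachePath, 'utf8'))
  } catch {
    return undefined // missing or unparsable file
  }
  if (entry === null || typeof entry !== 'object') return undefined
  // Expired or malformed timestamp.
  if (typeof entry.timestamp !== 'number' || now - entry.timestamp > CACHE_TTL_MS) return undefined
  // Reject files written with an older shape that has no `data` field.
  if (entry.data === undefined || entry.data === null) return undefined
  return entry.data
}

// Usage: an old-format file is a miss, a new-format file is a hit.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'cache-demo-'))
const oldFormat = path.join(dir, 'old.json')
const newFormat = path.join(dir, 'new.json')
fs.writeFileSync(oldFormat, JSON.stringify({ timestamp: Date.now(), knownTests: { jest: {} } }))
fs.writeFileSync(newFormat, JSON.stringify({ timestamp: Date.now(), data: { jest: {} } }))
const fromOld = readFromCache(oldFormat)
const fromNew = readFromCache(newFormat)
```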
parts.join('|') is collision-prone: fields containing '|' or undefined
values collapsing with '' can produce identical hashes for different
inputs. JSON.stringify(parts) preserves array structure and
distinguishes undefined from '' and objects from their string form.

Remove redundant JSON.stringify(custom) at call sites since the
top-level JSON.stringify handles it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
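
The collision is straightforward to demonstrate. The helper names below are made up for the example; the two serialization strategies are the ones named in the commit message.

```javascript
'use strict'

const crypto = require('node:crypto')

const sha256 = input => crypto.createHash('sha256').update(input).digest('hex')

// The two key strategies compared in the commit message.
const keyWithJoin = parts => sha256(parts.join('|'))
const keyWithJson = parts => sha256(JSON.stringify(parts))

// A field containing the separator: both arrays join to 'a|b|c'.
const left = ['a|b', 'c']
const right = ['a', 'b|c']

// join() renders undefined as '', so both arrays join to 'svc|'.
const withUndefined = ['svc', undefined]
const withEmpty = ['svc', '']

const joinCollidesOnSeparator = keyWithJoin(left) === keyWithJoin(right) // true
const joinCollidesOnUndefined = keyWithJoin(withUndefined) === keyWithJoin(withEmpty) // true
const jsonCollidesOnSeparator = keyWithJson(left) === keyWithJson(right) // false
const jsonCollidesOnUndefined = keyWithJson(withUndefined) === keyWithJson(withEmpty) // false
```

One caveat: inside an array, JSON.stringify serializes `undefined` as `null`, so those two values still share a key. That is harmless when callers never pass an explicit `null`.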
Exercise the cached-return path for getSkippableSuites (including
correlationId unwrap) and getTestManagementTests. Each test file
verifies: fetch + callback shape, cache hit on second call, and
lock cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eness

The fixed 2-minute deadline caused waiters to fall back to direct fetch
even when the lock owner was still alive (heartbeat fresh). This
defeated deduplication for slow paginated fetches that exceed 2 minutes.

Now waiters only fall back when isLockStale() returns true (lock file
timestamp older than 2 minutes without heartbeat update), which
correctly distinguishes a crashed owner from a slow healthy one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
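
A staleness check of this kind could be sketched as below. The body is an assumption based on the description (timestamp stored in the lock file, 2-minute threshold); in particular, treating an unreadable lock as stale is a choice made for this example.

```javascript
'use strict'

const fs = require('node:fs')
const os = require('node:os')
const path = require('node:path')

const LOCK_STALE_MS = 2 * 60 * 1000 // 2-minute threshold from the commit message

// Hypothetical sketch: a lock is stale when its stored timestamp has not
// been refreshed by a heartbeat within the threshold.
function isLockStale (lockPath, now = Date.now()) {
  let timestamp
  try {
    timestamp = Number(fs.readFileSync(lockPath, 'utf8'))
  } catch {
    return true // assumption: a lock that cannot be read protects nothing
  }
  if (!Number.isFinite(timestamp)) return true
  return now - timestamp > LOCK_STALE_MS
}

// Usage: one lock refreshed just now, one last refreshed 3 minutes ago.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'stale-demo-'))
const freshLock = path.join(dir, 'fresh.lock')
const abandonedLock = path.join(dir, 'abandoned.lock')
fs.writeFileSync(freshLock, String(Date.now()))
fs.writeFileSync(abandonedLock, String(Date.now() - 3 * 60 * 1000))
const freshIsStale = isLockStale(freshLock)
const abandonedIsStale = isLockStale(abandonedLock)
```

Waiters that poll with a check like this keep waiting as long as the owner's heartbeat keeps the timestamp recent, regardless of how long the fetch itself takes.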
writeFileSync truncates the file before writing, creating a brief
window where the lock file is empty. Waiters polling at that moment
read Number('') = 0, compute Date.now() - 0 > 120000 = true, and
misclassify the lock as stale — breaking deduplication.

Use temp file + rename (same pattern as writeToCache) so readers
always see either the old timestamp or the new one, never empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
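
The temp file + rename pattern can be sketched as follows. `writeFileAtomic` and the temp-file naming are hypothetical; the sketch also relies on the temp file living in the same directory as the target, since a rename is only atomic within one filesystem.

```javascript
'use strict'

const fs = require('node:fs')
const os = require('node:os')
const path = require('node:path')

// Hypothetical helper: write the full contents to a sibling temp file,
// then rename it over the target. Readers see either the old contents
// or the new contents, never a truncated file.
function writeFileAtomic (filePath, contents) {
  const tmpPath = `${filePath}.${process.pid}.${Date.now()}.tmp`
  try {
    fs.writeFileSync(tmpPath, contents)
    fs.renameSync(tmpPath, filePath)
  } catch (err) {
    fs.rmSync(tmpPath, { force: true }) // do not leave temp files behind
    throw err
  }
}

// The failure mode described above: an empty read parses as 0, which
// always looks older than the 2-minute threshold.
const emptyReadLooksStale = Date.now() - Number('') > 120000

// Usage: refresh a lock timestamp without exposing an empty file.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'atomic-demo-'))
const lockFile = path.join(dir, 'lock')
writeFileAtomic(lockFile, '1')
writeFileAtomic(lockFile, String(Date.now()))
const stored = Number(fs.readFileSync(lockFile, 'utf8'))
```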
- mergeKnownTests is only used internally, no need to export
- integration test helpers don't need changes since cache is opt-in

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@juan-fernandez juan-fernandez marked this pull request as ready for review April 6, 2026 14:48
@juan-fernandez juan-fernandez requested a review from a team as a code owner April 6, 2026 14:48
@juan-fernandez juan-fernandez requested review from ida613 and removed request for a team April 6, 2026 14:48
@juan-fernandez juan-fernandez merged commit e2b2bae into master Apr 6, 2026
788 checks passed
@juan-fernandez juan-fernandez deleted the juan-fernandez/known-tests-fs-cache branch April 6, 2026 15:32
dd-octo-sts bot pushed a commit that referenced this pull request Apr 6, 2026
@dd-octo-sts dd-octo-sts bot mentioned this pull request Apr 6, 2026