
Conversation


sayerada commented Dec 15, 2025

Problem

After ductile PR #45 was merged, we're seeing intermittent socket timeout errors at exactly 10 seconds for long-running Elasticsearch queries.

Evidence:

  • Some requests succeed: 16s, 24.6s, 21.6s, 28.3s ✅
  • Some requests fail at exactly 10s ❌
  • Intermittent behavior suggests a connection-pool reuse issue

Root Cause:
The new ductile connection management defaults include a 10-second connection-timeout that appears to be reused as socket-timeout when not explicitly set. This causes failures for ES queries that take longer than 10 seconds (e.g., queries with 1000+ sub-requests that take several minutes).

Solution

Explicitly set timeout parameters when creating ES connections (see the sketch after this list):

  • socket-timeout: 600000ms (10 minutes) - allows long-running queries to complete
  • connection-timeout: 10000ms (10 seconds) - reasonable for establishing connection
  • validate-after-inactivity: 5000ms (5 seconds) - prevents NoHttpResponseException from stale connections
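
A minimal sketch of these settings as a Clojure options map. The option keys here simply mirror the parameter names listed above and are assumptions, not confirmed against ductile's option spec:

```clojure
;; Sketch only: option keys mirror the parameter names in this PR description.
(def es-timeout-opts
  {:socket-timeout            600000  ; 10 minutes, lets long-running queries finish
   :connection-timeout        10000   ; 10 seconds, establishing a connection should be fast
   :validate-after-inactivity 5000})  ; 5 seconds, re-validate idle pooled connections
```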

Technical Details

Modified ctia/stores/es/init.clj line 94 to pass explicit timeout configuration to ductile.conn/connect.

This is a temporary workaround until ctia's properties schema is updated to support these new ductile parameters as configurable properties (which would be the proper long-term solution).
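
Roughly, the change looks like the sketch below. This is illustrative only: the namespace, the helper name, and the exact map shape expected by ductile.conn/connect are assumptions, not the actual init.clj code.

```clojure
(ns ctia.stores.es.init-sketch
  (:require [ductile.conn :as conn]))

;; Illustrative helper: merge the hardcoded timeout defaults under the
;; user-supplied ES properties so that values coming from props win.
(defn connect-es
  [es-props]
  (let [default-timeouts {:socket-timeout            600000
                          :connection-timeout        10000
                          :validate-after-inactivity 5000}]
    (conn/connect (merge default-timeouts es-props))))
```

Merging the defaults under es-props keeps user-provided configuration taking precedence over the hardcoded timeout values.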

Testing

After deployment:

  1. Monitor for 10-second socket timeout errors - should disappear
  2. Verify long-running ES queries (>10s) complete successfully
  3. Check that NoHttpResponseException errors remain fixed (from ductile's validate-after-inactivity)

Related

  • ductile PR #45: threatgrid/ductile#45

Follow-up Work

For a proper long-term fix, we should:

  1. Update ctia/properties.clj to add schema for these parameters (rough sketch after this list)
  2. Make them configurable via properties files
  3. Remove this hardcoded workaround
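
A rough sketch of what step 1 could look like, assuming the properties schema uses plumatic schema; the property names and schema shape below are hypothetical, not ctia's actual schema:

```clojure
(ns ctia.properties-sketch
  (:require [schema.core :as s]))

;; Hypothetical property names; ctia's real schema may be organized differently.
(def ESTimeoutProperties
  {(s/optional-key "ctia.store.es.default.socket-timeout")            s/Int
   (s/optional-key "ctia.store.es.default.connection-timeout")        s/Int
   (s/optional-key "ctia.store.es.default.validate-after-inactivity") s/Int})
```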

Addresses socket timeout errors occurring at exactly 10 seconds for
long-running Elasticsearch queries (e.g., queries with 1000+ sub-requests
that take several minutes).

Root cause: After ductile PR #45 was merged, the new connection management
defaults include a 10-second connection-timeout that is being reused as
socket-timeout when not explicitly set. This causes intermittent failures
for requests that take longer than 10 seconds.

Solution: Explicitly set timeout parameters when creating ES connections:
- socket-timeout: 600000ms (10 minutes) - allows long-running queries
- connection-timeout: 10000ms (10 seconds) - reasonable for establishing connection
- validate-after-inactivity: 5000ms (5 seconds) - prevents NoHttpResponseException

This is a temporary workaround until ctia's properties schema is updated
to support these new ductile parameters (socket-timeout, connection-timeout,
validate-after-inactivity) as configurable properties.

Related:
- ductile PR #45: threatgrid/ductile#45
- Symptom: Requests failing at exactly 10s with socket timeout errors
- Evidence: Some requests succeed at 16s, 24s, 28s while others fail at 10s

User-provided configuration in props should take precedence over
default timeout values. This allows future flexibility if these
parameters are added to the properties schema.

ereteog (Contributor) commented Jan 5, 2026

default values may be fixed in ductile.
@gbuisson wdyt?

sayerada (Author) commented Jan 8, 2026

default values may be fixed in ductile.

@ereteog I don't see anything in the ductile repo since threatgrid/ductile#45 that sets socket-timeout. The issue is that, under the hood, socket-timeout is set to connection-timeout when it is not set explicitly. The default connection-timeout is 10000ms (10s), which is too short for a socket-timeout.

ereteog (Contributor) commented Jan 9, 2026

@sayerada the defaults are currently centralized in this ns: https://github.com/threatgrid/ductile/blob/master/src/ductile/conn.clj. Any other default values can be injected there.
@gbuisson since you started investigating the optimization of these default values, I think you may have the best perspective on this.
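
If the fix lands in ductile instead, it could be as small as adding an explicit socket-timeout entry to that centralized defaults map. A purely hypothetical illustration; the actual var and key names in ductile.conn may differ:

```clojure
;; Hypothetical sketch only, not ductile's actual code.
(def default-opts
  {:connection-timeout        10000
   ;; new explicit default, so socket-timeout no longer falls back to the
   ;; much shorter connection-timeout
   :socket-timeout            600000
   :validate-after-inactivity 5000})
```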
