Conversation

@jkppr (Collaborator) commented Jan 7, 2026

This PR updates the timesketch-import-client to resolve persistent HTTP 413 (Payload Too Large) errors when uploading large datasets (CSV, JSONL). The errors stem from the recently enforced MAX_FORM_MEMORY_SIZE limit (default 200MB) in Werkzeug/Flask, compounded by the data expansion caused by URL-encoding large JSON payloads.

Key Changes:

  1. Switch to multipart/form-data:

    • Changed the transport method in _upload_data_frame and _upload_data_buffer from application/x-www-form-urlencoded to multipart/form-data.
    • Reasoning: URL-encoding JSON data increases payload size by approximately 2x-3x. Multipart allows sending the JSON string as a raw file part, eliminating encoding overhead and ensuring the on-wire size matches the actual data size.
  2. Dynamic Recursive Chunking:

    • Implemented logic to pre-calculate the byte size of the JSON payload before transmission.
    • If a chunk exceeds the configured server limit, the client now recursively splits the batch in half until it fits within the limit.
    • This replaces the reliance on a static row count (default 50,000), which was unreliable for datasets containing large individual events.
  3. Configurable Payload Limits:

    • Added a --max-payload-size argument to the CLI (timesketch_importer) and a corresponding setter in ImportStreamer.
    • Defaults to 200MB (matching Werkzeug defaults) but allows administrators to align the client with custom server configurations.
  4. Optimization & Cleanup:

    • Removed legacy time.sleep(2) calls in the upload loop. Retries and backoff are already handled robustly by the urllib3 HTTPAdapter in the API client session.
    • Added the default MAX_FORM_MEMORY_SIZE value of 200MB to app.py to handle deployments where the config has not been updated.
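The multipart switch and the recursive split (items 1 and 2 above) can be sketched as follows. This is an illustrative sketch, not the actual ImportStreamer internals: the constant and function names are made up, and the real client ties the limit to set_max_payload_size.

```python
import json

# Illustrative constants; the real client configures the limit at runtime.
MAX_PAYLOAD_SIZE = 200 * 1024 * 1024  # 200MB, the limit cited in this PR
SAFETY_BUFFER = 0.9  # stay comfortably below the server-side limit


def split_batches(events, limit):
    """Recursively halve a batch until its serialized JSON fits under limit."""
    payload = json.dumps(events)
    if len(payload.encode("utf-8")) <= limit or len(events) <= 1:
        return [payload]
    mid = len(events) // 2
    return split_batches(events[:mid], limit) + split_batches(events[mid:], limit)


def upload_chunks(events, url, session):
    """POST each chunk as a raw multipart file part.

    Sending the JSON string as a file part (requests' `files=` argument)
    avoids URL-encoding, so the on-wire size matches the serialized size.
    """
    for payload in split_batches(events, int(MAX_PAYLOAD_SIZE * SAFETY_BUFFER)):
        files = {"events": ("events.jsonl", payload, "application/json")}
        session.post(url, files=files)
```

Because the split decision is based on the measured byte size of the serialized chunk, a batch of a few huge events is split just as reliably as a batch of many small ones, which the old static 50,000-row default could not guarantee.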
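The CLI plumbing for item 3 amounts to a single argparse flag. The flag name and default come from the PR description; the parser setup and the commented-out wiring into the streamer are illustrative:

```python
import argparse

parser = argparse.ArgumentParser(description="Illustrative importer CLI fragment.")
parser.add_argument(
    "--max-payload-size",
    type=int,
    default=200 * 1024 * 1024,  # 200MB, matching the default cited in this PR
    help="Maximum upload chunk size in bytes; align with the server's "
    "MAX_FORM_MEMORY_SIZE.",
)

# Example: an operator whose server only accepts 100MB per request.
args = parser.parse_args(["--max-payload-size", str(100 * 1024 * 1024)])
# streamer.set_max_payload_size(args.max_payload_size)  # per the PR's new setter
```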
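On the server side, the fallback described in item 4 boils down to setting the config key explicitly. A configuration sketch under the assumption of a standard Flask app factory; the actual app.py change may differ:

```python
from flask import Flask

app = Flask(__name__)
# Explicit fallback so deployments whose config files were never updated
# still accept large multipart form parts (value per this PR's description).
app.config["MAX_FORM_MEMORY_SIZE"] = 200 * 1024 * 1024
```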

Impact:

  • Eliminates false-positive 413 errors for chunks that are physically smaller than the server limit but inflated by encoding.
  • Improves upload performance by removing client-side encoding and server-side decoding CPU overhead.
  • Ensures large uploads are robust against memory limits regardless of individual row sizes.

jkppr added 2 commits January 7, 2026 15:04
…k with the new Timesketch form memory limit introduced by Werkzeug.
@jkppr jkppr self-assigned this Jan 7, 2026
@jkppr jkppr added the Data import All things that are with importing data label Jan 7, 2026
@jkppr (Collaborator, Author) commented Jan 7, 2026

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request significantly enhances the Timesketch importer's ability to handle large data uploads by implementing dynamic payload splitting. The ImportStreamer class now defines a DEFAULT_MAX_PAYLOAD_SIZE and a PAYLOAD_SAFETY_BUFFER. The _upload_data_buffer and _upload_data_frame methods were refactored to serialize data, check its size against a calculated safe limit, and recursively split the data into smaller chunks for upload if the limit is exceeded. Data is now sent as a multipart/form-data field. A new set_max_payload_size method was added to allow configuration of the maximum payload size, which is exposed via a new --max-payload-size command-line argument in the timesketch_importer.py tool. Minor changes include import reordering and the removal of a time.sleep workaround.

@jkppr jkppr requested a review from jaegeral January 7, 2026 15:35
@jaegeral (Collaborator) left a comment

Some comments / suggestions

@jkppr jkppr requested a review from jaegeral January 7, 2026 17:07
@jaegeral (Collaborator) left a comment

LGTM

@jaegeral jaegeral merged commit dc2fe73 into google:master Jan 8, 2026
9 checks passed
@jkppr jkppr deleted the 473875920-importer-chunks branch January 8, 2026 07:39