
Resolve OOM when reading large logs in webserver #45079

Open
@jason810496

Description


Related context: #44753 (comment)

TL;DR

After conducting some research and implementing a POC, I would like to propose a potential solution. However, this solution requires changes to the airflow.utils.log.file_task_handler.FileTaskHandler. If the solution is accepted, it will necessitate modifications to 10 providers that extend the FileTaskHandler class.

Main Concept for Refactoring

The proposed solution focuses on:

  1. Returning a generator instead of loading the entire file content at once.
  2. Leveraging a heap to merge logs incrementally, rather than sorting entire chunks.

The POC for this refactoring shows a 90% reduction in memory usage with similar processing times!

Experiment Details

  • Log size: 830 MB
  • Approximately 8,670,000 lines

Main Root Causes of OOM

  1. _interleave_logs function in airflow.utils.log.file_task_handler
  • Extends all log strings into the records list.
  • Sorts the entire records list.
  • Yields lines with deduplication.
  2. _read method in airflow.utils.log.file_task_handler.FileTaskHandler
  • Joins all aggregated logs into a single string using:
    "\n".join(_interleave_logs(all_log_sources))
  3. Methods that use _read:
    These methods read the entire log content and return it as a string instead of a generator:
    • _read_from_local
    • _read_from_logs_server
    • _read_remote_logs (implemented by providers)

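To make the memory cost concrete, here is a deliberately simplified sketch of the current pattern (illustrative names only; the real _interleave_logs also parses timestamps for ordering). Every line from every source is materialized in one list before sorting, so peak memory grows with the total log size:

```python
def interleave_logs_eager(*sources: str):
    """Simplified sketch of the current approach: all lines from all
    sources are held in memory at once, then sorted as one list."""
    records = []
    for source in sources:
        records.extend(source.splitlines())  # materialize every line
    last = None
    for line in sorted(records):  # full in-memory sort
        if line != last:  # deduplicate adjacent identical lines
            yield line
        last = line
```

The subsequent `"\n".join(...)` in `_read` then rebuilds the result as one giant string, adding a second full-size copy on top of the records list.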
Proposed Refactoring Solution

The main concept includes:

  • Return a generator for reading log sources (local or external) instead of the whole file content as a string.
  • Merge logs using a k-way merge instead of sorting.
    • Since each log source is already sorted, merge them incrementally using heapq over streams of log lines.
    • Return a stream of the merged result.
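The merge step above can be sketched with the standard library's heapq.merge, assuming each source already yields lines in sorted order (illustrative names; the real change would compare parsed timestamps rather than raw lines):

```python
import heapq
from typing import Iterable, Iterator


def interleave_log_streams(*streams: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted line streams lazily. heapq.merge keeps only
    one pending line per stream on the heap, so peak memory is O(k)
    for k sources instead of O(total log size)."""
    last = None
    for line in heapq.merge(*streams):
        if line != last:  # deduplicate adjacent identical lines
            yield line
        last = line
```

Because the result is itself a generator, the caller can stream it straight to the HTTP response without ever holding the merged log in memory.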

Breaking Changes in This Solution

  1. Interface of the read Method in FileTaskHandler:

    • Will now return a generator instead of a string.
  2. Interfaces of read_log_chunks and read_log_stream in TaskLogReader:

    • Adjustments to support the generator-based approach.
  3. Methods That Use _read

    • _read_from_local
    • _read_from_logs_server
    • _read_remote_logs (implemented by 10 providers)
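For illustration, a generator-based counterpart to _read_from_local could look like the following hypothetical sketch (not the actual Airflow API; names and signature are assumptions):

```python
from pathlib import Path
from typing import Iterator


def read_from_local_stream(path: Path) -> Iterator[str]:
    """Hypothetical generator-based replacement for a string-returning
    reader: the file is consumed line by line, so only one line is
    resident in memory at a time."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")
```

Providers implementing _read_remote_logs would adopt the same shape, yielding lines (or small chunks) from their backend instead of returning one accumulated string.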

Experimental Environment:

  • Setup: Docker Compose without memory limits.
  • Memory Profiling: memray
  • Log Size: 830 MB, approximately 8,670,000 lines

Benchmark Metrics

Summary

Feel free to share any feedback! I believe we should have more discussions before adopting this solution, as it involves breaking changes to the FileTaskHandler interface and requires refactoring in 10 providers as well.

Related issues

#44753

TODO Tasks

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
