Support for true streaming of large file uploads #10845

@agionoja

Motivation

The @remix-run/multipart-parser is exceptionally fast and efficient for small-to-medium payloads. However, its current implementation buffers the entire content of each file part into memory before yielding it to the consumer. This behavior rules out what is arguably the most critical use case for a streaming parser: handling large file uploads in a memory-constrained environment.

This issue proposes introducing a true, end-to-end streaming API for file parts to make the parser robust for all use cases and align its implementation with the "Memory Efficient" promise in the README.

The Core Issue: Unbounded Memory Buffering

In a real-world test on a system with 16GB of RAM, the current buffering behavior proved to be a critical bottleneck:

  • Uploading a 1GB file caused the process's memory usage to spike to over 1GB.
  • Attempting to upload a 2.5GB file exhausted all available system memory, crashing the process.
  • In contrast, a library like busboy on the same system handled a 20GB file upload with a stable memory footprint of ~700MB (a sketch of the busboy pattern follows this list).
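
The flat memory profile in the busboy test comes from its event-driven streaming: file content is handed to the consumer as a Node.js Readable and written out chunk by chunk. A minimal sketch of that pattern (the exact test server is not shown in this issue; the port and upload path are illustrative):

import * as http from 'node:http';
import * as path from 'node:path';
import { createWriteStream } from 'node:fs';
import busboy from 'busboy';

http.createServer((req, res) => {
  const bb = busboy({ headers: req.headers });
  bb.on('file', (name, file, info) => {
    // `file` is a Readable that yields chunks as they arrive, so memory
    // stays roughly constant regardless of the upload size.
    file.pipe(createWriteStream(path.join('uploads', info.filename)));
  });
  bb.on('close', () => res.end('Upload complete'));
  req.pipe(bb);
}).listen(3002);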

The current API design encourages this memory-intensive pattern, as the entire file's content is loaded into part.bytes before it can be processed:

for await (let part of parseMultipartRequest(request)) {
  if (part.isFile) {
    // By the time this loop yields a `part`, its entire content is already
    // buffered in `part.bytes`, causing memory usage to spike to the size of the file.
    await saveFile(part.filename, part.bytes);
  }
}

This effectively turns a streaming transport layer into a buffered-per-part implementation at the application layer, negating the benefits of streaming for large files.

Steps to Reproduce

The memory exhaustion issue can be reliably reproduced using the bun-large-file demo within this repository.

  1. Clone the repository and navigate to the demo:

    git clone https://github.com/remix-run/remix.git
    cd remix/packages/multipart-parser/demos/bun-large-file
  2. Use a minimal server to isolate the issue: ensure that server.ts uses the standard parseMultipartRequest and accesses part.bytes.

    // packages/multipart-parser/demos/bun-large-file/server.ts
    import { parseMultipartRequest } from '@remix-run/multipart-parser'
    import * as fs from 'fs/promises'
    import * as path from 'path'
    
    const UPLOAD_DIR = path.resolve(__dirname, 'uploads')
    await fs.mkdir(UPLOAD_DIR, { recursive: true })
    
    Bun.serve({
      port: 3001,
      maxRequestBodySize: Infinity,
      async fetch(request) {
        if (request.method === 'POST') {
          try {
            for await (let part of parseMultipartRequest(request, { maxFileSize: Infinity })) {
              if (part.isFile) {
                const filePath = path.join(UPLOAD_DIR, part.filename!)
                // This line buffers the entire file into memory before writing.
                await fs.writeFile(filePath, part.bytes)
              }
            }
            return new Response('Upload complete', { status: 200 })
          } catch (error) {
            console.error(error)
            return new Response('Error', { status: 500 })
          }
        }
        return new Response('OK')
      },
    })
    console.log('Server listening on http://localhost:3001 ...')
  3. Install dependencies and start the server:

    pnpm install
    bun start
  4. Upload a file larger than available RAM:

    # Create a dummy 3GB file
    dd if=/dev/zero of=large_file.bin bs=1G count=3
    
    # Upload the file
    curl -X POST -F "file=@large_file.bin" http://localhost:3001
  5. Monitor memory usage: observe the bun process's memory consumption. It will grow linearly with the size of the upload, eventually leading to process or system instability (one way to observe this is sketched below).
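
To watch the growth, one option (an optional addition to the server.ts above, not part of the demo itself) is to log the resident set size once per second:

// Works in both Bun and Node.js.
setInterval(() => {
  const rssMb = process.memoryUsage().rss / (1024 * 1024);
  console.log(`rss: ${rssMb.toFixed(0)} MB`);
}, 1000);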

Proposed Solution: A True Streaming API for Parts

To address this, the content of a file part should be exposed as a ReadableStream, allowing the consumer to process the file in chunks as they arrive. This keeps memory usage low and constant, regardless of file size.
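
Concretely, a consumer could then read each part one chunk at a time, holding only a single chunk in memory. A sketch of that consumption model, assuming the part.stream property proposed below (handleChunk is a placeholder for whatever the application does with each chunk):

const reader = part.stream.getReader();
let received = 0;
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  received += value.byteLength;
  await handleChunk(value); // placeholder: write to disk, cloud storage, etc.
}
console.log(`Streamed ${received} bytes without buffering the whole file.`);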

Proposal 1: Expose a ReadableStream on the MultipartPart

This approach is idiomatic in modern JavaScript and preserves the ergonomic for await...of API.

import { createWriteStream } from 'node:fs';
import { Writable } from 'node:stream';

for await (let part of parseMultipartRequest(request)) {
  if (part.isFile) {
    // Get a stream of the file content
    const stream = part.stream; // or part.contentStream

    // Pipe it directly to a file on disk or a cloud storage service.
    // Writable.toWeb adapts the Node.js write stream to the web-standard
    // WritableStream that ReadableStream.pipeTo expects.
    await stream.pipeTo(Writable.toWeb(createWriteStream(part.filename)));
  } else {
    // Non-file parts can still be buffered as they are typically small
    console.log(part.name, await part.text());
  }
}

Implementation Considerations:

  • To prevent accidental buffering, accessing .bytes or .text() on a part that has had its stream consumed should throw an error.
  • Conversely, accessing .stream after .bytes has been read should yield an empty stream or throw.
  • This new property would only be necessary for file parts (isFile === true); a sketch of these one-shot semantics follows.
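
A minimal sketch of how that one-shot behavior could be enforced (names and shapes are hypothetical, not the library's actual internals; async iteration over ReadableStream assumes Node 18+ or Bun):

class FilePartSketch {
  #source: ReadableStream<Uint8Array>;
  #consumed = false;

  constructor(source: ReadableStream<Uint8Array>) {
    this.#source = source;
  }

  get stream(): ReadableStream<Uint8Array> {
    // Hand the underlying stream out exactly once.
    if (this.#consumed) throw new Error('Part content already consumed');
    this.#consumed = true;
    return this.#source;
  }

  async bytes(): Promise<Uint8Array> {
    // Buffering goes through `stream`, so this throws after the stream
    // has been handed out, and vice versa.
    const chunks: Uint8Array[] = [];
    for await (const chunk of this.stream) chunks.push(chunk);
    const out = new Uint8Array(chunks.reduce((n, c) => n + c.byteLength, 0));
    let offset = 0;
    for (const c of chunks) {
      out.set(c, offset);
      offset += c.byteLength;
    }
    return out;
  }
}
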
Proposal 2: An Event-Driven API (like Busboy)

Alternatively, an event-based approach is a well-established pattern for memory-efficient stream processing.

import { createWriteStream } from 'node:fs';
import { Writable } from 'node:stream';

const parser = createStreamingMultipartParser(request);

parser.on('file', (filename, stream, contentType) => {
  // 'stream' is a ReadableStream of the file content
  console.log(`Receiving file: ${filename}`);
  stream.pipeTo(Writable.toWeb(createWriteStream(filename)));
});

parser.on('field', (name, value) => {
  console.log(`Received field: ${name} = ${value}`);
});

await parser.done();

This pattern, while a larger departure from the current API, has proven highly effective for this use case. The two proposals are also not mutually exclusive: the event-driven API could likely be layered on top of Proposal 1, as sketched below.
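
A sketch of such an adapter (assuming the part.stream property from Proposal 1; the content-type property name is assumed, not confirmed against the library):

import { EventEmitter } from 'node:events';
import { parseMultipartRequest } from '@remix-run/multipart-parser';

function createStreamingMultipartParser(request: Request) {
  const emitter = new EventEmitter();
  const done = (async () => {
    for await (const part of parseMultipartRequest(request)) {
      if (part.isFile) {
        // NOTE: a real implementation would need to wait for the file
        // stream to be fully consumed before advancing, since parts
        // arrive sequentially on the wire.
        emitter.emit('file', part.filename, part.stream, part.contentType);
      } else {
        emitter.emit('field', part.name, await part.text());
      }
    }
  })();
  return Object.assign(emitter, { done: () => done });
}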

Conclusion

Implementing a true streaming primitive for file parts would solidify @remix-run/multipart-parser's position as a top-tier solution. It would combine its already benchmarked speed with the memory safety required for modern, production-grade applications, making it a clear and compelling choice for all multipart parsing needs.
