Concurrency bug in http.Protocol when used in conjunction with client.NewHTTPReceiveHandler #1224

@mikhail5555

Description

A couple of facts

  1. http.Protocol.incoming is an unbuffered channel.
  2. Writing to an unbuffered channel with no listener is a blocking operation.

What is observed

After running an event consumer for an extended period of time (~20 hours), we noticed an increasing number of parked goroutines for HTTP requests, which was odd since we had a WriteTimeout set on our net/http.Server.
After further investigation, the goroutines appeared to be stuck on the following line of code:

p.incoming <- msgErr{msg: m, respFn: fn} // Send to Request

How it happens:

  1. Two concurrent requests arrive (req1 and req2)
  2. Both requests start a goroutine in
    go func() {
        ctx := req.Context()
        msg, respFn, err := r.p.Respond(ctx)
        if err != nil {
            cecontext.LoggerFrom(context.TODO()).Debugw("failed to call Respond", zap.Error(err))
        } else if err := r.invoker.Invoke(ctx, msg, respFn); err != nil {
            cecontext.LoggerFrom(context.TODO()).Debugw("failed to call Invoke", zap.Error(err))
        }
    }()
  3. HOWEVER, since
    case in, ok := <-p.incoming:
    reads from a shared channel, there is NO guarantee that the received msgErr is actually associated with the incoming request (context).
  4. This can lead to one of the following scenarios:
    a. goroutine(req1) consumes msgErr for req1, goroutine(req2) consumes msgErr for req2: No problems
    b. goroutine(req2) consumes msgErr for req1, goroutine(req1) consumes msgErr for req2: No problems* (the context will be wrong -> tracing issues, and possibly early ctx termination if the requests take different amounts of time to be consumed)
    c. goroutine(req1) consumes msgErr for req2 before req2's goroutine has started: the goroutine created for req2 can be terminated by the req2.Context() check, leaving req1 'hanging' since req2 effectively 'consumed' two goroutines (breaking the 1:1 relation)

If/when 4.c happens, you won't notice it immediately as long as new events keep coming in, but the pairing will be out of sync from that point on (goroutine(req5) will consume the msgErr from req4, and will have to wait for req6 before its own msgErr is consumed).

I took the liberty of generating a small diagram to better demonstrate this bug, since it's pretty difficult to explain.

Suggested fixes

  1. Remove the goroutine and inline the request consumption
  2. Use a dedicated channel per request
  3. Include the req.Ctx in the msgErr, and always maintain a 1:1 relation between goroutines and Protocol.ServeHTTP calls.
  4. (Hotfix to avoid blocking requests) make the send to the channel non-blocking
	select {
	case p.incoming <- msgErr{msg: m, respFn: fn}:
		// Send to Request
	case <-time.After(10 * time.Second):
		http.Error(rw, fmt.Sprintf("Cannot forward CloudEvent: %s", finishErr), http.StatusInternalServerError)
		return
	}

Please let me know if I've explained it clearly enough or if any extra information is required!
