Skip to content

ComputeDelay called with maxDelay equal to remainingTime returns a '0' delay, leading to spinning #621

@aslatter

Description

@aslatter

I'm trying to use the new "test/synctest" package to test my use of waiters in my own code using AWS SDKs.

My code.

Code under test:

Click to expand
package deleter

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

type ec2util struct {
	client ec2Client
}

func (e *ec2util) DeleteInstance(ctx context.Context, r string) error {
	_, err := e.client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{r},
		DryRun:      aws.Bool(false),
	})
	if err != nil {
		return fmt.Errorf("terminating instance %s: %w", r, err)
	}

	// Wait for instance termination to complete
	waiter := ec2.NewInstanceTerminatedWaiter(e.client)
	err = waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{r},
	}, time.Minute*10)
	if err != nil {
		return fmt.Errorf("waiting for instance termination: %w", err)
	}

	return nil
}

type ec2Client interface {
	DescribeInstances(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error)
	TerminateInstances(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error)
}

Test:

Click to expand
package deleter

import (
	"context"
	"testing"
	"testing/synctest"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// TestEC2Instance_Delete_WaiterTimeout verifies that DeleteInstance
// returns an error when the waiter times out using synctest.
func TestEC2Instance_Delete_WaiterTimeout(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		fake := &fakeEC2Client{}

		provider := &ec2util{client: fake}
		ctx := t.Context()

		// DeleteResource should fail with timeout - the waiter will poll repeatedly
		// but the fake never returns terminated state, so it times out.
		err := provider.DeleteInstance(ctx, "i-test")
		if err == nil {
			t.Fatal("expected timeout error, got nil")
		}
	})
}

type fakeEC2Client struct {
}

func (f *fakeEC2Client) DescribeInstances(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
	// pretend like any instance asked about is in shutting-down state
	var matchingInstances []ec2types.Instance

	for _, id := range params.InstanceIds {
		matchingInstances = append(matchingInstances, ec2types.Instance{
			InstanceId: aws.String(id),
			State: &ec2types.InstanceState{
				Name: ec2types.InstanceStateNameShuttingDown,
			},
		})
	}

	// Wrap instances in a reservation (EC2 API structure)
	var reservations []ec2types.Reservation
	reservations = append(reservations, ec2types.Reservation{
		Instances: matchingInstances,
	})

	return &ec2.DescribeInstancesOutput{
		Reservations: reservations,
	}, nil
}

func (f *fakeEC2Client) TerminateInstances(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error) {
	// pretend to shut down the instances
	var stateChanges []ec2types.InstanceStateChange

	for _, id := range params.InstanceIds {
		stateChanges = append(stateChanges, ec2types.InstanceStateChange{
			InstanceId: aws.String(id),
			PreviousState: &ec2types.InstanceState{
				Name: ec2types.InstanceStateNameRunning,
			},
			CurrentState: &ec2types.InstanceState{
				Name: ec2types.InstanceStateNameShuttingDown,
			},
		})
	}

	return &ec2.TerminateInstancesOutput{
		TerminatingInstances: stateChanges,
	}, nil
}

The test TestEC2Instance_Delete_WaiterTimeout should fail quickly, as each retry in the waiter advance our fake-clock a little bit at a time.

However this test actually ends up spinning forever.

Inside the waiter, on about the 12th attempt (and every subsequent attempt) in the waiter ComputeDelay starts returning a delay of zero. Since the waiter stops sleeping, time stops advancing, the test sits an spins forever. (The delay within the waiter has a random element to it, so the exact attempt that starts getting stuck is not precise)

Specifically, a call to waiter.ComputeDelay with arguments:

  • attempt: 12
  • minDelay: 15s
  • maxDelay: 120s
  • remainingTime: 15s

Seems to always return a zero delay, when the delay should probably be "15s".

Think the problem is this bit of code:

	// check if this is the last attempt possible and compute delay accordingly
	if remainingTime-delay <= minDelay {
		delay = remainingTime - minDelay
	}

I think this code is trying to make sure to always make room for one more attempt after the current attempt? But if the min-delay is significantly larger than the length of time it takes to perform a retry, once we get near the end of our allotted time-window we will dramatically increase the rate of attempts. I'm not 100% sure here.

In the synctest case, the length of time it takes to perform a retry is zero, so we never make progress once our remainingTime is less-than or equal-to the "minDelay".

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions