Skip to content

STNO Spoke VPC Routing Update #105

Open
@Cupidazul

Description

@Cupidazul

Feature request?

We have been using STNO for some time now, its awesome, but only now we detected this behaviour.

STNO v3.1.0 since birth, on our setup we had a --parameter-overrides "DefaultRoute" = "Custom-Destinations" for a default-route 10.0.0.0/8.

We were able to backtrack the function tagging the route change.
"STNOStatus-RouteTable : Route(s) added to the route table."

Cloudwatch logs also show that two route tables were changed:

2023-11-06 12:06:z1,zzz Adding route: 10.0.0.0/8
2023-11-06 12:06:z1,zzz Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxxxxx into the route table: rtb-yyyyyyyyyyyy
2023-11-06 12:07:z8,zzz Adding route: 10.0.0.0/8
2023-11-06 12:07:z8,zzz Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxxxxx into the route table: rtb-tttttttttttttttttt

vpc_handler

    def _update_route_table_with_cidr_blocks(self, existing_routes):
        cidr_blocks = convert_string_to_list_with_no_whitespaces(environ.get("CIDR_BLOCKS"))
        if len(cidr_blocks) > 0:
            for route in cidr_blocks:
                self.logger.debug(f"Adding route: {route}")
                self._find_existing_default_route(existing_routes, route)
                self._update_route_table(route)

...
def _update_route_table(self, route):
...
            if (
                self.event.get("Action") == "AddSubnet"
                or self.event.get("Action") == "CreateTgwVpcAttachment"
            ):
                # create route in spoke account route table
                self._create_route(ec2, route)
            elif (
                self.event.get("Action") == "RemoveSubnet"
                or self.event.get("Action") == "DeleteTgwVpcAttachment"
            ):
                # delete route from spoke account route table
                self._delete_route(ec2, route)
...
def _create_route(self, destination):
...
                else:
                    self.spoke_ec2_client.create_route_cidr_block(
                        destination,
                        self.event.get("RouteTableId"),
                        environ.get("TGW_ID"),
                    )
                self._create_tag(
                    self.event.get("RouteTableId"),
                    "RouteTable",
                    "Route(s) added to the route table.",
                )
...
ec2.py

    def create_route_cidr_block(
            self,
            vpc_cidr: str,
            route_table_id: str,
            transit_gateway_id: str
    ) -> CreateRouteResultTypeDef:
        response = self.ec2_client.create_route(
            DestinationCidrBlock=vpc_cidr,
            RouteTableId=route_table_id,
            TransitGatewayId=transit_gateway_id,
        )

The function _update_route_table_with_cidr_blocks to Update the Routing Table Logs "Adding route", then finds a route and updates it? ( _find_existing_default_route + _update_route_table)

The real life result to us was that, our customer had a considerable impact/downtime because he had a default-route configured towards another TGW, and when STNO finished, inside the Action "default_route_crud_operations", we observed that 10.0.0.0/8 route changed to point towards the our newly configured tgw-attachment. This could be worse to roll-back if he didn't know what TGW was there before the change.

This might be doing its thing as coded/expected, but now we are considering changing default behaviour to DefaultRoute="Configure-Manually", just to avoid changing current routes that may exist, still we will loose the automated adding routes where they may be required...

Some thoughts/suggestions come to mind:

  1. In the _update_route_table_with_cidr_blocks, shouldn't the log state "Updating route" instead of "Adding route"? if the function proceeds into a find + update.
  2. There should be some protection/validation before updating an existing route.
  3. Its risky to change routes thru an Automated procedure, may need a bit more controls on these changes, routing rules may differ from customer to customer.
  4. Is the create-route ec2.py updating the route, instead of simply creating it, or is this done somewhere else? RollBack the route to previous TGW should exist if STNO process fails.
  5. Just saw that there is a _delete_route inside the fn _update_route_table, but there was no log informing the deletion of 10.0.0.0/8. And found in the logs more info showing STNO was messing up the VPC routing tables for a while, ever since we did add, delete then add VPC TAGs again. The delete TAGs removed the route which is awful.
  6. fn _find_existing_default_route logged no findings.
  7. fn _update_route_table logged no findings.
  8. In fact, it was observed that no route to 10.0.0.0/8 existed in 3x routing tables on tenants side, but no downtime was detected until a successfull STNO, where a new route was added to each RT on Monday 12h00: "Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxx into the route table:". No 10.0.0.0/8 existed during the weekend, and no issue was reported, only at 14h00-Monday after adding wrong TGW to routing tables.

We thank you for your thoughts, feed-back or anythings onto helping us is appreciated very much.

Thanks and keep up the good work guys.
Blue

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requesttriagedHas been triaged by solutions team

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions