Description
Feature request?
We have been using STNO for some time now, its awesome, but only now we detected this behaviour.
STNO v3.1.0 since birth, on our setup we had a --parameter-overrides "DefaultRoute" = "Custom-Destinations" for a default-route 10.0.0.0/8.
We were able to backtrack the function tagging the route change.
"STNOStatus-RouteTable : Route(s) added to the route table."
Cloudwatch logs also show that two route tables were changed:
2023-11-06 12:06:z1,zzz Adding route: 10.0.0.0/8
2023-11-06 12:06:z1,zzz Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxxxxx into the route table: rtb-yyyyyyyyyyyy
2023-11-06 12:07:z8,zzz Adding route: 10.0.0.0/8
2023-11-06 12:07:z8,zzz Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxxxxx into the route table: rtb-tttttttttttttttttt
def _update_route_table_with_cidr_blocks(self, existing_routes):
cidr_blocks = convert_string_to_list_with_no_whitespaces(environ.get("CIDR_BLOCKS"))
if len(cidr_blocks) > 0:
for route in cidr_blocks:
self.logger.debug(f"Adding route: {route}")
self._find_existing_default_route(existing_routes, route)
self._update_route_table(route)
...
def _update_route_table(self, route):
...
if (
self.event.get("Action") == "AddSubnet"
or self.event.get("Action") == "CreateTgwVpcAttachment"
):
# create route in spoke account route table
self._create_route(ec2, route)
elif (
self.event.get("Action") == "RemoveSubnet"
or self.event.get("Action") == "DeleteTgwVpcAttachment"
):
# delete route from spoke account route table
self._delete_route(ec2, route)
...
def _create_route(self, destination):
...
else:
self.spoke_ec2_client.create_route_cidr_block(
destination,
self.event.get("RouteTableId"),
environ.get("TGW_ID"),
)
self._create_tag(
self.event.get("RouteTableId"),
"RouteTable",
"Route(s) added to the route table.",
)
...
ec2.py
def create_route_cidr_block(
self,
vpc_cidr: str,
route_table_id: str,
transit_gateway_id: str
) -> CreateRouteResultTypeDef:
response = self.ec2_client.create_route(
DestinationCidrBlock=vpc_cidr,
RouteTableId=route_table_id,
TransitGatewayId=transit_gateway_id,
)
The function _update_route_table_with_cidr_blocks to Update the Routing Table Logs "Adding route", then finds a route and updates it? ( _find_existing_default_route + _update_route_table)
The real life result to us was that, our customer had a considerable impact/downtime because he had a default-route configured towards another TGW, and when STNO finished, inside the Action "default_route_crud_operations", we observed that 10.0.0.0/8 route changed to point towards the our newly configured tgw-attachment. This could be worse to roll-back if he didn't know what TGW was there before the change.
This might be doing its thing as coded/expected, but now we are considering changing default behaviour to DefaultRoute="Configure-Manually", just to avoid changing current routes that may exist, still we will loose the automated adding routes where they may be required...
Some thoughts/suggestions come to mind:
- In the _update_route_table_with_cidr_blocks, shouldn't the log state "Updating route" instead of "Adding route"? if the function proceeds into a find + update.
- There should be some protection/validation before updating an existing route.
- Its risky to change routes thru an Automated procedure, may need a bit more controls on these changes, routing rules may differ from customer to customer.
- Is the create-route ec2.py updating the route, instead of simply creating it, or is this done somewhere else? RollBack the route to previous TGW should exist if STNO process fails.
- Just saw that there is a _delete_route inside the fn _update_route_table, but there was no log informing the deletion of 10.0.0.0/8. And found in the logs more info showing STNO was messing up the VPC routing tables for a while, ever since we did add, delete then add VPC TAGs again. The delete TAGs removed the route which is awful.
- fn _find_existing_default_route logged no findings.
- fn _update_route_table logged no findings.
- In fact, it was observed that no route to 10.0.0.0/8 existed in 3x routing tables on tenants side, but no downtime was detected until a successfull STNO, where a new route was added to each RT on Monday 12h00: "Adding destination : 10.0.0.0/8 to TGW gateway: tgw-xxxxxxxxx into the route table:". No 10.0.0.0/8 existed during the weekend, and no issue was reported, only at 14h00-Monday after adding wrong TGW to routing tables.
We thank you for your thoughts, feed-back or anythings onto helping us is appreciated very much.
Thanks and keep up the good work guys.
Blue