-
Notifications
You must be signed in to change notification settings - Fork 1.3k
BGP evpn testing and bug fixes related to non default EVPN backbone #18358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ebeb641
to
8166406
Compare
ci:rerun |
48aba29
to
c5f826f
Compare
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
I introduce some testing of EVPN VRF leaking in #18309 with a completely new topology, does it make sense to merge them? |
Hi @chdxD1 , All contributors own that test, so as a contributor, you can choose what you prefer.
|
c5f826f
to
55c6173
Compare
The description here says "testing", but ... there's a pretty fundamental change to bgp in here also? Has there been discussion about the possible impact of those bgp changes? |
@@ -3464,12 +3464,6 @@ static struct bgp *bgp_create(as_t *as, const char *name, | |||
name, bgp->as_pretty); | |||
} | |||
|
|||
/* Default the EVPN VRF to the default one */ | |||
if (inst_type == BGP_INSTANCE_TYPE_DEFAULT && !bgp_master.bgp_evpn) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a significant change: why is it necessary? if the explicit config in some other instance works, then this is benign. and if the config change doesn't work ... then that's would be something to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mjstapp , without this fix, the order of BGP instance configs matters. And this is an issue of having this lock.
About removing that, please find below the discussion I had with @Tuetuopay:
Ca va trigger pour tout le monde un souci que j'ai eu en prod et que j'ai commencé à fix.
Un RR EVPN n'est pas vtep (donc bm->bgp_evpn n'a pas de raison d'être set).
Sauf que dans le cas d'un rr, frr tente quand même d'import les routes, ne trouve pas de vrf evpn, et log en boucle
bgpd[781] [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet
actuellement ça ne se produit que si
-
- evpn est dans une vrf non-default et
-
- frr n'a pas de default vrf de conf.
ce patch (ne pas mettre default dans bm->bgp_evpn) est à combiner avec ce fix de logging. (sur ma prod j'ai ce message des centaines de fois par seconde)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so ... that exchange isn't very helpful to me, since it's in French?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for translation, the explanation is the following one:
"
It was a RR EVPN setup made for production. Initially, there were a lot of log messages flooded.
(the ones I propose to remove). Those messages were displayed because there was no backbone defined, because 'advertise-all-vni' command was not used.
This control was a workaround to avoid having log messages.
So a default evpn backbone has been defined.
Those log message are displayed on two cases:
- evpn is in a non default bgp instance
- default bgp instance is not configured
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me chime in (in English, this time).
The issue at hand is setting bm->bgp_evpn
aka bgp_master.bgp_evpn
blindly when creating a non-vrf BGP instance, instead of setting it only when FRR is a VTEP (i.e. advertise-all-vni
enabled). It makes the whole behavior inconsistent between having a default instance and not having one.
Say, for example, we make a route-reflector out of FRR using only the non-default VRF. Such a setup does not need advertise-all-vni
, and it would be detrimental as it would increase the processing done for every route passing through FRR. As a matter of fact, there are a lot of stuff that FRR tries to do when bm->bgp_evpn
is not NULL, which will never succeed anyways if advertise-all-vni
is not enabled.
The issue I describe in the French extract above is about log messages that Philippe also disables. The setup described above (route-reflector in a non-default vrf, with no default vrf instance) will log the bgpd[781] [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet
message for every single route received by FRR. In production, I have EVPN route-reflectors that received over 250 EVPN routes/s, which translates to this log message flooding everywhere.
The proper behavior overall is to set this field when enabling the advertise-all-vni
knob, and only then. This will save work overall because my setup shows that quite a bit of the EVPN code is guarded with if (bm->bgp_evpn)
, and that having it set to NULL is not detrimental if the speaker is not a VTEP. As a testament to the stability of such a change: overall I have more than 50 RRs in production with this setup, and they've been rock-solid.
Changed the title from testing to "testing and bug fixing". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to get more opinions about the change in bgpd.c, and about the change and logging changes in bgp_evpn.c
improved commit log information. |
f1868fc
to
a4e8ea2
Compare
rebased |
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
a4e8ea2
to
ed808e3
Compare
264193e
to
b40766b
Compare
c6b5efa
to
6c2f224
Compare
6c2f224
to
7980eca
Compare
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
7980eca
to
ecd7b04
Compare
ci:rerun |
… vrf Add a second test in bgp_evpn_rt5 folder that runs the same testing, but by using a non default vrf as evpn backbone for R2. Signed-off-by: Philippe Guibert <[email protected]>
The pylint code style needs to be executed. Signed-off-by: Philippe Guibert <[email protected]>
Seen with bfd_vrf_topo1, and bgp_evpn_rt5 on Ubuntu 22.04 hwe. Do not call ns_delete() from zebra_vrf_delete(), which calls zebra_ns_delete(). - If a netns is removed from the system, vrf_delete()->zebra_vrf_delete() is called before calling ns_delete() (see zebra_ns_notify.c). - If zebra is terminating, zebra_ns_final_shutdown() will call zebra_vrf_delete(). > ==616172==ERROR: AddressSanitizer: heap-use-after-free on address 0x6160000ae3a4 at pc 0x556cdc178d8f bp 0x7ffe4f41ace0 sp 0x7ffe4f41acd0 > READ of size 4 at 0x6160000ae3a4 thread T0 > #0 0x556cdc178d8e in ctx_info_from_zns zebra/zebra_dplane.c:3394 > FRRouting#1 0x556cdc178f55 in dplane_ctx_ns_init zebra/zebra_dplane.c:3410 > FRRouting#2 0x556cdc17b829 in dplane_ctx_nexthop_init zebra/zebra_dplane.c:3759 > FRRouting#3 0x556cdc18095f in dplane_nexthop_update_internal zebra/zebra_dplane.c:4566 > FRRouting#4 0x556cdc1813f1 in dplane_nexthop_delete zebra/zebra_dplane.c:4793 > FRRouting#5 0x556cdc229234 in zebra_nhg_uninstall_kernel zebra/zebra_nhg.c:3484 > FRRouting#6 0x556cdc21f8fe in zebra_nhg_decrement_ref zebra/zebra_nhg.c:1804 > FRRouting#7 0x556cdc24b05a in route_entry_update_nhe zebra/zebra_rib.c:456 > FRRouting#8 0x556cdc255083 in rib_re_nhg_free zebra/zebra_rib.c:2633 > FRRouting#9 0x556cdc25e3bb in rib_unlink zebra/zebra_rib.c:4049 > FRRouting#10 0x556cdc24c9b0 in zebra_rtable_node_cleanup zebra/zebra_rib.c:903 > FRRouting#11 0x7fb25c173144 in route_node_free lib/table.c:75 > FRRouting#12 0x7fb25c17337f in route_table_free lib/table.c:111 > FRRouting#13 0x7fb25c172fe4 in route_table_finish lib/table.c:46 > FRRouting#14 0x556cdc266f62 in zebra_router_free_table zebra/zebra_router.c:191 > FRRouting#15 0x556cdc2673ef in zebra_router_terminate zebra/zebra_router.c:243 > FRRouting#16 0x556cdc10638b in zebra_finalize zebra/main.c:240 > FRRouting#17 0x7fb25c18e012 in event_call lib/event.c:2019 > FRRouting#18 0x7fb25c04afc6 in frr_run lib/libfrr.c:1247 > FRRouting#19 0x556cdc106deb in main zebra/main.c:543 > FRRouting#20 0x7fb25ba29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 > FRRouting#21 0x7fb25ba29e3f in __libc_start_main_impl ../csu/libc-start.c:392 > FRRouting#22 0x556cdc0c7ed4 in _start (/usr/lib/frr/zebra+0x192ed4) > > 0x6160000ae3a4 is located 36 bytes inside of 592-byte region [0x6160000ae380,0x6160000ae5d0) > freed by thread T0 here: > #0 0x7fb25c6b4537 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:127 > FRRouting#1 0x7fb25c0790e3 in qfree lib/memory.c:131 > FRRouting#2 0x556cdc22d9c9 in zebra_ns_delete zebra/zebra_ns.c:261 > FRRouting#3 0x7fb25c0ac400 in ns_delete lib/netns_linux.c:319 > FRRouting#4 0x556cdc28026a in zebra_vrf_delete zebra/zebra_vrf.c:343 > FRRouting#5 0x7fb25c197443 in vrf_delete lib/vrf.c:282 > FRRouting#6 0x7fb25c1987e8 in vrf_terminate_single lib/vrf.c:601 > FRRouting#7 0x7fb25c197a7a in vrf_iterate lib/vrf.c:394 > FRRouting#8 0x7fb25c198834 in vrf_terminate lib/vrf.c:609 > FRRouting#9 0x556cdc106345 in zebra_finalize zebra/main.c:223 > FRRouting#10 0x7fb25c18e012 in event_call lib/event.c:2019 > FRRouting#11 0x7fb25c04afc6 in frr_run lib/libfrr.c:1247 > FRRouting#12 0x556cdc106deb in main zebra/main.c:543 > FRRouting#13 0x7fb25ba29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 > > previously allocated by thread T0 here: > #0 0x7fb25c6b4a57 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154 > FRRouting#1 0x7fb25c078f91 in qcalloc lib/memory.c:106 > FRRouting#2 0x556cdc22d6a1 in zebra_ns_new zebra/zebra_ns.c:231 > FRRouting#3 0x556cdc22e30b in zebra_ns_init zebra/zebra_ns.c:429 > FRRouting#4 0x556cdc106cec in main zebra/main.c:480 > FRRouting#5 0x7fb25ba29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 > > SUMMARY: AddressSanitizer: heap-use-after-free zebra/zebra_dplane.c:3394 in ctx_info_from_zns Signed-off-by: Louis Scalbert <[email protected]> Signed-off-by: Philippe Guibert <[email protected]>
It is not possible to configure an EVPN underlay in a VRF: > ubuntu2204hwe(config)# router bgp 65000 > ubuntu2204hwe(config-router)# bgp router-id 192.168.105.41 > ubuntu2204hwe(config-router)# exit > ubuntu2204hwe(config)# router bgp 65000 vrf r2-vrf-evpn > ubuntu2204hwe(config-router)# neighbor 192.168.100.21 remote-as 65000 > ubuntu2204hwe(config-router)# address-family l2vpn evpn > ubuntu2204hwe(config-router-af)# neighbor 192.168.100.21 activate > ubuntu2204hwe(config-router-af)# advertise-all-vni > % Please unconfigure EVPN in VRF default > ubuntu2204hwe(config-router-af)# advertise-all-vni cannot set bm->bgp_evpn because it was already set when the default instance was configured. Remove setting bm->bgp_evpn at default instance creation. It is done anyway when configuring advertise-all-vni by invoking evpn_set_advertise_all_vni() -> bgp_set_evpn(). From now, bm->bgp_evpn is only modified by bgp_set_evpn(); inside, the bgp_lock() and bgp_unlock() calls are done only when bgp_evpn pointer are modified, and avoid multiple references. Fixes: ("e2f3a930c54c") bgpd: Allow non-default instance to be EVPN one Signed-off-by: Philippe Guibert <[email protected]>
When setting up a RR EVPN setup, the below error message is displayed for each received BGP update. It becomes flooding when many BGP updates are received. > 2025/03/11 14:38:24.959177 BGP: [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet > 2025/03/11 14:38:24.959188 BGP: [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet The message is just here to inform that there is no BGP backbone defined, which is normal under a RR configuration. Setting up a RR EVPN configuration should not lead to such message. Fix this by removing the log messages. Signed-off-by: Philippe Guibert <[email protected]>
Add a CE that advertises IP networksk to R2, which will import those prefixes in the L3VPN EVPN network. The import command and the match source-vrf commands are tested. Signed-off-by: Philippe Guibert <[email protected]>
When running 'no advertise-all-vni' command on the evpn bgp instance, then reconfiguring it is not possible: > r2(config)# router bgp 65000 vrf vrf-evpn > r2(config-router)# address-family l2vpn evpn > r2(config-router-af)# > r2(config-router-af)# no advertise-all-vni > r2(config-router-af)# advertise-all-vni > % Please unconfigure EVPN in VRF default > r2(config-router-af)# Actually, the evpn bgp instance is not the default one. Fix this by not setting the bgp evpn instance to the default BGP instance, when removing the 'advertise-all-vni' command. Fixes: e2f3a93 ("bgpd: Allow non-default instance to be EVPN one") Signed-off-by: Philippe Guibert <[email protected]>
Add a test that controls that it is possible to reconfigure a BGP EVPN instance by using the advertise-all-vni command. Negating this command will remove all the EVPN rib entries. Signed-off-by: Philippe Guibert <[email protected]>
ecd7b04
to
4f9995d
Compare
4f9995d
to
f58a727
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
LGTM too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Add some more testing: