Skip to content

BGP evpn testing and bug fixes related to non default EVPN backbone #18358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 5, 2025

Conversation

pguibert6WIND
Copy link
Member

Add some more testing:

  • use non default BGP instance as EVPN instance
  • test source-vrf filtering command

@pguibert6WIND
Copy link
Member Author

ci:rerun

Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@chdxD1
Copy link
Contributor

chdxD1 commented Mar 20, 2025

I introduce some testing of EVPN VRF leaking in #18309 with a completely new topology, does it make sense to merge them?

@pguibert6WIND
Copy link
Member Author

I introduce some testing of EVPN VRF leaking in #18309 with a completely new topology, does it make sense to merge them?

Hi @chdxD1 ,

All contributors own that test, so as a contributor, you can choose what you prefer.
Personally, I would opt for 2 or 3, because I feel this setup is becoming a bit messy, now.

  1. Reusing the same python file may be problematic if our requirements diverge.
  2. Having a second python file next to that one in the same folder can be benefit: we have seamless topologies, and can use shared libraries.
  3. Having a completely separate test can be benefit if your topology is very different from that one. This does not prevent from having shared libraries.

@mjstapp
Copy link
Contributor

mjstapp commented Apr 2, 2025

The description here says "testing", but ... there's a pretty fundamental change to bgp in here also? Has there been discussion about the possible impact of those bgp changes?

@@ -3464,12 +3464,6 @@ static struct bgp *bgp_create(as_t *as, const char *name,
name, bgp->as_pretty);
}

/* Default the EVPN VRF to the default one */
if (inst_type == BGP_INSTANCE_TYPE_DEFAULT && !bgp_master.bgp_evpn) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a significant change: why is it necessary? if the explicit config in some other instance works, then this is benign. and if the config change doesn't work ... then that's would be something to fix.

Copy link
Member Author

@pguibert6WIND pguibert6WIND Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjstapp , without this fix, the order of BGP instance configs matters. And this is an issue of having this lock.
About removing that, please find below the discussion I had with @Tuetuopay:

Ca va trigger pour tout le monde un souci que j'ai eu en prod et que j'ai commencé à fix.
Un RR EVPN n'est pas vtep (donc bm->bgp_evpn n'a pas de raison d'être set).
Sauf que dans le cas d'un rr, frr tente quand même d'import les routes, ne trouve pas de vrf evpn, et log en boucle

bgpd[781] [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet

actuellement ça ne se produit que si

    1. evpn est dans une vrf non-default et
    1. frr n'a pas de default vrf de conf.

ce patch (ne pas mettre default dans bm->bgp_evpn) est à combiner avec ce fix de logging. (sur ma prod j'ai ce message des centaines de fois par seconde)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so ... that exchange isn't very helpful to me, since it's in French?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for translation, the explanation is the following one:
"
It was a RR EVPN setup made for production. Initially, there were a lot of log messages flooded.
(the ones I propose to remove). Those messages were displayed because there was no backbone defined, because 'advertise-all-vni' command was not used.

This control was a workaround to avoid having log messages.
So a default evpn backbone has been defined.

Those log message are displayed on two cases:

  • evpn is in a non default bgp instance
  • default bgp instance is not configured

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me chime in (in English, this time).

The issue at hand is setting bm->bgp_evpn aka bgp_master.bgp_evpn blindly when creating a non-vrf BGP instance, instead of setting it only when FRR is a VTEP (i.e. advertise-all-vni enabled). It makes the whole behavior inconsistent between having a default instance and not having one.

Say, for example, we make a route-reflector out of FRR using only the non-default VRF. Such a setup does not need advertise-all-vni, and it would be detrimental as it would increase the processing done for every route passing through FRR. As a matter of fact, there are a lot of stuff that FRR tries to do when bm->bgp_evpn is not NULL, which will never succeed anyways if advertise-all-vni is not enabled.

The issue I describe in the French extract above is about log messages that Philippe also disables. The setup described above (route-reflector in a non-default vrf, with no default vrf instance) will log the bgpd[781] [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet message for every single route received by FRR. In production, I have EVPN route-reflectors that received over 250 EVPN routes/s, which translates to this log message flooding everywhere.

The proper behavior overall is to set this field when enabling the advertise-all-vni knob, and only then. This will save work overall because my setup shows that quite a bit of the EVPN code is guarded with if (bm->bgp_evpn), and that having it set to NULL is not detrimental if the speaker is not a VTEP. As a testament to the stability of such a change: overall I have more than 50 RRs in production with this setup, and they've been rock-solid.

@pguibert6WIND pguibert6WIND changed the title Bgp evpn testing related to non default EVPN backbone Bgp evpn testing and bug fixes related to non default EVPN backbone Apr 11, 2025
@pguibert6WIND
Copy link
Member Author

The description here says "testing", but ... there's a pretty fundamental change to bgp in here also? Has there been discussion about the possible impact of those bgp changes?

Changed the title from testing to "testing and bug fixing".

Copy link
Contributor

@mjstapp mjstapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to get more opinions about the change in bgpd.c, and about the change and logging changes in bgp_evpn.c

@pguibert6WIND
Copy link
Member Author

pguibert6WIND commented Apr 16, 2025

I think it'd be good to get more opinions about the change in bgpd.c, and about the change and logging changes in bgp_evpn.c

  • logging changes are not impacting as the user can choose to push its config in whatever order. Anyway, the config is saved.
  • similarly, because the configuration may change, it should be possible to create a default BGP instance which will not be the evpn backbone first. The only impact will be that the EVPN will not be enabled while evpn backbone is not found.

improved commit log information.

@pguibert6WIND pguibert6WIND force-pushed the bgp_evpn_source_vrf branch 3 times, most recently from f1868fc to a4e8ea2 Compare April 22, 2025 12:42
@pguibert6WIND
Copy link
Member Author

rebased

Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@pguibert6WIND pguibert6WIND force-pushed the bgp_evpn_source_vrf branch from 264193e to b40766b Compare May 7, 2025 15:01
@github-actions github-actions bot added size/XL and removed size/XXL labels May 7, 2025
@pguibert6WIND pguibert6WIND force-pushed the bgp_evpn_source_vrf branch 2 times, most recently from c6b5efa to 6c2f224 Compare May 9, 2025 07:45
@pguibert6WIND pguibert6WIND force-pushed the bgp_evpn_source_vrf branch from 6c2f224 to 7980eca Compare May 15, 2025 07:55
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@pguibert6WIND
Copy link
Member Author

ci:rerun

pguibert6WIND and others added 8 commits May 21, 2025 17:32
… vrf

Add a second test in bgp_evpn_rt5 folder that runs the same testing, but
by using a non default vrf as evpn backbone for R2.

Signed-off-by: Philippe Guibert <[email protected]>
The pylint code style needs to be executed.

Signed-off-by: Philippe Guibert <[email protected]>
Seen with bfd_vrf_topo1, and bgp_evpn_rt5 on Ubuntu 22.04 hwe.

Do not call ns_delete() from zebra_vrf_delete(), which calls
zebra_ns_delete().

- If a netns is removed from the system, vrf_delete()->zebra_vrf_delete()
  is called before calling ns_delete() (see zebra_ns_notify.c).
- If zebra is terminating, zebra_ns_final_shutdown() will call
  zebra_vrf_delete().

> ==616172==ERROR: AddressSanitizer: heap-use-after-free on address 0x6160000ae3a4 at pc 0x556cdc178d8f bp 0x7ffe4f41ace0 sp 0x7ffe4f41acd0
> READ of size 4 at 0x6160000ae3a4 thread T0
>     #0 0x556cdc178d8e in ctx_info_from_zns zebra/zebra_dplane.c:3394
>     FRRouting#1 0x556cdc178f55 in dplane_ctx_ns_init zebra/zebra_dplane.c:3410
>     FRRouting#2 0x556cdc17b829 in dplane_ctx_nexthop_init zebra/zebra_dplane.c:3759
>     FRRouting#3 0x556cdc18095f in dplane_nexthop_update_internal zebra/zebra_dplane.c:4566
>     FRRouting#4 0x556cdc1813f1 in dplane_nexthop_delete zebra/zebra_dplane.c:4793
>     FRRouting#5 0x556cdc229234 in zebra_nhg_uninstall_kernel zebra/zebra_nhg.c:3484
>     FRRouting#6 0x556cdc21f8fe in zebra_nhg_decrement_ref zebra/zebra_nhg.c:1804
>     FRRouting#7 0x556cdc24b05a in route_entry_update_nhe zebra/zebra_rib.c:456
>     FRRouting#8 0x556cdc255083 in rib_re_nhg_free zebra/zebra_rib.c:2633
>     FRRouting#9 0x556cdc25e3bb in rib_unlink zebra/zebra_rib.c:4049
>     FRRouting#10 0x556cdc24c9b0 in zebra_rtable_node_cleanup zebra/zebra_rib.c:903
>     FRRouting#11 0x7fb25c173144 in route_node_free lib/table.c:75
>     FRRouting#12 0x7fb25c17337f in route_table_free lib/table.c:111
>     FRRouting#13 0x7fb25c172fe4 in route_table_finish lib/table.c:46
>     FRRouting#14 0x556cdc266f62 in zebra_router_free_table zebra/zebra_router.c:191
>     FRRouting#15 0x556cdc2673ef in zebra_router_terminate zebra/zebra_router.c:243
>     FRRouting#16 0x556cdc10638b in zebra_finalize zebra/main.c:240
>     FRRouting#17 0x7fb25c18e012 in event_call lib/event.c:2019
>     FRRouting#18 0x7fb25c04afc6 in frr_run lib/libfrr.c:1247
>     FRRouting#19 0x556cdc106deb in main zebra/main.c:543
>     FRRouting#20 0x7fb25ba29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>     FRRouting#21 0x7fb25ba29e3f in __libc_start_main_impl ../csu/libc-start.c:392
>     FRRouting#22 0x556cdc0c7ed4 in _start (/usr/lib/frr/zebra+0x192ed4)
>
> 0x6160000ae3a4 is located 36 bytes inside of 592-byte region [0x6160000ae380,0x6160000ae5d0)
> freed by thread T0 here:
>     #0 0x7fb25c6b4537 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:127
>     FRRouting#1 0x7fb25c0790e3 in qfree lib/memory.c:131
>     FRRouting#2 0x556cdc22d9c9 in zebra_ns_delete zebra/zebra_ns.c:261
>     FRRouting#3 0x7fb25c0ac400 in ns_delete lib/netns_linux.c:319
>     FRRouting#4 0x556cdc28026a in zebra_vrf_delete zebra/zebra_vrf.c:343
>     FRRouting#5 0x7fb25c197443 in vrf_delete lib/vrf.c:282
>     FRRouting#6 0x7fb25c1987e8 in vrf_terminate_single lib/vrf.c:601
>     FRRouting#7 0x7fb25c197a7a in vrf_iterate lib/vrf.c:394
>     FRRouting#8 0x7fb25c198834 in vrf_terminate lib/vrf.c:609
>     FRRouting#9 0x556cdc106345 in zebra_finalize zebra/main.c:223
>     FRRouting#10 0x7fb25c18e012 in event_call lib/event.c:2019
>     FRRouting#11 0x7fb25c04afc6 in frr_run lib/libfrr.c:1247
>     FRRouting#12 0x556cdc106deb in main zebra/main.c:543
>     FRRouting#13 0x7fb25ba29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>
> previously allocated by thread T0 here:
>     #0 0x7fb25c6b4a57 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
>     FRRouting#1 0x7fb25c078f91 in qcalloc lib/memory.c:106
>     FRRouting#2 0x556cdc22d6a1 in zebra_ns_new zebra/zebra_ns.c:231
>     FRRouting#3 0x556cdc22e30b in zebra_ns_init zebra/zebra_ns.c:429
>     FRRouting#4 0x556cdc106cec in main zebra/main.c:480
>     FRRouting#5 0x7fb25ba29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>
> SUMMARY: AddressSanitizer: heap-use-after-free zebra/zebra_dplane.c:3394 in ctx_info_from_zns

Signed-off-by: Louis Scalbert <[email protected]>
Signed-off-by: Philippe Guibert <[email protected]>
It is not possible to configure an EVPN underlay in a VRF:

> ubuntu2204hwe(config)# router bgp 65000
> ubuntu2204hwe(config-router)#  bgp router-id 192.168.105.41
> ubuntu2204hwe(config-router)# exit
> ubuntu2204hwe(config)# router bgp 65000 vrf r2-vrf-evpn
> ubuntu2204hwe(config-router)#  neighbor 192.168.100.21 remote-as 65000
> ubuntu2204hwe(config-router)#  address-family l2vpn evpn
> ubuntu2204hwe(config-router-af)#   neighbor 192.168.100.21 activate
> ubuntu2204hwe(config-router-af)#   advertise-all-vni
> % Please unconfigure EVPN in VRF default
> ubuntu2204hwe(config-router-af)#

advertise-all-vni cannot set bm->bgp_evpn because it was already
set when the default instance was configured.

Remove setting bm->bgp_evpn at default instance creation. It is done anyway
when configuring advertise-all-vni by invoking evpn_set_advertise_all_vni()
-> bgp_set_evpn().

From now, bm->bgp_evpn is only modified by bgp_set_evpn(); inside, the
bgp_lock() and bgp_unlock() calls are done only when bgp_evpn pointer are
modified, and avoid multiple references.

Fixes: ("e2f3a930c54c") bgpd: Allow non-default instance to be EVPN one
Signed-off-by: Philippe Guibert <[email protected]>
When setting up a RR EVPN setup, the below error message is displayed
for each received BGP update. It becomes flooding when many BGP updates
are received.

> 2025/03/11 14:38:24.959177 BGP: [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet
> 2025/03/11 14:38:24.959188 BGP: [J51AF-GQ2HJ][EC 33554468] vrf import rt lookup - evpn instance not created yet

The message is just here to inform that there is no BGP backbone
defined, which is normal under a RR configuration. Setting up a RR
EVPN configuration should not lead to such message. Fix this by removing
the log messages.

Signed-off-by: Philippe Guibert <[email protected]>
Add a CE that advertises IP networksk to R2, which will import those
prefixes in the L3VPN EVPN network. The import command and the match
source-vrf commands are tested.

Signed-off-by: Philippe Guibert <[email protected]>
When running 'no advertise-all-vni' command on the evpn bgp instance,
then reconfiguring it is not possible:

> r2(config)# router bgp 65000 vrf vrf-evpn
> r2(config-router)# address-family l2vpn evpn
> r2(config-router-af)#
> r2(config-router-af)# no advertise-all-vni
> r2(config-router-af)# advertise-all-vni
> % Please unconfigure EVPN in VRF default
> r2(config-router-af)#

Actually, the evpn bgp instance is not the default one. Fix this by not
setting the bgp evpn instance to the default BGP instance, when removing
the 'advertise-all-vni' command.

Fixes: e2f3a93 ("bgpd: Allow non-default instance to be EVPN one")
Signed-off-by: Philippe Guibert <[email protected]>
Add a test that controls that it is possible to reconfigure a BGP EVPN
instance by using the advertise-all-vni command. Negating this command
will remove all the EVPN rib entries.

Signed-off-by: Philippe Guibert <[email protected]>
@pguibert6WIND pguibert6WIND force-pushed the bgp_evpn_source_vrf branch from ecd7b04 to 4f9995d Compare May 21, 2025 15:33
@frrbot frrbot bot added the staticd label May 21, 2025
@pguibert6WIND pguibert6WIND force-pushed the bgp_evpn_source_vrf branch from 4f9995d to f58a727 Compare May 21, 2025 15:33
Copy link
Contributor

@louis-6wind louis-6wind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Tuetuopay
Copy link
Contributor

LGTM too

@pguibert6WIND pguibert6WIND changed the title Bgp evpn testing and bug fixes related to non default EVPN backbone BGP evpn testing and bug fixes related to non default EVPN backbone Jun 4, 2025
Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ton31337 ton31337 added this to the 10.4 milestone Jun 5, 2025
@ton31337 ton31337 merged commit c33ff6a into FRRouting:master Jun 5, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants