BGP handling of obscure errors
I read Ben Cartwright-Cox's (extensive) blog post Grave flaws in BGP Error handling and then watched his NLNOG talk about the same topic on YouTube.
The necessary background
In addition to the "well-known" BGP path attributes that we all know (because the RFC says we must) and love (because they make the internet work), it's also possible to define new attributes to provide new functionality. These can be "transitive" attributes, which means that a BGP router that doesn't recognize them propagates them to its BGP neighbors unchanged.
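As a rough illustration of what "transitive" means on the wire: every path attribute starts with a flags octet, and RFC 4271 defines bits for optional, transitive, partial and extended length. The sketch below shows how a router might decide what to do with an attribute it doesn't recognize. The function name `handle_unrecognized` and the action strings are made up for this example and are not taken from any real implementation.

```python
# Attribute flag bits in the first octet of a BGP path attribute (RFC 4271).
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20
EXTENDED_LENGTH = 0x10


def handle_unrecognized(flags: int) -> tuple[str, int]:
    """Return (action, outgoing_flags) for an attribute this router doesn't know.

    Hypothetical helper for illustration only.
    """
    if not flags & OPTIONAL:
        # Well-known attributes must be recognized, so this is an error.
        return ("error", flags)
    if flags & TRANSITIVE:
        # Keep the attribute value as is and pass it on, but set the
        # Partial bit to record that some router along the way didn't
        # understand it.
        return ("propagate", flags | PARTIAL)
    # Optional non-transitive: quietly ignore it and don't pass it on.
    return ("ignore", flags)


print(handle_unrecognized(OPTIONAL | TRANSITIVE))
print(handle_unrecognized(OPTIONAL))
```

Note that the router in the middle never looks inside the attribute value, which is exactly why a malformed attribute can travel a long way before anyone notices.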
The ability to create new optional transitive attributes has allowed us to run BGP version 4 for three decades without having to bump the version number because we had to make backward-incompatible changes that would make adoption all but impossible.
For instance, 32-bit autonomous system numbers were added as the 16-bit BGP AS numbers started to run out. In addition to the well-known (mandatory) 16-bit AS path, an optional 32-bit AS path was added. If a router in the middle didn't understand the 32-bit AS path, it would update the 16-bit AS path and propagate the 32-bit AS path unchanged.
The next 32-bit capable BGP router can then add back the AS numbers from the 16-bit path that are missing from the 32-bit path, and 32-bit AS numbers work even if routers in the middle don't understand them. (They just see "23456".)
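The merge step can be sketched roughly as follows. This is a simplified model that treats both paths as flat lists of AS numbers and ignores AS sets and confederations. The function names and the AS numbers (taken from the documentation ranges) are assumptions for the example.

```python
AS_TRANS = 23456  # placeholder that stands in for a 32-bit AS number


def old_speaker_prepend(as_path, as4_path, own_as):
    """A 16-bit-only router prepends itself to the 16-bit path and
    passes the 32-bit path along untouched."""
    return [own_as] + as_path, as4_path


def reconstruct(as_path, as4_path):
    """Rebuild the full path on a 32-bit capable router (simplified)."""
    if len(as_path) < len(as4_path):
        # The two paths are inconsistent, so only the 16-bit path is used.
        return list(as_path)
    # The leading entries of the 16-bit path were added by routers that
    # didn't touch the 32-bit path; put them in front of the 32-bit path.
    missing = len(as_path) - len(as4_path)
    return as_path[:missing] + as4_path


# AS 65550 (32-bit) originates a route towards a 16-bit-only neighbor.
as_path, as4_path = [AS_TRANS], [65550]
# AS 64500 only speaks 16-bit and prepends itself.
as_path, as4_path = old_speaker_prepend(as_path, as4_path, 64500)
print(as_path)                         # what the old router sees
print(reconstruct(as_path, as4_path))  # what a 32-bit router recovers
```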
Of course you can read all about BGP attributes in my book Internet Routing with BGP. It's even in the sample chapters! (Page 15.)
The error handling issue
In his blog post, Ben describes how a Brazilian network included a malformed version of a still-experimental attribute in its BGP updates. The big routers in the core of the internet don't run experiments, so they just saw an attribute they didn't recognize and propagated it, as per the transitive setting. Eventually, BGP updates with the broken attribute arrived at routers that did understand the attribute, and those routers saw that it was broken.
So, as per the original BGP spec, they tore down the BGP session towards the router that sent them the broken attribute. Then, after a short delay, they tried to set up a new BGP session towards that neighboring router, only to encounter the same error and tear down the BGP session again. And so on.
Which is probably not what you want. This is nicely explained in RFC 7606, published in 2015, which suggests treating such errors as if the neighboring router had asked to withdraw the route containing the offending path attribute. So if a neighbor tells me prefixes 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 are reachable through them, and 172.16.0.0/12 has a broken attribute, I just act as if my neighbor had told me that 172.16.0.0/12 is not reachable through them. But I don't bring down the BGP session, so 10.0.0.0/8 and 192.168.0.0/16 remain reachable through the neighbor in question.
Ben seems to be rather annoyed that many router vendors don't implement the RFC 7606 behavior, implement it but don't enable it by default, and/or don't have a bug bounty program to reward security researchers for pointing out these deficiencies. He spent a good amount of time evaluating different implementations and then "fuzzing" attributes to see what would happen, so that's somewhat understandable. Here is his score card from his presentation slides:
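To make the difference between the two behaviors concrete, here is a toy model that uses the three prefixes from the example. It is a sketch rather than a faithful BGP implementation: each update is reduced to a prefix plus a flag saying whether its attributes parsed correctly, and `apply_updates` is a name invented for this illustration.

```python
def apply_updates(updates, treat_as_withdraw):
    """Return the set of prefixes still reachable through one neighbor.

    updates: list of (prefix, attributes_ok) tuples, in arrival order.
    treat_as_withdraw: True for RFC 7606 style handling, False for the
    original "tear down the session" behavior.
    """
    routes = set()
    for prefix, attributes_ok in updates:
        if attributes_ok:
            routes.add(prefix)
        elif treat_as_withdraw:
            # Act as if the neighbor withdrew just this prefix.
            routes.discard(prefix)
        else:
            # Session reset: everything learned from this neighbor is
            # lost, and the same update will break the next session too.
            return set()
    return routes


updates = [
    ("10.0.0.0/8", True),
    ("172.16.0.0/12", False),  # carries the broken attribute
    ("192.168.0.0/16", True),
]
print(sorted(apply_updates(updates, treat_as_withdraw=True)))
print(sorted(apply_updates(updates, treat_as_withdraw=False)))
```

With treat-as-withdraw, the two healthy prefixes survive. With the original behavior, the neighbor ends up contributing nothing at all for as long as the broken update keeps arriving.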
My take
I agree that the RFC 7606 handling by default is what you want. I also agree that changing a default here, something router vendors are loath to do, shouldn't be problematic.
However, these are pretty obscure errors. This is not an internet extinction level issue.
For my own network, I would strongly prefer a mechanism to turn off handling of these often rather frivolous new attributes, both to avoid being bitten by buggy implementations elsewhere and to avoid inflating BGP messages. As BGP updates propagate, the AS paths (the 16- and 32-bit versions) increase in length, so an update that starts out just under the limit will exceed the maximum size of 4096 bytes a few hops later, and then bad things will definitely happen.
However, it's important that new transitive attributes aren't filtered out wholesale, as that would make it impossible to add new features to BGP. I'm not sure if there is a workable way to put a stop to frivolous BGP path attributes being injected into the global routing system while at the same time not robbing BGP of its forward compatibility with future innovations.
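The arithmetic behind that worry is simple enough to sketch. The 4096-byte maximum is from the BGP spec. The rest is an assumption for the sake of the example: that each AS hop adds exactly one AS number (4 bytes in a 32-bit AS path, 2 bytes in a 16-bit one) and nothing else in the update changes. The function name is made up.

```python
MAX_BGP_MESSAGE = 4096  # maximum BGP message size in bytes


def hops_until_too_big(current_size, bytes_per_hop=4):
    """Number of further AS hops after which the update no longer fits.

    Simplified: assumes every hop adds exactly bytes_per_hop bytes.
    """
    headroom = MAX_BGP_MESSAGE - current_size
    return headroom // bytes_per_hop + 1


# An update padded out to 4090 bytes by large optional attributes
# leaves only 6 bytes of room for the AS path to grow.
print(hops_until_too_big(4090))
```

So an update that a network sends out just under the limit is fine locally, but becomes somebody else's problem only a hop or two downstream.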
Posted 2023-10-02