IETF 112

Virtual meetings continue in the IETF, and the latest one was IETF 112 in November. Here’s my notes from some selected working group meetings that caught my attention. These cover some of the topics that are not directly associated with the DNS, as I’ve separately commented on the status of IETF work on the DNS. Here I’ll look at routing security, IPv6 and transport topics.<>/p>

BGP

Security – SIDROPS

There are a few implications from using X.509 certificates as the base of a PKI, and one is an inclination to continue to use the ASN.1 data description formulation. Which is fine except that interpreting ASN.1 is a dark art, and without resorting to a set of odd and entirely unmemorable incantations to invoke the OpenSSL package to perform the decryption of ASN.1 object then the choice is pretty limited. I haven’t yet tried the incantation openssl -eye_of_newt -toe_of_frog but it wouldn’t surprise me if it worked! Ben Maddison has developed a Python tool that performs ASN.1 decoder and data dumper for ASN.1 encoded objects. It’s a useful addition to the RPKI tool set. The code for this tool is on github. Good one Ben!

The session also heard a report from the IETF hackathon, working on an implementation based on the ASPA drafts (AS Provider Authorization) as an aid to detect some forms of AS Path manipulation and mitigate some forms of route leaks (yes, that not exactly a ringing endorsement, but it’s a complex area and even some levels of risk reduction is a Good Thing, so it’s actually better than this somewhat tempered description would suggest!). It’s hard to predict whether ASPA will gather deployment momentum at this stage. It’s my opinion that BGPSEC is never going to be deployed, so if we want even a rudimentary form of AS Path checking then the ASPA work is all that’s around. But the very slow progress with ASPA calls into question whether we are really interested in AS Path checking in the first place. All these technologies struggle with balancing increased operational complexity and fragility against the quantification of promised benefit, and while ASPA appears to have aimed at a less ambitious set of goals and supposedly done so with a simpler framework than BGPSEC, it’s still not clear that ASPA has managed to get over this critical level of deployment resistance.

BGP over QUIC

There was an interesting proposal in the Inter-Domain Routing Working Group (IDR) to use QUIC as the transport protocol for BGP. The motivation is not necessarily to protect the BGP session from eavesdroppers, although QUIC can do that, nor to protect the BGP session from potential hijacker and disruptors, although QUIIC can do that as well. QUIC can perform mutual endpoint identity verification as a part of its TLS functionality, although once more this is not the primary motivation for using QUIC. In this case the main feature of QUIC is the ability to multiplex multiple streams into one session and not encounter the TCP-related issues of head of line blocking and the ability to control individual logical streams without the coarse response of a full session reset.

Using QUIC each AFI/SAFI becomes its own QUIC bidirectional stream, with its own OPEN, update stream and shutdown control, using the BGP MultiStream capability signal on BGP session start. All other aspects of the BGP protocol are unaltered by this change of transport.

The details are in draft-chen-idr-bgp-over-quic-00.html. This looks like a Fine Thing to do for BGP. So let’s do it.

Dynamic Capability for BGP

Yes, the IETF can take its time to get some work through, but sometimes the process is so slow that it could be measured in the rings on tree trunks! This draft, draft-ietf-idr-dynamic-cap, was first submitted as a working group draft in August 2001. Yes, that’s twenty years and three months ago. Yes, it was in hibernation from 2011 to 2021, but in terms of content nothing much changed over that protracted hiatus.

The concept is simple. These days we are all loathe to tear down a BGP session as it causes unacceptable service interruptions. But capability negotiation only occurs in the OPEN message exchange, so it you want to change the session capabilities you need to tear down and restart the BGP session. The work defines a new BGP capability termed “Dynamic Capability”, which would allow the dynamic setting of capabilities over an established BGP session by using a Capabilities Message. This capability would facilitate non-disruptive capability changes by BGP speakers.

This seems like another Fine Thing to do. I have no idea why it’s taken two decades to bubble to the surface, as it’s just one more of those “Why weren’t we doing this from the start?” kind of ideas.

IPv6

There is continued interest in tweaking various parts of the IPv6 design. As well as source routing, at this IETF meeting we heard about proposals to add alternate marking and augmenting the Extension Header set to carry a VPN identification field. Neither of these proposals have gained much traction and engagement in the IPv6 working groups as far as I can tell, and perhaps this is a broader reflection on the lack of a common desire to make further changes to IPv6 at the moment. Further ornamentation of the IPv6 protocol and its packet header does not appear to address novel and otherwise unaddressed vital use cases, but to simply discuss at some considerable length alternate forms of packet ornamentation for their own sake.

It seems that the protocol is being torn between the conflicting desires of one part of this industry that wants to increase the scope and value of the network layer and another part that wants to commoditise and essentially freeze the network layer and focus on the application space as the vanguard of evolutionary change and value creation. The network is now at a size where it has accumulated considerable stasis at the lower layers, and not even SDN has been able to inject much in the way of flexibility back into the network. Most of these activities look to me to have little likely impact for many years to come, if any. But that has not stopped the IETF from continued IPv6 tweaking.

Source Routing

Some topics just never go away, and various forms of source routing is one of these perennial topics in networking research. In a hop-by-hop destination address based forwarding paradigm only the destination address is relevant to the network, and a case can be made that the role of the source of address in the IP packet header is a purely cosmetic one! Any form of reverse signalling from the network is inherently untrusted and with internal forms of packet encapsulation, including MPLS, the entire concept of what is being signalled and to whom when such a reverse signal occurs is called into question. There was a reason why ECN signalling is a forward signal!

Despite this, there have been persistent calls for some form of a source-address routing scheme where the intended path taken by the pack through the network is specified in part or fully by the packet at the point of entry into the network. I presume this is intended to address a perceived need for policy-based network segmentation for policy, traffic engineering or service quality reasons where the packet sender assumes some control over the network’s handling of the packet, or the network itself makes different forwarding decisions based on some assumption about the identity and authority that has been given to the packet’s sender.

The IPv4 packet options, including time stamping and loose and strict source routing, never found much traction outside of highly specialised environments, and they certainly play no role whatsoever in the public IPv4 Internet. The requirements for raw switching speed over all other considerations mandated a very simple packet processing flow where IPv4 packets loaded with such options were either quietly ignored or dropped without further trace.

The IPv6 specification moved these options into the so-called Extension Headers, which in the case of all bar fragmentation controls was a cosmetic relabelling of these IPv4 options. Again, the pragmatic observation is that Extension Headers appear to have a role only in specialised environments, for similar reasons including switching speed and security considerations. But such mundane considerations have never really stopped the IETF’s ability to ornament technology by adding generally pointless detail and complexity that, at best, has a limited role in highly specialised environments, if at all. Yes, if you are a hardware vendor flogging switching equipment, then packing a unit with all this functionality can certainly add to the supposed value being invested into the network infrastructure and provide some justification for upping the price of what otherwise would just be commodity silicon performing a simple utility switching function, but as far as I can see it’s a vapid value proposition in the larger scheme of things!

There is a broader tension at play here as well and it concerns where and how functions are implemented into the networking environment. There is a continuing pressure to add features and functions into the network level. From this perspective the network is the overall platform orchestration component, and the upper-level applications signal requests to the network for specific functions, such as path diversity, explicit path selection and similar. The other side of this tension is for the application to treat the network as a collection of essentially featureless ‘dumb’ pipes that perform destination based forwarding and nothing else. Functionality is loaded into the application and the application is meant to adapt to the network conditions as the application finds it at the time.

The SPRING effort has been the focus of much attention, particularly from the vendors of network equipment. It’s an option question as to whether this attention has resulted in a simpler framework that looks deployable, or whether the addition of more features, behaviours, and service responses adds complexity without much in the way of common benefit in the general use case. SPRING is complex enough that the issues discussed in IETF 112 cannot be readily summarised in a few paragraphs, so I won’t even attempt it here, but on the other hand it’s entirely unclear at this stage if SPRING is a useful tool for the larger Internet or a niche behaviour of limited applicability and relevance. Personally, I tend to the latter view!

IPv6 Fragmentation Reconsidered

One proposal was however quite fundamental in nature, and it was one that attracted little in the way of comment at the 6MAN working group session. This is the proposal to augment IPv6 to permit fragment retransmission.

This is a relatively novel concept for the IP architecture. IPv4 permitted routers to fragment a packet in flight, in order to adapt to the limitations of various networks through which an IP packet may have to pass. It was the task of the destination host to reassemble these IP fragments back into the original IP packet and then pass this reassembled packet payload to the upper transport layer. The destination host uses a reassembly timer, and if the timer expires before all the fragments arrive, then the fragments are discarded. The host may or may not respond with an ICMP fragment reassembly time exceeded message when it does so. At this point the sender needs to resend the entire packet, either in response to the ICMP message or upon expire of some sender timer, assuming that it has kept a copy of the packet and its willing to resend it.

This additional behaviour and the inefficiencies it introduced was part of the justification for the view that was solidifying in the late 1980’s that IP fragmentation should be avoided. This view was part of the reason why the IPv6 specification limited IP fragmentation to a source function. Routers should not be gratuitously fragmenting IPv6 packets. This new work (draft-templin-6man-fragrep) proposes to add a couple of additional fields to the IPv6 fragment header, and a new ICMPv6 message, a fragmentation report, that details the received fragments of an IPv6 packet. The result is that a destination can explicitly signal which fragments have not been received so that the sender can resend just those fragments.

I’m struggling for context here when I consider this in the light of transport protocol behaviours. In UDP, where there is no explicit acknowledgement of packet receipt, the sender immediately discards the packet once sent, so the receipt of a lost fragment report would make no difference to a UDP sender. The sender could hang on to sent UDP packets for some period, but without explicit knowledge of the network delay then the sender is left guessing as to how long this local retention period should be, and the additional requirement to retain sent packets may be a performance constraint for high-speed high-volume transactions. In TCP one wonders what the gain here might be as compared to the combination of path MTU discovery and selective acknowledgement (SACK). The existing TCP mechanisms appear to offer the necessary functionality without the need for new mechanisms associated with fragmented segments.

Extension Headers

The debate over IPv6 Extension Headers continues. There is an undeniable tension between network equipment that wants to perform a basic forwarding function in a minimum number of hardware cycles and proposals to allow IP packets to specify a more complex network response to the packet. The experience with IP options in IPv4 has enjoyed no lasting success in the public Internet. Why should IPv6 be any different?

We’ve gone through a number of iterations in this process. Firstly, there have been efforts to deprecate the Fragmentation Extension Header, which founded. Then there was the notion that we should just avoid using all Extension Headers as they are not well supported on the public Internet. The latest conversation on this topic is narrowing in on the Hop-by-Hop extension headers and advocating deprecating just these Extension Headers. After all, these are the options that expose the network’s infrastructure to potential resource exhaustion attacks. Deprecating these headers on the basis that there seems to be little of value in such HBH options and much that could be used to attack the network appears to make sense. Right? Well, no! Because there is always someone who is using these options and wants to continue to do so. It might not be the public Internet, but the protocol specification is meant to be general purpose, not just for one realm of use.

I must admit here that erring on the side of extreme simplicity and highly limited option set (even to the extent of no options at all!) appeals to my sense of a robust minimal design architecture. Yes, there are then things that the IP layer would find it hard to perform without explicit IP options and signals, but that’s why upper layers exist. Hanging on to cruft in specifications just in case someone might want to use it someday seems to me to be a pretty retrograde position that results in more complex and less robust service outcomes for our networks.

Transport

L4S

The work on standardising Low Latency Low Loss Scalable throughput (L4S) services continues in the Transport Services Working Group and three documents have been through Working Group Last Call. Part of the effort is to perform some re-definition of the Explicit Congestion Notification signal (RFC3168) so that the ECN signal is generated at the start of queue formation rather than at the point where congestion loss is imminent (Figure 1). It’s a slow standardization path for a high-speed architecture, and the working documents have now attained a five-year maturity level in the working group. Unlike BBR, which is a sender-side only approach the L4S approach is attempting to change sender and receiver behaviour as well as change ECN-aware network behaviour. It’s a much more ambitious agenda when everybody needs to make changes. Our track record of making such widespread changes is not exactly impressive in recent years, and I suspect that this effort is on a similar path of deployment failure, irrespective of the obvious merits of a faster network.


Figure 1 – L4S ECN signalling

MP-DCCP using BBRM

DCCP was a hybrid approach of attempting to combine a TCP-like congestion control algorithm with a UDP-based data flow. The anticipated benefit was that volume based USP flows could coexist in a shared network with concurrent TCP flows without collateral damage to either class of traffic. This work, presented in the Internet Congestion Control Research Group (ICCRG) session attempts to take Multi-Path DCCP and use the BBR flow control algorithm as the congestion control algorithm.

It’s an interesting approach and the early results have been promising with lower latency and good traffic distribution across multiple paths according to the individual path capacities. Compared to TCP BBR, the phase of BBR where the sender probes the path capacity and then adjusts its sending rate is not well coupled in DCCP and the probe phase results in a delayed low throughput interval. Changing the post-probe recovery rate for DCCP appeared to address the issue and restore some parity between TCP BBR and MP-DCCP BBR performance.

Of course the obvious question here is: why not use a variant of QUIC without sender retransmission?

The Game Theory of Congestion Control

I was impressed by the performance of BBR when I looked at this control algorithm some years ago, and I wasn’t the only one. Evidently many services now have switched to use BBR, and the current metric is that close to 18% of the Alexa top 20,000 websites are now using BBR. With the introduction of CUBIC in the 2000’s many sites switched to CUBIC for improved performance, and we are now seeing a similar transition to use BBR. The question posed in this presentation to the ICCRG session was to speculate on the next steps.

There are some starting assumptions here, namely that a small number of BBR flows will often achieve higher than a fair share flow rate when competing against CUBIC flows. However, when there are short CUBIC flows in a larger environment of BBR flows, then CUBIC will perform well because of its propensity to use the network queues to improve its throughput.

To the researchers at the University of Singapore this looked like a situation that could be modelled as a Nash Equilibrium and their conclusion was that we will continue to see a mix of CUBIC and BBR in the Internet. If there is a Nash Equilibrium point then at such a point of maximal individual optimality, then if any current CUBIC flow switched to use BBR then all the BBR flows will perform marginally worse than before, and similarly if any BBR flow switched to use CUBIC than all CUBIC flows will fare worse as a consequence.

BBRv2

Work on BBR continues and BBRv2 is now reported to carry all internal production TCP traffic within Google. BBrv2 incorporates ECN signals, loss, bandwidth estimation and Round Trip Time (RTT) measurements to guide the control adjustments. However, there is a subtle caveat to this observation as some of this traffic is now being carried using a variant, they’re calling BBRv2.Swift. This variant includes a slightly different RTT measurement, namely Network_RTT, that excludes receiver-side packet delays. They see this as useful in intra-datacentre environments (which I guess is much of Google’s internal environment).

Their external traffic, which has a large YouTube component, is currently transitioning to BBRv2. The A/B experiments that are being conducted point to lower RTTs for BBRv2 as compared to both BBRv1 and CUBIC, and a reduced packet loss rate that is closer to CUBIC’s levels. Given the observations relating to Game Theory and Nash Equilibrium it’s hard to tell if the dominant factor here is the interaction with other BBR sessions, or the interaction between CUBIC and BBR sessions and the radically different responses of how to cope with network queue contention. I also assume that the ECN behaviours that they assume in BBRv2 is imminent loss signalling as described in RFC3168, and not the queuing onset signalling being defined in the L4S effort. Were the ECN signalling behaviour to change then it looks like BBR would also need to change its interpretation of the signal.

BBRv2 has a different flow profile than the original BBR and rather than a regular pulse and drain behaviour every eighth RTT it performs separate probes for path capacity and RTT, as shown in Figure 2.


Figure 2 – Example Idealized Data Flow in BBRv2 (from BBRv2 presentation at IETF 112

For those wanting to play along at home, BBRv2 is now available as an alpha release.

There is also work on tuning BBRv2 as a QUIC control algorithm. It has involved some issues with a low initial estimate of the bandwidth delay product of the path when packet loss occurs in startup due to a limited congestion window value (cwnd) in the QUIC client. The PROBE-UP phase can exit early due to network queue build up. The fix is to maintain the PrOBE_UP state for at least one minimum RTT and to reach the point where the data in flight in this round is 25% greater than the estimated Bandwidth Delay Product (BDP) of the network path. In other words, any data still in flight from the prior BBR state is not counted in the PROBE-UP interval. Finally, there is a fix for intermittent data flows at source that pushes BBR into an idle state. If this occurs during the PROBE_RTT phase the resumption of data would continue in the suspended RPOBE_RTT state. The fix is to exist this reduced transmission state if the elapsed time is long enough even if no further data has been sent.

TCP Congestion Control

Despite any impressions to the contrary that you might have picked up from this report so far, BBR is not the only show in town, and there is much work still taking place in loss-based TCP congestion control algorithms as well. The material presented in the TCP Maintenance and Minor Extensions (TCPM) is a good window on these related activities.

I’ll avoid going into detail here, as there is much to consider when you lift the lid on congestion control algorithms, but instead provide an overview of a couple of the topics considered in the TCPM Working Group session.

The first topic is revisiting RFC6937, Proportional Rate Reduction (PRR) an experimental specification that proposed replacing the fast recovery and rate halving algorithm. The intention is that PRR accurately regulates the actual flight size through recovery such that at the end of recovery it will be as close as possible to the slow start threshold (ssthresh) value expressed as a segment count, as determined by the congestion control algorithm. The major intention of this revision is to shift this specification from the Experimental to Proposed Standards track, but there are some other details being addressed at the same time. Fast Recovery is the reference algorithm for the sender adjusting its sending rate in response to a signal of packet loss. Its goal is to recover TCP’s self-clocking by relying on returning ACKs during recovery to clock more data into the network. The Strong Packet Conservation Bound on which PRR and both Reduction Bounds are based is patterned after Van Jacobson’s Packet Conservation principle: segments delivered to the receiver are signaled through the ACK stream and this signal is interpreted as the clock to trigger sending the same number of new segments into the network as have left the network. (It should be noted that this algorithm uses segment counting and not byte counting. Yes, that’s an important distinction!) As much as possible, PRR and the Reduction Bound algorithms rely on this self-clocking process and are only slightly affected by the accuracy of other estimators, such as pipe [RFC6675] and congestion window management (cwnd). This is what gives the algorithm its precision in the presence of events that cause uncertainty in other estimators.

CUBIC is a standard TCP congestion control algorithm that uses a cubic function instead of the linear window increase function on the sender side to improve scalability and stability over fast and long- distance networks. It has been in use for more than a decade and has been adopted as the default TCP congestion control algorithm in a number of platforms including Apple, Linux and Windows systems.

CUBIC is designed to behave in a manner similar to TCP Reno in smaller networks (networks with a small Bandwidth-Delay product and networks with a small RTT). CUBIC uses a more aggressive window inflation function for fast and long-distance networks, addressing a critical weakness in the Reno algorithm. The recent changes in this work include the recommendation to use HyStart++ to avoid large scale overshoot in slow start, the possible use of Proportional Rate Reduction in response to packet loss, and the use of the number of segments in flight (FlightSize) instead of the sender’s congestion window (cwnd) to recover after a congestion event, unless the segments in flight has been limited by the application.

IETF 112 Materials

More information on the agendas of the working group sessions, the presentation materials, working group minutes and pointers to recordings of the session can be found at the IETF meeting agenda site. Obviously, there is much I have not touched upon in this brief summary of some of the sessions that were held during the meeting.

IETF 113 is scheduled for 19 – 25 March 2022. The latest update on whether it will be fully online meeting or a mixed mode meeting is planning for a mixed mode meeting. However, It seems that the situation is still somewhat uncertain in the light of further waves of COVID-19 infections that are sweeping the world.