Analyzing the KSK Roll

It’s been more than two weeks since the roll of the Key Signing Key (KSK) of the root zone on October 11 2018, and it’s time to look at the data to see what we can learn from the first roll of the root zone’s KSK.

There are a number of reports that have been published, including one from the Root Canary work (http://bit.ly/2PttyHW). This report contains an informative time series plot looking at the Atlas Probes and their view of the KSK RRSIGs (Figure 1). It shows the 48-hour TTL in action, where old RRSIG value of the root zone DNSKEY RRset declines over the 48 hours following the roll, and the corresponding uptake of the new RRSIG value, signed by the incoming key. The SIDN labs report noted that “We did not detect any major issues with resolvers whatsoever.”

Figure 1 – KSK RRSIG values following the Root Zone KSK Roll (SIDN Labs)

The KSK was originally scheduled to roll on October 11th, 2017. The procedure was halted because of the initial analysis of trust anchor data provided by the mechanism defined in RFC 8145. A plot of all of this RFC 8145 data spanning the period from 1 September 2017 until late October 2018 is shown in Figure 2. In September 2017 the small number of reporting resolvers indicated that some 6%-8% of visible resolvers were reporting that that they trusted the old KSK but not the new KSK. As the number of reporting resolvers increased over the ensuring 14 months the percentage of reporting resolvers that were indicating that they remained exclusively locked onto the old KSK rose to 20% of all reporting sources. This number only declined in May 2018. By the 9th October this number had declined to 5%, but oddly enough it rose by 2% on the 11th October at the time of the KSK roll. At the end of October this number is still at 4% of all sources still reporting that they do not trust the new KSK.

Figure 2 – RFC 8185 Trust Anchor Report data (ICANN) (Retrieved 28th October)

So far, we have two data sets, based on Atlas probes and RFC 8145 reports and these two data sets point to very different outcomes for the KSK roll. The indirection of the relationship to reporting sources and measured impact to users points to some interpretation challenges with the RFC 8145 data when attempting to access user impact. The lack of third party reported outages tends to support the SIDN report of “no significant outage.” As noted on the ICANN blog: “ICANN has heard of only two Internet Service Providers (ISPs) who experienced outages around the time of the rollover and who might have been negatively affected by the rollover, but we have not been able to investigate the root cause of their problems yet.” But perhaps this claim deserves further investigation using additional data sources.

At APNIC Labs we were also performing a measurement of the DNS across the KSK roll. Here I’ll look at our measurements and the results we have gathered. Obviously, we are interested in assessing whether our predictions matched what we observed during the roll.

APNIC Labs Measurement

The measurement technique we used was the use of end-user DNS queries embedded in online advertisement. We observed some 4M – 5M ad impressions per day (Table 1).

Date	Measurements
08/10/2018	5,091,293
09/10/2018	5,214,245
10/10/2018	5,322,040
11/10/2018	5,197,238
12/10/2018	5,163,504
13/10/2018	4,881,168
14/10/2018	4,726,317
15/10/2018	5,313,759
16/10/2018	5,256,944
17/10/2018	5,561,328
18/10/2018	5,981,700

Table 1 – Measurements per Day

We used two measurement approaches. The first, a key sentinel measurement, was a detailed analysis of resolver behaviour using a recently defined resolver mechanism that is intended to reveal in the resolver’s responses the trust status of a root zone key for the collective set of resolvers that each user is configured with, and the second was a count of the number of end users who are located behind DNSSEC-validating resolvers. The first measurement is a predictive measurement to attempt to answer the question of what will happen, while the second can be used to estimate the extent of any impact of the KSK roll after the event to answer the question of what happened.

Measuring Resolver Trust State using Key Sentinel Queries

In this measurement exercise we use six individual DNS queries, all with a unique label component to circumvent DNS caching and to ensure that the DNS queries are answered by the experiment’s authoritative DNS servers.

Unsigned DNS label
Validly signed DNS label
Invalidly-signed DNS label
Test KSK – not-KSK2010 (root-key-sentinel-not-ta-19036)
Test KSK – is-KSK2010 (root-key-sentinel-is-ta-19036)
Test KSK – is-KSK2017 (root-key-sentinel-is-ta-20326)

The APNIC Ad-based measurement system is a highly constrained environment. The script that is executed by the user cannot provide a direct way to measure what response the user received as a result of a DNS query. In this case we used the subsequent fetch of a small web object (a 1×1 pixel undisplayed image file) as an indication that the DNS resolution succeeded.

The first three queries are standard DNSSEC-validation capability queries, while the second group of three queries test the resolver trust status. This test uses a special; template for the left most label of a DNSSEC-signed DNS name to be resolved. If the resolver is unaware of the special processing for this left-most label, or if the resolver is not performing DNSSEC validation, or if the query type is neither A nor AAAA, then the query should be handled by the resolver like any other, without any special processing. Otherwise the resolver will process these queries as follows:

For query 4, if a DNSSEC validating resolver is aware of the root-key-sentinel label processing specification then the resolver will return the validated response only if the key with the hash tag value of 19036 is not a locally trusted key for the root zone. This key tag value corresponds to KSK-2010, the old key. Otherwise the resolver will return SERVFAIL.
For DNS query 5, if a DNSSEC validating resolver is aware of the root-key-sentinel label processing specification then the resolver will return the validated response only if the key with the hash tags value of 19036 is a locally trusted key for the root zone. Otherwise the resolver will return SERVFAIL.
For DNS query 6, if a DNSSEC validating resolver is aware of the root-key-sentinel label processing specification then the resolver will return the validated response only if the key with the hash tags value of 20326 is a locally trusted key for the root zone. Otherwise the resolver will return SERVFAIL.

Details of this key sentinel are in the closing stages of publication as an RFC. The working documents can be found at: https://tools.ietf.org/html/draft-ietf-dnsop-kskroll-sentinel-17.

Categorising Observed Behaviours

Let’s look at the anticipated results which looking at a number of user scenarios. We need to remember that many users use DNS configurations with more than one DNS resolver. A SERVFAIL response from a resolver, which occurs when a validating resolver fails to validate a signed DNS response, will cause the user to repeat the query to the next resolver in their local list, so the states below correspond to the state of the user’s DNS resolution environment irrespective of the number of resolvers that each end-user system has included in its local configuration.

Not-Validating – At least one of the user’s resolvers does not perform DNSSEC validation

In this case we would expect the user to successfully resolve all 6 domain names.
Not-Recognised – All of the user’s resolvers perform DNSSEC validation, and at least one resolver does not recognise the key sentinel label

In this case we would expect the user to successfully resolve URLs 1, 2, 4, 5 and 6. Only URL 3 should be unable to be resolved, as DNSSEC validating resolvers should not resolve this domain name.
Ready – All of the user’s resolvers perform DNSSEC validation, all recognise the key sentinel label, and at least one has loaded KSK-2017

In this case we would expect the user to successfully resolve URLs 1, 2, 5 and 6. We would expect all validating resolvers to have KSK-2010 as a trusted key, so all resolvers should return SERVFAIL for URL 4, and as at least one resolver has loaded KSK-2017, then the user should be able to resolve URL 5.
Not-Ready – All of the user’s resolvers perform DNSSEC validation, all recognise the key sentinel label, and none have loaded KSK-2017

In this case we would expect the user to successfully resolve URLs 1, 2, and 6.

We use web logs to show if the user has managed to resolve a DNS name, by inferring success when the corresponding web object is retrieved.

In almost all cases the script will be used in an environment of using HTTPS to retrieve the web object, and a successful retrieval requires that the web server presents a valid TLS certificate to the user script. As the key sentinel label is a fixed label in the left-most position in the DNS name and the unique name part is in the penultimate label, is not easy to manufacture a valid TLS certificate for the root key sentinel tests. Here we used the Server Name Indication field in the TLS handshake as an adequate confirmation that the user attempted to download the web object.

We can distinguish between Not-Validating and all other cases by ensuring that in all other cases the user’s resolvers have been observed to query for the DNSKEY and DS RRsets that are consistent with DNSSEC validation for all five DNSSEC-signed DNS names.

Measurement Results

The results are shown in Table 2.

Date	Total	Not-Validating	Not-Recognised	Ready	Not-Ready
8/10/18	5,094,293	4,416,890	653,350	22,805	1,248
9/10/18	5,214,245	4,477,571	711,704	24,151	819
10/10/18	5,322,040	4,534,814	760,679	25,600	947
11/10/18	5,197,238	4,446,980	724,735	24,571	952
12/10/18	5,163,504	4,417,264	720,315	25,260	665
13/10/18	4,881,168	4,164,539	691,373	24,750	506
14/10/18	4,726,317	4,084,658	618,566	22,473	620
15/10/18	5,313,759	4,534,385	759,077	19,898	399
16/10/18	5,256,944	4,491,417	743,380	21,718	429
17/10/18	5,561,328	4,729,815	804,913	26,241	359
18/10/18	5,981,700	5,066,865	883,557	31,012	266

Table 2 – Key Sentinel Measurements per Day

Let’s split the results into three sections based on the KSK roll timing:

Before: refers to the three-day period from the start of the 8th October to the end of the 10th October (all times are in UTC).
During: refers to the five-day period from the start of the 11th October to the end of the 15th October. During this five-day period DNSSEC-validating resolvers were in the process of aging the root zone DNSKEY record from their local caches. Trust in the incoming root zone DNSKEY record relies on the resolver having trust in KSK-2017 once the cache entry expired and the resolver refreshed its cache by querying the root zone service system.
After: refers to the three-day period from the start of the 16th October to the end of the 18th October.

The average measurements for each of these categories is shown in in Table 3.

	Not-Validating	Not-Recognised	Ready	Not-Ready
Before	85.917%	13.600%	0.464%	0.019%
During	85.625%	13.899%	0.463%	0.012%
After	85.048%	14.475%	0.470%	0.006%

Table 3 – Proportional Key Sentinel Measurements per Day

What does the theory predict? After the KSK roll has completed no user should be reporting results that indicate they are in Not-Ready. Any user that sits behind a set of DNSSEC-validating resolvers that only trust KSK-2010 will have no DNS resolution service after the KSK roll and will be invisible to this particular measurement system. The residual levels of users reporting Not-Ready in the “After” section is part of the noise component of the experiment. This suggests that the level of uncertainty in measuring Cases C and D is ± 0.01%.

The next point to note is that there are relatively few resolvers that have implemented this key sentinel mechanism. Of the approximately 14% of users who sit behind DNSSEC-validating resolvers only a little over 3% of these users are using DNS resolvers that recognised the key sentinel mechanism at the time of the KSK roll. We are stretching the limits of experimental uncertainty here when the signal of the trusted key status is only visible to users at a rate of less than 5 per thousand across the entire Internet.

It is possible that a small number of resolvers may have stopped performing DNSSEC validation during the KSK roll. Comparing the Before and After numbers in Not-Validating then the number of users behind resolvers that do not all perform DNSSEC validation has risen by 0.9%. If this is indeed the case, then presumably this is due to a number of resolvers switching off DNSSEC validation during the KSK roll. The number of users behind resolvers who appeared to be ready for the KSK roll have increased very slightly, though it is hard to ascribe much significance to an improvement at a level of 0.006% when comparing the Before and After measurements in this form of experiment.

Of the users who are using resolvers that report their key status, the relative number of users who were reporting that they trusted KSK-2017 rose from 96% to 99%. A DNSSEC-validating resolver that only trusts KSK-2010 will be unable to answer any queries once its cached value of the old root zone DNSKEY record (signed by KSK-2010) has expired. This implies that the Not-Ready measurements in the After period are an artefact of experimental noise and does not provide any tangible evidence of recalcitrant resolvers. The After measurements point to the observation that at least 1.3% of those users who appear to sit behind key sentinel aware resolvers are receiving a noise signal in the key sentinel test.

The issue with these numbers is that they are limited to looking at users who sit behind DNS resolvers that were updated to include the key sentinel reporting mechanism. As this specification was only stabilised in mid-2018 we are looking at only a set of resolvers that are actively managed by sys admins who are happy to run with the most recent software updates. For many production environments this is not the case, and the software that is deployed in production environments is often deliberately positioned one or two releases behind the current release version in order to maximise stability of the production platform. The resolvers that may not be tracking the KSK roll are resolvers that are not managed so assiduously, and it is unlikely that these resolvers would be running the KSK sentinel mechanism.

This is a predictive exercise and was useful to some extent in predicting an outcome of the KSK roll. However, due to its limited deployment, it is not very useful as a tool to assess the impact of the KSK roll. Let’s look at the other measurement series, namely the measurement of DNSSEC validation capability, and see if this can shed any light on the KSK roll status.

Measuring DNSSEC Capability

Table 3 points to an interesting result that appears to be the opposite of expectations, namely that the number of users behind DNSSEC-validating resolvers appeared to increase by almost 1% when comparing the Before and After categories.

What’s going on?

Perhaps we can start with one widely reported case of KSK-roll issues, which was reported from the Irish ISP EIR. While the exact nature of the outage was not reported at the time, the timing of the outage and the nature of the issue, namely a DNS problem, points to a KSK-related problem (Figure 3).

Figure 3 – Reports of EIR DNS outage

The DNSSEC test data for the same period for EIR’s AS, AS5466, is shown in Figure 4.

I should note that the scale in Figure 4 is different from the corresponding values in Table 4. Figure 4 shows the “raw” counts as seen by the analysis of ad presentations. The underlying Ad presentation network does not present these ads in a uniform manner across the entire Internet, and the ad placement program tends to over sample in some cases and under sample in others. We use the published figures of Internet users per country to perform a subsequent weighting of these ad presentation numbers in order to get closer to a weighted number that is comparable across countries and across networks. This weighted value is used in Table 4.

Figure 4 – EIR (ASN 5466) DNSSEC data

When a DNSSEC-validating resolver has the wrong trust anchor then it is unable to resolve any name, whether or not the name is DNSSEC-signed. This means that any users behind such a resolver effectively have no DNS service which implies that they have a very limited Internet service, including the ability to receive Ads. As shown in Figure 4, EIR received less than 50% of the normal ad volume on the 13th October. Were others affected as well? Figure 5 shows the total sample count for the period around the KSK roll.

Figure 5 – Ad Sample Count data

There is a dip in the ad count in the period of the 13th October to the 14th October in this data, showing a 12% decline in ad presentations in this period. While it would be highly presumptive to attribute all of this 12% drop in ad presentations to the KSK roll, as the underlying ad presentation rate often varies by a similar amount from day to day, it may be the case that the KSK roll has had some impact here. How can we identify possible networks where this may have been the case?

If we take AS5466 as an example, then we can design a filter to look for impacted networks. In this case we will look for candidate networks which have an average weighted sample count of at least 400 samples per day in the three-day period 9 – 11 October, and where the count DNSSEC-validating users in that network is at least 30% of the sample count. In other words, in setting these values we are looking for networks that have a reasonable sample count so that the noise component can be contained, and a sufficiently high DNSSEC validation rate that implies that we are likely to be looking at networks where validation is provided by the ISP rather than by individual users redirecting their queries to some other DNS resolution service.

There are 233 such candidate networks (by unique AS number) that meet these criteria, out of a total of 42,732 that are seen within the overall ad placement framework over this period. These 197 networks cover some 9.8% of the total ad placement volume, so that filter covers a significant pool of the Internet’s user population.

Networks that are classified as “impacted” by the KSK roll were seen as having a drop of DNSSEC-validating users on the 11th or 12th October. The criteria used here is a decline by a minimum of 33% of validating users when compared to the average of the three days immediately prior to the KSK roll. There are 35 networks that were seen to have experienced this drop, and these networks serve some 0.5% of the total seen user population. The total drop in seen users in the 12th and 13th October was some 46% from within this set of 35 networks, which corresponds to an impact level of some 0.24% of all users.

The list of these networks is shown in Table 4. It is noted that we have no direct way of confirming if the dip in visible users in these networks was due to DNS issues associated with the KSK roll or not, but it does provide a broader view of the possible scope of impact of the KSK roll.

Rank	AS	CC	Seen			Validating			AS Name
			Before	During	After	Before	During	After
1	AS2018	ZA	1,858	1,122	1,473	694	220	288	TENET, South Africa
2	AS10396	PR	1,789	1,673	1,988	1,647	276	33	COQUI-NET – DATACOM CARIBE, Puerto Rico
3	AS45773	PK	1,553	388	1,393	606	178	540	HECPERN-AS-PK PERN, Pakistan
4	AS15169	IN	1,271	438	1,286	1,209	438	1,242	GOOGLE – Google LLC, India
5	AS22616	US	1,264	503	1,526	883	377	1,014	ZSCALER- SJC, US
6	AS53813	IN	1,213	689	1,862	1,063	582	1,419	ZSCALER, India
7	AS1916	BR	1,062	94	991	326	37	277	Rede Nacional de Ensino e Pesquisa, Brazil
8	AS9658	PH	931	281	842	440	136	404	ETPI-IDS-AS-AP Eastern Telecoms, Philippines
9	AS37406	SS	888	486	972	582	365	599	RCS, South Sudan
10	AS263327	BR	882	345	438	776	289	359	ONLINE SERVICOS DE TELECOMUNICACOES, Brazil
11	AS17557	PK	835	430	777	431	277	413	Pakistan Telecommunication, Pakistan
12	AS36914	KE	834	476	937	583	354	670	KENET , Kenya
13	AS327687	UG	802	473	834	390	189	332	RENU, Uganda
14	AS680	DE	773	966	1332	268	117	289	DFN Verein zur Foerderung, Germany
15	AS201767	UZ	761	538	729	461	200	371	UZMOBILE, Uzbekistan
16	AS37682	NG	695	401	728	593	274	568	TIZETI, Nigeria
17	AS7470	TH	674	214	507	219	94	182	True Internet, Thailand
18	AS51167	DE	670	378	479	214	78	156	CONTABO, Germany
19	AS15525	PT	600	260	593	287	125	284	MEO-EMPRESAS, Portugal
20	AS14061	GB	594	468	672	260	169	313	DigitalOcean, United Kingdom
21	AS37130	ZA	585	5	464	414	0	260	SITA, South Africa
22	AS30998	NG	583	264	484	192	54	143	NAL, Nigeria
23	AS135407	PK	569	227	457	419	207	344	TES-PL-AS-AP Trans World, Pakistan
24	AS16814	AR	565	235	456	258	120	208	NSS, Argentina
25	AS132335	IN	563	17	30	538	17	23	LeapSwitch Networks, India
26	AS5438	TN	559	532	579	526	171	27	ATI,Tunisia
27	AS5466	IE	547	240	401	419	184	329	EIRCOM, Ireland
28	AS18002	IN	538	467	614	277	176	242	WORLDPHONE-IN AS, India
29	AS37209	NG	532	109	438	269	45	194	HYPERIA, Nigeria
30	AS37100	ZA	454	161	401	168	95	131	SEACOM-AS, South Africa
31	AS5588	CZ	453	175	430	186	102	162	GTSCE GTS Central Europe, Czechia
32	AS1103	NL	446	38	363	189	7	132	SURFnet, The Netherlands
33	AS17563	PK	402	117	359	207	64	199	Nexlinx, Pakistan
34	AS327724	UG	401	120	538	208	103	266	NITA, Uganda
35	AS7590	PK	400	122	329	266	84	224	COMSATS, Pakistan

Table 4 – Proportional Key Sentinel Measurements per day

Of these 35 networks there are 3 networks that appear to have turned off DNSSEC validation during the KSK roll and had not turned validation back on by the 17th October. These are AS10396 Coqui-NET – Datacom Caribe in Puerto Rico, AS 5438, ATI in Tunisia and AS132335 Leapswitch in India.

Evaluating the KSK Roll

The KSK Rollover Design Team report recommended:

Recommendation 16: Rollback of any step in the key roll process should be initiated if the measurement program indicated that a minimum of 0.5% of the estimated Internet end-user population has been negatively impacted by the change 72 hours after each change has been deployed into the root zone.

From the data gathered by an extensive measurement program spanning the KSK roll period, it appears that some 35 networks experienced some form of failure that impacted networks. Of these networks 3 appear to have turned off DNSSEC validation to recover the service, with the other 26 appear to have taken measures to load the KSK-2017 trust anchor and restore service to their users.

The number of users impacted by the KSK roll, using the measurement approach described above, appears to be of the order of some 0.2% to 0.3% of the Internet’s end-user population, which appears to be within the parameter specified by the KSK Roll Design Team.

Performing a KSK roll for the first time was always going to be a challenge. While it’s always hoped that deployed software will faithfully comply to all the relevant standard specifications and DNSSEC-validating resolvers will be in a position to either follow the KSK roll signals as described in RFC5011 or are managed by system admins who are well prepared to make the local configuration changes in time for the KSK roll. The deferral of the original schedule in September 2017 was accompanied by an extensive campaign to spread the message about the KSK roll and alert DNS service operators of the forthcoming changes. The work, predominately carried out by the Office of the CTO within ICANN, needs special mention as without this considerable effort these numbers would probably have been much higher.

When should we roll the KSK again?

Before looking at this question let me stress that this process to roll the KSK is not over yet. The old KSK, KSK-2010 has been replaced, but not revoked. Any DNSSEC-validating resolver that has been configured to trust KSK-2010 will still be doing so today. To complete the process the key needs to be removed from all these resolvers’ local trust anchor caches. Accordingly, KSK-2010 will re-appear in the root zone’s DNSKEY resource record on the 11th January 2019, but will be used as a signing key for the record with the revocation bit set. This entry will remain in the root zone until 22nd March 2019. There is no “hold-down” period, so resolvers should remove this key from their local cache of trust anchors as soon as they see this revoked key state. The extended publication period is a precautionary measure, as most resolvers will perform this key removal in the 48-hour period starting on the 11th January.

The KSK roll is not straightforward and performing it infrequently will always have its elements of surprise and inadvertent errors. There is much to be said for performing this roll annually, if only to promote the use of automated DNS resolver tools that track the KSK state without the need for manual intervention. However, regularly rolling the KSK achieves little in and of itself. We now should look at further measures for the root zone KSK.

The first is the use of an elliptical curve crypto algorithm for the KSK to replace the RSA-based algorithm. This allows the use of smaller DNS responses which reduces the issues associated with larger packets and packet fragmentation.

The second is consideration of the provision of a backup key which could enable some form of KSK roll that does not require a lead time for 12 months or more to use in the root zone. The general model of some form of backup key envisages the introduction of a key into the root zone that is present for an extended period such that could be rolled in as the new KSK with a shorted lead time than is currently accommodated in the current key management processes. One view of the one year hiatus in the installation of KSK-2017 was that KSK-2017 was already in a trusted state by mid-August 2017, and was essentially playing the role of a backup key for the ensuring 14 months.

The third is a review of the DNS trust key state reporting tools. RFC 8145 is a potentially informative signal, but it has a number of majopr weaknesses in terms of its informative value. It needs to be fixed or killed off! The key sentinel effort also needs to be reviewed. The idea of a “special” label imposes a hefty load on every resolver, and the measurement systems are very noise prone. Is there a way to device a sequence of DNS queries where the next query requires the client to have received the prior response? The CNAME concept is a possible candidate for such a measure, but more consideration is required at this stage.

The final measure in this list is consideration of the publication of a KSK chain. When a resolver is fired up with an old configuration its pre-configured KSK value will not match the current key. If the sequence of signed key changes were available, the resolver could find its configured KSK in the chain, then apply the forward rolls as described in the chain to bring itself into synchronisation with the current KSK value. This requires more rigorous analysis to ensure that it does not introduce any new vulnerabilities, but we need some mechanism to allow “old” systems to be brought into synchronisation with the current state without requiring a user to engage in a potentially complex key installation process.

There are probably more lessons to learn from this exercise, but perhaps that’s for a later time.

The bottom line for the 2018 KSK roll is that thanks to extensive preparation the entire process was largely trouble-free.

The KSK has been rolled and the Internet has survived it largely unscathed!

APNIC

Labs

Javascript is disabled