30 January 2024 – A massive outage affecting all national domain zones of Runet drew the attention of experts around the world to the resilience of the existing Internet infrastructure, and in particular to critical security technologies such as DNSSEC.
In this article, I would like to summarise the key points of this incident using concrete evidence, and draw some conclusions about the causes of this significant event.
The main cause of the incident was probably the excessive ambition of the Moscow Internet Exchange (MSK-IX), which is trying to take the lead in managing the Runet infrastructure with a view to its possible isolation and separation from the global network.
The aim was to control not only traffic exchange nodes, but also national domain zones. This also indicates a desire to test the possibility of controlling the national domain zone as part of gaining full control over the Internet within the country, and is actually the first practical step in creating a “sovereign” Internet.
However, as we can see, transferring DNSSEC management from the Technical Centre for Internet (TCI), which has traditionally supported the national domains, to MSK-IX is not a trivial task, especially given the exodus of qualified personnel.
MSK-IX’s attempt to become the backup DNSSEC centre without sufficient testing and pre-validation resulted in a crash and a complete outage of the affected zones.
Rather than sharing responsibility and improving system reliability, MSK-IX’s actions created a second point of failure, which once again highlights the importance of careful planning and testing when changing critical infrastructure. More generally, it exposes a shortcoming of the current Internet architecture, in which the actions of a single small operator can cause widespread problems.
What happened in the end? As you will see below, the main problem was errors in the KSK signature.
It appears that the MSK-IX setup did not implement the offline KSK signing that the Technical Centre for Internet (TCI) relies on, and only one KSK was observed in the zone.
As a result, an error occurred during the creation of the .ru zone file. That is, despite the presence of the correct keys, the signatures of the DNS records were incorrect, leading to a failure in domain resolution and a widespread system outage.
This bug caused problems with key rotation and the dual signature mechanism.
Normally the process consists of routine KSK and ZSK rotation. Because of the bug, the old ZSK (44301) was brought back and became active alongside the existing problematic ZSK (52263), which produced a double-signature rollback scenario. Even then, signatures created by ZSK 52263 still failed verification, which indicates that the signatures themselves were cryptographically invalid rather than made with the wrong key.
The staff attempted to remedy the situation by removing the RRSIG for the keyset, which broke the chain of trust from the root. Eventually, the zone was “repaired” by removing the offending signatures (a rollback to the state earlier in the day).
Analysing the situation, it seems likely that the well-functioning legacy DNSSEC procedures in .ru and related zones were disturbed by experimental changes, possibly made while creating a redundant DNSSEC infrastructure, which introduced an additional point of failure.
The incident also revealed a lack of transparency in DNSSEC procedures, particularly in verifying who actually produced the signatures. This is a precedent that should be seriously analysed by all entities responsible for network resilience, not only in Russia.
To illustrate these points, let’s look at two DNS queries made during the outage:
Query to Server 62.76.76.62:
Query to Server 8.8.8.8 (Google DNS):
The incident was eventually resolved after one of the signatures (probably the one produced by MSK-IX) was disabled, which led to a gradual recovery. The outage highlights the critical need for transparency, coordination and testing in the management of key Internet infrastructure.
The diagrams show a number of DNSKEY records and their associated chain of trust, as well as some DS and RRSIG records (DNSSEC signatures). Here is a general description of what these records represent, based on common DNSSEC practice:
DNSKEY Records: These are cryptographic keys used by DNSSEC to sign and secure a DNS zone. There are usually two types of keys: the Key Signing Key (KSK), which signs the zone's DNSKEY set, and the Zone Signing Key (ZSK), which signs the zone data.
DS Records: These are Delegation Signer records that hold a hash of the KSK and are used to establish a chain of trust from the parent zone to the child zone.
RRSIG Records: These are signatures that correspond to each DNS record, ensuring their authenticity and integrity.
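The record types above can be made concrete with a short sketch. It assumes the standard DNSKEY wire layout (flags, protocol, algorithm, public key), the key tag checksum from RFC 4034 Appendix B (which does not apply to the obsolete algorithm 1), and the usual convention that flags 257 mark a KSK and flags 256 a ZSK. The public key bytes below are a made-up placeholder, not a real .ru key, and the helper names are illustrative only.

```python
import hashlib
import struct


def dnskey_rdata(flags, algorithm, public_key, protocol=3):
    """Build DNSKEY RDATA: flags (2 bytes), protocol, algorithm, public key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata):
    """Key tag as a 16-bit checksum over the DNSKEY RDATA (RFC 4034, App. B)."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def key_role(flags):
    """By convention the SEP bit (lowest bit of the flags) marks a KSK."""
    return "KSK" if flags & 0x0001 else "ZSK"


def wire_name(name):
    """Encode a domain name in lowercase DNS wire format."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def ds_digest_sha256(owner, rdata):
    """DS digest (type 2): SHA-256 over owner name followed by DNSKEY RDATA."""
    return hashlib.sha256(wire_name(owner) + rdata).hexdigest()


# Placeholder key material, for illustration only.
example = dnskey_rdata(flags=257, algorithm=8, public_key=b"\x01\x02\x03\x04")
print(key_role(257), key_tag(example), ds_digest_sha256("ru.", example))
```

The numbers quoted in the timeline, such as 43786 or 52263, are key tags of this kind: short identifiers that let a resolver pick the candidate key for an RRSIG, not proof that the signature is valid.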
The diagrams likely show the process of key rotation and signature validation.
Where they show validation failures (indicated by warning symbols), this points to a misconfiguration or a problem in the key rollover process.
These issues can cause resolution problems for the affected domain, which would manifest as an inability to resolve domain names (resulting in SERVFAIL errors) until the configuration is corrected and the correct keys are propagated.
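A check for this symptom can be sketched with nothing but the standard library. The sketch builds a DNS query with the EDNS0 DO bit set, sends it over UDP, and reads the response code and the AD flag from the header. The function names, the fixed query ID and the 1232-byte payload size are choices made for this illustration. Comparing a normal query with one that sets the CD (checking disabled) bit is a common way to tell a DNSSEC validation failure from an ordinary outage: a validating resolver that answers SERVFAIL without CD but returns data with CD is rejecting the signatures.

```python
import socket
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}
TYPE_SOA, TYPE_OPT = 6, 41


def build_query(name, qtype=TYPE_SOA, query_id=0x1234, cd=False):
    """Build a DNS query with RD set, optional CD, and an EDNS0 OPT record (DO=1)."""
    flags = 0x0100 | (0x0010 if cd else 0)
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 1)
    labels = [l for l in name.rstrip(".").split(".") if l]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # class IN
    # OPT: root name, type 41, UDP size 1232, TTL field carries DO bit, no RDATA
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, 1232, 0x00008000, 0)
    return header + question + opt


def parse_header(response):
    """Extract the response code and the AD (authenticated data) flag."""
    _query_id, flags = struct.unpack("!HH", response[:4])
    rcode = flags & 0x000F
    return {"rcode": RCODES.get(rcode, str(rcode)), "ad": bool(flags & 0x0020)}


def ask(server, name, cd=False, timeout=3.0):
    """Send one query over UDP and return the parsed header fields."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, cd=cd), (server, 53))
        response, _ = sock.recvfrom(4096)
    return parse_header(response)


# Usage (needs network access), e.g. against a public validating resolver:
#   ask("8.8.8.8", "example.ru")           # validation enforced
#   ask("8.8.8.8", "example.ru", cd=True)  # validation skipped by the resolver
```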
2024-01-30 and earlier – Initial State of DNSSEC Trust Chain:
A valid DNSSEC trust chain is established from the root KSK (Key Signing Key) 20326 via the root ZSK 30903.
The .ru KSK 43786 signs the ZSK (Zone Signing Key) 44301, which in turn signs the zone data.
2024-01-30 15:27 – Erroneous Signatures for Zone Records:
Key rotation is executed, and the zone signing key is now ZSK 52263. The resulting .ru RRSIGs are broken, so validation fails for every domain under .ru.
2024-01-30 16:26 – Attempted Recovery of keyset RRSIGs:
The staff probably suspected a corrupted keyset RRSIG (the keyset is the set of KSK 43786, ZSK 44301 and ZSK 52263, signed with KSK 43786) rather than the zone record signatures made with ZSK 52263. The attempt to fix the keyset and its RRSIG led to the situation shown on Diagram 2: the chain of trust from the root broke.
2024-01-30 18:59 – Double signature ZSK rotation (rollback to ZSK 44301)
At some point the .ru staff decided to roll back to a stable state with the zone signed by ZSK 44301. For a while both the new RRSIGs from ZSK 52263 and the old ones from ZSK 44301 were present in the zone file. The signatures made by ZSK 52263 were corrupted. This is still a valid state, because resolvers are free to use either of the RRSIGs. The state is shown on Diagram 3.
2024-01-30 19:07 – Incident resolution
From this moment the .ru DNSSEC state reverts to the stable state it was in before the incident: the zone contains KSK 43786, ZSK 44301 and ZSK 52263. This keyset is signed by KSK 43786. The state is shown on Diagram 4.
2024-01-31 14:17 – Second attempt to make ZSK rotation
The ZSK rotation still had to be completed, so the .ru staff made a second attempt. Now the zone records are signed with ZSK 52263, and the corresponding RRSIGs are valid (Diagram 5).
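The timeline can be replayed as a toy model to show why some states validate and others do not. This is a simplified sketch, not a real validator: the `verifies` field stands in for the outcome of the cryptographic check, the class and function names are invented for the illustration, and two details are assumptions not stated in the timeline, namely that the keyset signature was back in place by 18:59 and that the record signatures at 16:26 were still the corrupted ones.

```python
from dataclasses import dataclass

KSK, OLD_ZSK, NEW_ZSK = 43786, 44301, 52263  # key tags quoted in the timeline


@dataclass(frozen=True)
class Rrsig:
    key_tag: int    # key the signature claims to be made with
    verifies: bool  # stand-in for the result of the cryptographic check


@dataclass(frozen=True)
class ZoneState:
    label: str
    dnskeys: frozenset  # key tags published in the DNSKEY RRset
    keyset_sigs: tuple  # RRSIGs covering the DNSKEY RRset
    record_sigs: tuple  # RRSIGs covering ordinary zone records


def has_valid_sig(sigs, allowed_tags):
    """One verifying signature from an allowed key is sufficient."""
    return any(s.verifies and s.key_tag in allowed_tags for s in sigs)


def validate(state, ds_tag=KSK):
    """The keyset must be signed by the key the parent DS points to."""
    if ds_tag not in state.dnskeys or not has_valid_sig(state.keyset_sigs, {ds_tag}):
        return "bogus: chain of trust broken at the keyset"
    if not has_valid_sig(state.record_sigs, state.dnskeys):
        return "bogus: no valid RRSIG over zone records"
    return "secure"


ALL_KEYS = frozenset({KSK, OLD_ZSK, NEW_ZSK})
KEYSET_OK = (Rrsig(KSK, True),)

TIMELINE = [
    ZoneState("initial state", frozenset({KSK, OLD_ZSK}), KEYSET_OK,
              (Rrsig(OLD_ZSK, True),)),
    ZoneState("01-30 15:27", ALL_KEYS, KEYSET_OK, (Rrsig(NEW_ZSK, False),)),
    ZoneState("01-30 16:26", ALL_KEYS, (), (Rrsig(NEW_ZSK, False),)),
    ZoneState("01-30 18:59", ALL_KEYS, KEYSET_OK,
              (Rrsig(NEW_ZSK, False), Rrsig(OLD_ZSK, True))),
    ZoneState("01-30 19:07", ALL_KEYS, KEYSET_OK, (Rrsig(OLD_ZSK, True),)),
    ZoneState("01-31 14:17", ALL_KEYS, KEYSET_OK, (Rrsig(NEW_ZSK, True),)),
]

for state in TIMELINE:
    print(f"{state.label}: {validate(state)}")
```

The 18:59 entry is the interesting one: the corrupted signatures from ZSK 52263 are still present, but the zone validates because a single good RRSIG from ZSK 44301 is enough.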
No clear, publicly accessible postmortem of the incident has been published, so the Internet community has no basis for trusting that all security-related concerns were taken into account during remediation.
Moreover, the “.RU DNSSEC Policy” and “.RU DNSSEC Practice Statement” documents are not publicly available, so nobody can freely check whether the DNSSEC ceremony is conducted as the responsible party stated to IANA and ICANN.
The evidence indicates that the problem lay not in the chain of trust but in the RRSIGs themselves: the key used was the correct one, but the signatures were corrupted.
As already mentioned, the whole procedure for the .RU zone is opaque, and we do not know the full story of the problematic ZSK 52263: why it has not been discarded, and what grounds others have to trust that it is being used correctly now.
Opacity in Internet governance is a more serious threat than it appears at first glance. Attempts to create isolated and fully government-controlled segments will undoubtedly lead to an increase in such failures and a loss of trust in tools such as DNSSEC on the part of the Internet community. In general, the current structure of the Internet as we see it is very vulnerable and largely dependent on the human factor. An alternative could be blockchain-based technologies such as those described in this article.