dnsprobe v1

Diagnosing intermittent DNS failures

~4 min read · dnsprobe.net/blog

The DNS failure that takes down your site for 100% of users is annoying. The one that takes it down for 4% of users for 90 minutes is the one that ruins your week. By the time you log in, the symptom is gone, the support tickets are vague, and your standard "is the record there" check returns the right answer. Intermittent DNS is a category, not an incident — here is the debug flow that catches it.

1. Differentiate "actual DNS" from "looks like DNS"

Half of reported DNS failures are not DNS at all. Symptoms that masquerade as DNS:

  • App-level connect timeouts that the user attributes to "can't find the server".
  • TLS handshake failures where the browser shows a DNS-shaped error.
  • Local hosts file overrides.
  • VPN split-tunnel decisions that change which resolver is in use.

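Step zero is always: get the user to run dig +short failing-domain.com from the affected machine, and tell you the exact resolver IP it used (dig failing-domain.com with no +short shows the SERVER line at the bottom). Keep in mind that dig talks to the resolver directly and skips the hosts file, so a hosts override only shows up when you compare dig's answer with the address the application actually connects to.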
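
If the user pastes the full dig output back to you, the resolver IP can be pulled out mechanically. The helper below is a minimal sketch: resolver_from_dig is a hypothetical name, and it assumes the usual ";; SERVER: IP#PORT(NAME)" layout that dig prints in its footer.

```shell
# resolver_from_dig: read full dig output on stdin and print the resolver IP
# taken from the ";; SERVER:" footer line (assumed layout: IP#PORT(NAME)).
resolver_from_dig() {
    sed -n 's/^;; SERVER: \([^#]*\)#.*/\1/p'
}

# Usage on the affected machine:
#   dig failing-domain.com | resolver_from_dig
```

Everything before the # is captured, so IPv6 resolver addresses come through unchanged.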

2. Compare the authoritative answer against the recursive layer

Authoritative is the source of truth. The recursive layer caches and can diverge. Always check both:

# Authoritative
dig +short failing-domain.com @<your_ns_ip>
# Recursive sample: three of the major public resolvers
dig +short failing-domain.com @1.1.1.1
dig +short failing-domain.com @8.8.8.8
dig +short failing-domain.com @9.9.9.9

For a wider sample without typing the same command twelve times, run the hostname through dnscheck. The "partial propagation" verdict on any record type is the first clue.
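
For a quick local comparison, the same check can be scripted. This is a rough sketch, not a replacement for a proper propagation checker: query_all and verdict are hypothetical helper names, and the comparison treats each resolver's sorted answer set as a single string.

```shell
# query_all NAME RESOLVER...: print one line per resolver in the form
# "<resolver> <sorted, comma-joined answers>".
query_all() {
    name=$1; shift
    for r in "$@"; do
        ans=$(dig +short "$name" @"$r" | sort | paste -sd, -)
        printf '%s %s\n' "$r" "${ans:-<empty>}"
    done
}

# verdict: read query_all output on stdin; print "consistent" when every
# resolver returned the same answer set, otherwise "divergent".
verdict() {
    awk '{ seen[$2] = 1 }
         END { n = 0; for (k in seen) n++
               print (n <= 1 ? "consistent" : "divergent") }'
}

# Usage:
#   query_all failing-domain.com 1.1.1.1 8.8.8.8 9.9.9.9 | tee /dev/stderr | verdict
```

A "divergent" result is only a clue. Records served by geo-aware or load-balanced DNS can legitimately differ between resolvers, so compare against what the authoritative servers are meant to return.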

3. NS round-robin divergence

You have multiple authoritative nameservers. They are supposed to be in sync. Sometimes they are not: a zone push failed, a secondary's zone file got corrupted, or a manual edit was made on one but not the others. The recursive resolver picks one NS per query, so with N nameservers roughly 1 in N queries goes to the broken one.

Check by querying each NS directly:

# List the zone's nameservers, then ask each one directly for the record
for ns in $(dig +short NS failing-domain.com); do
    echo "=== $ns ==="
    dig +short failing-domain.com @"$ns"
done

Any disagreement here is the bug. Fix the zone sync mechanism.
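
Comparing SOA serials is a complementary check, because it shows which nameserver is behind even when the one record you looked at happens to match. The sketch below assumes your zone is replicated by a mechanism that keeps serials aligned (classic zone transfers do, some managed providers do not); serial_from_soa and serials_per_ns are hypothetical helper names.

```shell
# serial_from_soa: read "dig +short SOA" output on stdin and print the serial,
# the third field (mname rname serial refresh retry expire minimum).
serial_from_soa() {
    awk 'NF >= 7 { print $3; exit }'
}

# serials_per_ns ZONE: print "<ns> <serial>" for every listed nameserver.
serials_per_ns() {
    zone=$1
    for ns in $(dig +short NS "$zone"); do
        printf '%s %s\n' "$ns" "$(dig +short SOA "$zone" @"$ns" | serial_from_soa)"
    done
}

# Usage:
#   serials_per_ns failing-domain.com
```

A nameserver with a lower serial than its peers is the one that missed an update.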

4. Glue mismatch at the parent

The TLD-level glue records for your zone's NS hostnames must match the A records served by the NS themselves. If they disagree, clients sometimes resolve via glue and sometimes via the zone, hitting different IPs.

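dig +trace +additional failing-domain.com | grep -E "IN[[:space:]]+(NS|A|AAAA)[[:space:]]"

+trace hides the additional section by default, so +additional goes after it to bring the glue back. Compare the A records the TLD servers return alongside the NS referral with the A records you get by querying each NS hostname directly. Differences here cause intermittent routing failures.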
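
The glue can also be read straight from a referral by asking a parent server without recursion. The helper below is a sketch that assumes dig's default section layout; glue_from_referral is a hypothetical name, and a.gtld-servers.net is used in the usage comment only because it is one of the .com TLD servers (substitute a server for your own TLD).

```shell
# glue_from_referral: read full dig output on stdin and print "<name> <ip>"
# for every A/AAAA record found in the ADDITIONAL section.
glue_from_referral() {
    awk '
        /^;; ADDITIONAL SECTION:/ { in_add = 1; next }
        /^;;/ || /^$/             { in_add = 0 }
        in_add && ($4 == "A" || $4 == "AAAA") { print $1, $5 }
    '
}

# Usage:
#   dig +norec NS failing-domain.com @a.gtld-servers.net | glue_from_referral
#   dig +short A ns1.failing-domain.com      # compare with the glue above
```

If the referral carries no additional records, the NS hostnames are probably outside the zone and there is no glue to mismatch.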

5. UDP truncation and EDNS misnegotiation

Large DNS responses (a zone with many TXT records, big DNSSEC chains, many MX entries) exceed the classic 512-byte UDP limit, or the larger buffer size the client advertised via EDNS. The server then sets the truncation (TC) flag and the resolver should retry over TCP. Some middleboxes drop DNS over TCP, and some firewalls have stale ALG rules that mangle or drop large EDNS responses. The symptom is "the lookup works from my workstation but fails from the office router".

Test:

dig +tcp failing-domain.com @1.1.1.1
dig +bufsize=512 +ignore failing-domain.com @1.1.1.1     # force tiny UDP buffer, no TCP retry; look for the tc flag

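If the TCP query succeeds but the standard UDP query fails, you have a fragmentation or EDNS issue. If the small-buffer query comes back truncated and the TCP query then fails, TCP port 53 is being blocked somewhere on the path. Often fixable with smaller responses, or by making sure DNS over TCP is allowed both to and from the recursive resolver.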
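
Truncation is visible in the flags line of the dig header, so the check can be automated. A minimal sketch, assuming dig's usual ";; flags: qr tc rd ra; ..." header format; is_truncated is a hypothetical helper name.

```shell
# is_truncated: read full dig output on stdin; exit 0 when the header flags
# include "tc", meaning the server truncated the UDP response.
is_truncated() {
    grep -Eq '^;; flags:[^;]* tc[ ;]'
}

# Usage (+ignore stops dig from silently retrying over TCP):
#   dig +ignore +bufsize=512 failing-domain.com @1.1.1.1 | is_truncated \
#       && echo "response does not fit in 512 bytes"
```

A name that truncates at 512 bytes is not broken by itself. It only tells you that clients on paths where TCP or large UDP packets are filtered are the ones likely to fail.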

6. ANAME/ALIAS re-resolution at the authoritative layer

If your apex uses an ANAME-style synthesised record pointing at a CDN, the authoritative server resolves that target hostname on the client's behalf and returns whichever IP the upstream CDN gave it. Two things go wrong:

  • The authoritative server's resolution of the CDN hostname is cached longer than the CDN's intended TTL — so customers get a stale CDN IP for hours after the CDN moved.
  • The authoritative server resolves the CDN hostname from its own location, not from the client's location. So clients in Asia get the European CDN IP because that is what the authoritative server (in Europe) saw.

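Both are inherent to ANAME flattening. If your CDN advice is "always CNAME, never A", this is why. A true CNAME cannot sit at the apex, because it cannot coexist with the SOA and NS records that must live there. The fix is either to use a provider whose apex handling is integrated with the target (Cloudflare via flattening, Route53 via alias, etc.), which can avoid the stale, single-location lookup, or to serve the site from a subdomain such as www with a real CNAME and redirect the apex to it.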

7. Resolver-side rate limiting and refusal

Heavy query bursts from a single source can trigger a public recursive resolver's per-source rate limit. You start getting REFUSED responses. The user sees intermittent failures because most of their queries succeed and the rate-limited window is short.

Symptom: dig sometimes returns SERVFAIL or REFUSED from the same resolver IP it just succeeded against. Mitigation: distribute queries across multiple resolvers, run your own recursive resolver locally (unbound) or at least a local caching forwarder (dnsmasq, systemd-resolved), and back off bursty workloads.
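
To confirm the pattern, tally the response codes over a series of spaced-out queries from the affected source. This is a sketch under the assumption that dig prints its usual "->>HEADER<<-" line; rcode_from_dig and tally_rcodes are hypothetical helper names, and the one-second spacing is an arbitrary choice to avoid adding to the burst.

```shell
# rcode_from_dig: read full dig output on stdin and print the response status
# (NOERROR, SERVFAIL, REFUSED, ...) taken from the ->>HEADER<<- line.
rcode_from_dig() {
    sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z0-9]*\),.*/\1/p'
}

# tally_rcodes NAME RESOLVER COUNT: send COUNT queries one second apart and
# print how many times each status was seen.
tally_rcodes() {
    i=0
    while [ "$i" -lt "$3" ]; do
        dig "$1" @"$2" | rcode_from_dig
        i=$((i + 1))
        sleep 1
    done | sort | uniq -c
}

# Usage:
#   tally_rcodes failing-domain.com 8.8.8.8 30
```

Queries that time out produce no header at all, so they are missing from the tally. If the counts add up to less than COUNT, the difference is timeouts.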

8. Stub-resolver cache staleness

The last suspect is the client itself. macOS, Windows, browsers, Java VMs: every layer caches. After a DNS change, the stub may keep serving the old IP for 30-60 seconds after the recursive layer has caught up.

# macOS
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

# Linux (systemd-resolved)
sudo resolvectl flush-caches

# Windows
ipconfig /flushdns

# Java (set in code, before the first lookup)
java.security.Security.setProperty("networkaddress.cache.ttl", "60");

A general flow

  1. Get the failing query, the resolver IP, and the exact error.
  2. Repeat the query against the authoritative NS list directly.
  3. Repeat against the major public resolvers (1.1.1.1, 8.8.8.8, 9.9.9.9) and any region-specific ones your users hit.
  4. If the resolvers disagree with each other or with the authoritative answer, the issue is at the authoritative or recursive layer. If they all agree but the client still fails, the issue is the client stack or path.
  5. Inspect TCP fallback (+tcp), EDNS buffer size (+bufsize=512), DNSSEC (+cd).
  6. If everything looks correct, profile the timing — sometimes "intermittent" really means "always slow, fails when timeout fires before answer".
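
For the timing step, dig already reports per-query latency in its footer, so a rough profile needs only a loop and a summary. The helpers below are a sketch with hypothetical names (query_ms_from_dig, timing_summary), and they assume the ";; Query time: N msec" footer format.

```shell
# query_ms_from_dig: read full dig output on stdin and print the number of
# milliseconds from the ";; Query time: N msec" footer line.
query_ms_from_dig() {
    sed -n 's/^;; Query time: \([0-9]*\) msec.*/\1/p'
}

# timing_summary: read one latency (ms) per line; print count, min, mean, max.
timing_summary() {
    awk '{ v = $1 + 0
           if (NR == 1 || v < min) min = v
           if (NR == 1 || v > max) max = v
           sum += v }
         END { if (NR) printf "n=%d min=%d avg=%.1f max=%d\n", NR, min, sum / NR, max }'
}

# Usage:
#   for i in 1 2 3 4 5 6 7 8 9 10; do
#       dig failing-domain.com @1.1.1.1 | query_ms_from_dig
#   done | timing_summary
```

A maximum that sits close to the client's resolver timeout is the signature of "always slow, sometimes too slow".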

References: dig manpage, RFC 8767 (Serving Stale Data).