
rawops.dev

P2

DNS Resolution Failure — Troubleshooting Guide

Diagnose DNS resolution failures on Linux systems and Kubernetes clusters. Covers resolver config, upstream DNS, systemd-resolved, CoreDNS, and DNSSEC issues.

10 min · 8 steps

Check if the system can resolve any domain names.

dig google.com +short; echo '---'; dig FAILING_DOMAIN +short
Expected: If google.com resolves but FAILING_DOMAIN does not, the issue is domain-specific (missing or wrong DNS record, propagation delay, or DNSSEC). If neither resolves, the resolver itself is the problem.
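The decision described above can be sketched as a small shell helper. This is an illustrative sketch, not part of the original guide: the function name classify_dns and its three labels are assumptions. It treats lines starting with ';' as diagnostics, because dig can print timeout messages in that form even with +short.

```shell
# classify_dns CONTROL_ANSWER TARGET_ANSWER   (hypothetical helper)
# Both arguments are the raw output of `dig ... +short`.
# Lines beginning with ';' are dig diagnostics, not answers, so drop them.
classify_dns() {
    control=$(printf '%s\n' "$1" | sed '/^;/d')
    target=$(printf '%s\n' "$2" | sed '/^;/d')
    if [ -n "$target" ]; then
        echo "ok"                 # the failing domain resolved after all
    elif [ -n "$control" ]; then
        echo "domain-specific"    # control resolves, target does not
    else
        echo "resolver"           # nothing resolves
    fi
}

# Usage (FAILING_DOMAIN is the placeholder used throughout this guide):
# classify_dns "$(dig google.com +short)" "$(dig FAILING_DOMAIN +short)"
```

A result of "domain-specific" points at the later DNSSEC and public-resolver checks; "resolver" points at the resolver configuration checks.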

Verify which DNS servers the system is configured to use.

cat /etc/resolv.conf; echo '---'; (resolvectl status 2>/dev/null || systemd-resolve --status 2>/dev/null) | head -20
Expected: Shows the nameserver entries in use. Common issues: a nameserver of 127.0.0.53 (the systemd-resolved stub listener) while systemd-resolved is not running, or a nameserver that is unreachable.
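To pull just the nameserver addresses out of the resolver config, something like the following may help. It is a minimal sketch: the helper names list_nameservers and uses_stub_resolver are made up for illustration, and the file argument defaults to /etc/resolv.conf.

```shell
# list_nameservers [FILE]   (hypothetical helper)
# Print one nameserver address per line from a resolv.conf-style file.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# uses_stub_resolver [FILE]   (hypothetical helper)
# Succeeds if any nameserver is the systemd-resolved stub, 127.0.0.53.
uses_stub_resolver() {
    list_nameservers "$@" | grep -qx '127\.0\.0\.53'
}

# Usage:
# list_nameservers
# uses_stub_resolver && echo "resolution depends on systemd-resolved"
```

If the stub address is listed, the systemd-resolved check further down becomes the next thing to look at.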

Bypass local resolver and query public DNS directly to isolate the issue.

dig @8.8.8.8 FAILING_DOMAIN +short; echo '---'; dig @1.1.1.1 FAILING_DOMAIN +short
Expected: If the public resolvers return an answer, the problem is in your local DNS configuration or local resolver. If they also fail, the record may not exist, the domain's authoritative servers may be failing, or outbound DNS (port 53) may be blocked.
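For a side-by-side view of the local resolver and the public ones, a loop along these lines could be used. It is a sketch under assumptions: the function name compare_resolvers, the "local" keyword, and the NO-ANSWER marker are invented here; +time and +tries are standard dig options used to keep timeouts short.

```shell
# compare_resolvers DOMAIN [RESOLVER...]   (hypothetical helper)
# Prints "<resolver> <answers>" per resolver, or NO-ANSWER if none came back.
# The word "local" means the system default resolver (no @server argument).
compare_resolvers() {
    domain=$1
    shift
    [ "$#" -gt 0 ] || set -- local 8.8.8.8 1.1.1.1
    for r in "$@"; do
        if [ "$r" = local ]; then
            out=$(dig "$domain" +short +time=2 +tries=1 2>/dev/null || true)
        else
            out=$(dig "@$r" "$domain" +short +time=2 +tries=1 2>/dev/null || true)
        fi
        # Drop dig diagnostics (lines starting with ';') and join answers.
        answers=$(printf '%s\n' "$out" | sed '/^;/d' | tr '\n' ' ' | sed 's/ *$//')
        printf '%s %s\n' "$r" "${answers:-NO-ANSWER}"
    done
}

# Usage:
# compare_resolvers FAILING_DOMAIN
```

If only the "local" row shows NO-ANSWER, the fault is on this host or its configured resolver rather than with the domain.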

On many systemd-based distributions (for example Ubuntu 18.04 and later), systemd-resolved handles DNS. If it is down, resolution breaks.

systemctl status systemd-resolved --no-pager; echo '---'; resolvectl query FAILING_DOMAIN 2>&1
Expected: The service should be 'active (running)'. If it is inactive, start it (as root): systemctl start systemd-resolved
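In a script, the same check can be done without parsing the status output. The sketch below assumes the helper name resolved_state and its three labels, which are illustrative only; it relies on the exit codes of systemctl is-active and systemctl is-enabled.

```shell
# resolved_state   (hypothetical helper)
# Summarise systemd-resolved using systemctl exit codes only.
resolved_state() {
    if systemctl is-active --quiet systemd-resolved; then
        echo "running"
    elif systemctl is-enabled --quiet systemd-resolved 2>/dev/null; then
        # Enabled at boot but not active now: likely crashed or was stopped.
        echo "stopped-but-enabled"
    else
        # Disabled or not installed: this host may not rely on it at all.
        echo "stopped-and-not-enabled"
    fi
}

# Usage:
# [ "$(resolved_state)" = "stopped-but-enabled" ] && systemctl start systemd-resolved
```

The "stopped-and-not-enabled" case is only a concern when /etc/resolv.conf still points at 127.0.0.53.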

A DNSSEC validation failure makes a validating resolver return SERVFAIL even though the domain exists.

dig FAILING_DOMAIN; echo '---'; dig FAILING_DOMAIN +dnssec +cd; echo '---'; dig FAILING_DOMAIN +trace | tail -20
Expected: If the +cd (checking disabled) query returns an answer but the normal query returns SERVFAIL, DNSSEC validation is failing. Check the domain's DNSSEC configuration (DS and DNSKEY records, signature expiry).
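The comparison between the normal query and the +cd query can be automated by reading the status field from dig's header line. This is a rough sketch: dig_status and dnssec_verdict are hypothetical names, and the verdict labels are assumptions chosen for readability.

```shell
# dig_status   (hypothetical helper)
# Read full (not +short) dig output on stdin and print the response
# status, e.g. NOERROR, SERVFAIL, NXDOMAIN, from the ->>HEADER<<- line.
dig_status() {
    sed -n 's/.*->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# dnssec_verdict NORMAL_STATUS CD_STATUS   (hypothetical helper)
dnssec_verdict() {
    if [ "$1" = SERVFAIL ] && [ "$2" = NOERROR ]; then
        echo "dnssec-validation-failure"   # only fails when validation is on
    elif [ "$1" = "$2" ]; then
        echo "not-dnssec-related"          # same result with and without +cd
    else
        echo "inconclusive"
    fi
}

# Usage:
# normal=$(dig FAILING_DOMAIN | dig_status)
# unchecked=$(dig FAILING_DOMAIN +dnssec +cd | dig_status)
# dnssec_verdict "$normal" "$unchecked"
```

The verdict only makes sense when the resolver being queried actually performs DNSSEC validation.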

In K8s clusters, CoreDNS handles service discovery and DNS resolution.

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide && echo '---' && kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
Expected: CoreDNS pods should be Running. Logs may show upstream errors, SERVFAIL, or loop detection.
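When there are several CoreDNS replicas, it can be quicker to list only the unhealthy ones. The filter below is a sketch; the name not_running_pods is made up, and it assumes the default kubectl column order (NAME, READY, STATUS, ...) with --no-headers.

```shell
# not_running_pods   (hypothetical helper)
# Read `kubectl get pods --no-headers` output on stdin and print
# "NAME STATUS" for every pod whose STATUS column is not Running.
not_running_pods() {
    awk 'NF >= 3 && $3 != "Running" { print $1, $3 }'
}

# Usage:
# kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers | not_running_pods
```

Empty output means every matching pod reports Running. It does not by itself prove the pods are answering queries.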

Run a DNS test from within the cluster network to check CoreDNS.

kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup kubernetes.default.svc.cluster.local
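Expected: Should resolve to the ClusterIP of the kubernetes Service (10.96.0.1 with the default kubeadm service CIDR; otherwise the first address of your cluster's service range). If this fails, CoreDNS, the kube-dns Service, or pod-to-CoreDNS networking is broken.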

Clear cached DNS entries and test resolution again.

# systemd-resolved:
resolvectl flush-caches && resolvectl statistics

# nscd (if running):
nscd -i hosts 2>/dev/null

# Test again:
dig FAILING_DOMAIN +short
Expected: Flushing removes stale cache entries. If resolution works now, the cause was a stale or negative cached response.