Inside a microservices cluster, DNS is the universal service discovery mechanism. The defaults work until they don't: short TTLs hammer the DNS server, long TTLs delay failover. The fix lies in the architecture, not in tuning the values.

K8s DNS basics

Each Service gets a cluster DNS name of the form <service>.<namespace>.svc.cluster.local. CoreDNS resolves that name to the Service's ClusterIP (kube-proxy then load-balances across pods) or, for headless Services, directly to the pod IPs. The default TTL on these records is 5s. This is the standard mechanism for in-cluster service discovery.
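As a rough illustration of the naming scheme above, here is a minimal Python sketch that builds a Service's fully qualified name and asks the system resolver for its addresses. The helper names service_fqdn and resolve are made up for this example, and the cluster domain is assumed to be the common cluster.local default.

```python
import socket


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve(name: str) -> list[str]:
    """Return the distinct IP addresses the system resolver reports for name."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Inside a pod, resolve(service_fqdn("payments", "prod")) (a hypothetical service and namespace) would be expected to return a single ClusterIP for a regular Service and one address per ready pod for a headless one. Outside the cluster the name will not resolve.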

TTL trade-offs

Short TTL (5-30s): fast failover, high DNS load. Long TTL (5-30min): low load, but clients keep connecting to stale IPs after pods rotate. The 5s default is fine at typical cluster sizes; at very large scale the query volume can become a bottleneck.
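The trade-off can be put into back-of-the-envelope numbers. The sketch below assumes a deliberately simple model: every client caches independently, honors the TTL, and re-resolves each name once per TTL. The function name upstream_qps and the client counts are illustrative, not measurements.

```python
def upstream_qps(clients: int, names_per_client: int, ttl_seconds: float) -> float:
    """Rough upper bound on queries/s if each client re-resolves each name once per TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return clients * names_per_client / ttl_seconds


# Hypothetical cluster: 2000 client pods, each talking to 10 services.
for ttl in (5, 30, 300, 1800):
    qps = upstream_qps(clients=2000, names_per_client=10, ttl_seconds=ttl)
    # Worst-case staleness after an IP change is roughly one full TTL.
    print(f"TTL {ttl:>4}s: up to {qps:7.1f} queries/s, worst-case staleness ~{ttl}s")
```

Under this model, query load falls in proportion to the TTL while the staleness window grows by the same factor, which is why tuning the value alone only moves the problem around.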

NodeLocal DNSCache

A per-node DNS cache. Pods talk to the cache on their own node, and the cache talks to CoreDNS, which can reduce load on CoreDNS by 10-100x. It is a standard add-on for any cluster over 100 nodes. Run it.
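To see why a shared per-node cache takes so much load off the upstream, consider this toy simulation. It is not how NodeLocal DNSCache is implemented; TtlCache, fake_coredns, the address, and the pod counts are all invented for illustration, and the clock is passed in explicitly so the run is deterministic.

```python
from typing import Callable


class TtlCache:
    """Tiny TTL cache standing in for a per-node DNS cache (illustrative only)."""

    def __init__(self, upstream: Callable[[str], list[str]], ttl_seconds: float) -> None:
        self._upstream = upstream
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, list[str]]] = {}
        self.upstream_queries = 0

    def lookup(self, name: str, now: float) -> list[str]:
        entry = self._entries.get(name)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: answered on the node
        self.upstream_queries += 1  # miss or expired: ask the upstream once
        addresses = self._upstream(name)
        self._entries[name] = (now + self._ttl, addresses)
        return addresses


def fake_coredns(name: str) -> list[str]:
    return ["10.96.0.42"]  # made-up ClusterIP


cache = TtlCache(fake_coredns, ttl_seconds=5)
pod_lookups = 0
for second in range(60):      # one simulated minute
    for _pod in range(30):    # 30 pods on the node, one lookup each per second
        cache.lookup("payments.prod.svc.cluster.local", now=float(second))
        pod_lookups += 1

print(pod_lookups, cache.upstream_queries)
```

In this setup the upstream sees one query per TTL window per name, regardless of how many pods on the node ask. The real reduction depends on how many pods share a node and how much their lookups overlap.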

DNS in app code

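Some clients cache DNS results themselves. The JVM's InetAddress cache holds successful lookups for 30s by default in OpenJDK when no security manager is installed (the networkaddress.cache.ttl security property controls this), and a pooled keep-alive connection stays pinned to the IP it was opened against for its whole lifetime. After a pod IP change, such clients keep using stale IPs. Configure the client's DNS TTL, or use a connection pool that respects DNS changes.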
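One way to make a client "respect DNS changes" is to re-resolve the name on a bounded schedule and drop pooled connections when the address set changes. The following is a minimal sketch of that idea, not any particular library's API; EndpointWatcher, system_resolver, and max_age_seconds are hypothetical names.

```python
import socket
from typing import Callable, Optional


def system_resolver(name: str) -> frozenset[str]:
    """Resolve name through the OS resolver and return the set of IPs."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    return frozenset(info[4][0] for info in infos)


class EndpointWatcher:
    """Re-resolves a name at most once per max_age_seconds and reports changes."""

    def __init__(
        self,
        name: str,
        max_age_seconds: float,
        resolver: Callable[[str], frozenset[str]] = system_resolver,
    ) -> None:
        self._name = name
        self._max_age = max_age_seconds
        self._resolver = resolver
        self._addresses: frozenset[str] = frozenset()
        self._resolved_at: Optional[float] = None

    @property
    def addresses(self) -> frozenset[str]:
        return self._addresses

    def refresh(self, now: float) -> bool:
        """Return True if the address set changed (caller should recycle its pool)."""
        if self._resolved_at is not None and now - self._resolved_at < self._max_age:
            return False  # last answer is still young enough; skip the lookup
        fresh = self._resolver(self._name)
        self._resolved_at = now
        changed = fresh != self._addresses
        self._addresses = fresh
        return changed
```

A caller would typically invoke refresh(time.monotonic()) before borrowing a connection and close idle connections whenever it returns True. The first call always reports a change, because the watcher starts with an empty address set.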

Common DNS failures

CoreDNS pod overloaded → high resolution latency → app timeouts that look like network issues. NodeLocal cache OOM-killed → DNS lookups fail → 5xx storm. Monitor DNS resolution latency; many teams don't.
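Measuring resolution latency from the application's point of view does not take much code. The sketch below times individual lookups and summarizes them with a nearest-rank percentile. The function names are illustrative, and exporting the numbers to a metrics system is left out.

```python
import math
import socket
import time
from typing import Callable


def _system_lookup(name: str) -> None:
    socket.getaddrinfo(name, None)


def time_lookup(
    name: str,
    resolver: Callable[[str], object] = _system_lookup,
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[float, bool]:
    """Return (elapsed_seconds, succeeded) for a single lookup."""
    start = clock()
    try:
        resolver(name)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return clock() - start, ok


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of samples, with pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting time_lookup results on a timer and alerting on the 99th percentile and on the failure ratio would surface both failure modes above: the overloaded CoreDNS pod appears as rising latency, and the dead node-local cache as failed lookups.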

The recipe: NodeLocal DNSCache, a reasonable TTL, client-side cache awareness, and DNS latency monitoring. DNS is infrastructure; treat it as such.