node-local-dns-cache DNS i/o timeout errors at startup #648
We saw the exact same behavior when trying to upgrade from 1.22.20 to 1.23.1.
Thanks @shayrybak for letting us know. Could you tell me if you rolled back to an older version to fix this behaviour?
Yes, we rolled back to 1.22.20.
Thanks @shayrybak for the reply. Unfortunately, our node-local-dns metrics were broken, so we cannot verify whether this issue started after the upgrade to 1.23.1. We will go back to the older version and test if that helps. I will update here once we have tested it on an older version.
@rahul656 did it help?
@dermasmid We have to soak the changes in the lower environment for a week before we can deploy to prod. I will let you know sometime at the end of next week.
@dermasmid Version 1.22.20 has not fixed the issue in our test environment, so we are not moving this change to our prod clusters. It seems the issue we face differs from what @shayrybak experienced.
Since the failures happen during the first 3 minutes after node-local-dns startup, we are planning to increase initialDelaySeconds from 60 seconds to 180 seconds and test in the lower environment to see if that helps.
Hey @rahul656. I'm curious whether increasing initialDelaySeconds helped in your case, as we experience exactly the same issue during the first few minutes after node-local-dns startup.
@vgrigoruk no, it has not helped. The iptables rules that redirect DNS traffic to the node-local-dns pods are added after a fixed interval of 60 seconds and do not depend on the liveness probe configuration, so increasing initialDelaySeconds made no difference: traffic was still routed to the node-local-dns pods after 60 seconds.
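For context, here is a sketch of the liveness probe stanza being tuned, modeled on the upstream node-local-dns DaemonSet manifest; the exact values used in any given deployment may differ.

```yaml
# Sketch only: modeled on the upstream node-local-dns DaemonSet manifest,
# not the exact production config discussed in this issue.
livenessProbe:
  httpGet:
    host: 169.254.20.10      # link-local address node-local-dns serves /health on
    path: /health
    port: 8080
  initialDelaySeconds: 180   # raised from the default 60; per the discussion above,
                             # this does not delay the iptables redirect rules, which
                             # are applied on a fixed ~60 s interval independent of
                             # probe settings
  timeoutSeconds: 5
```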
We are seeing DNS i/o timeout failures for DNS lookups on a newly spun-up EKS node, starting 1-2 minutes after the node-local-dns DaemonSet pod has started and begun serving traffic.
The issue is more prominent in our large clusters.
Image version - 1.23.1
EKS version - 1.26.15
node-local-dns config:
CoreDNS config:
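For reference, below is a trimmed sketch of a node-local-dns Corefile following the upstream ConfigMap template. It is illustrative only, not the exact config from this cluster; the `__PILLAR__` placeholders are substituted at deploy time or rewritten by node-cache at startup.

```yaml
# Illustrative sketch based on the upstream node-local-dns ConfigMap template;
# this is not the actual config from the affected cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind __PILLAR__LOCAL__DNS__ __PILLAR__DNS__SERVER__
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
        }
        prometheus :9253
        health __PILLAR__LOCAL__DNS__:8080
    }
    .:53 {
        errors
        cache 30
        reload
        loop
        bind __PILLAR__LOCAL__DNS__ __PILLAR__DNS__SERVER__
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }
```

With a layout like this, the `i/o timeout` errors shown below are logged when the forward plugin does not see a reply from the configured upstream within its read timeout, even if the upstream itself answered quickly.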
node-local-dns error message:
[ERROR] plugin/errors: 2 **********. A: read udp 10.176.117.198:37607->10.64.0.2:53: i/o timeout
The upstream server here is the AWS Route 53 DNS resolver, and I have confirmed through the VPC DNS query logs that the Route 53 resolver returned the response fine within a couple of milliseconds.
I have also confirmed it is not the AWS VPC resolver limit of 1024 packets per second from a single ENI; as mentioned, the Route 53 resolver returns the response fine.
Simulation -
I have tried the following methods to reproduce this in a test environment but have not been able to do so.
tcpdump output
From the tcpdump it was noted that, for approximately 3-5 seconds, node-local-dns was hung/slow in returning responses to the clients, even though it had already received the responses from the upstream DNS server.
Here is the behaviour I noticed in the dump:
For the i/o failures, node-local-dns returns a server failure (SERVFAIL) to the clients.