node-local-dns-cache DNS i/o timeout errors at startup #648

Open
rahul656 opened this issue Oct 14, 2024 · 10 comments

rahul656 commented Oct 14, 2024

We are seeing DNS i/o timeout failures for DNS lookups on newly spun-up EKS nodes, 1-2 minutes after the node-local-dns DaemonSet pod has started and is serving traffic.

The issue is more prominent in our large clusters.

Image version: 1.23.1
EKS version: 1.26.15

node-local-dns config

cluster.local:53 {
        errors
        cache {
                success 9984 61
                denial 9984 61
        }
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        health 169.254.20.10:8080
        }
    in-addr.arpa:53 {
        errors
        cache 61
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        }
    ip6.arpa:53 {
        errors
        cache 61
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__CLUSTER__DNS__ {
                force_tcp
        }
        prometheus :9253
        }
    .:53 {
        errors
        cache 61
        reload
        loop
        bind 169.254.20.10 172.20.0.10
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
        }

CoreDNS config

Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        loop
        cache 30
        reload
        loadbalance
    }

node-local-dns error message

[ERROR] plugin/errors: 2 **********. A: read udp 10.176.117.198:37607->10.64.0.2:53: i/o timeout

The upstream server here is the AWS Route 53 resolver (the VPC resolver), and I have confirmed through the VPC DNS query logs that the Route 53 resolver returned the response within a couple of milliseconds.
I have also confirmed that we are not hitting the AWS VPC resolver limit of 1024 packets per second from a single ENI; as mentioned, the Route 53 resolver returns the response fine.
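
For comparison, one way to check whether the delay is introduced by node-local-dns itself rather than by the upstream is to time the same lookup against both resolvers from an affected node. The addresses below come from the config and the error log above; the query name is only a placeholder for the masked name in the log.

    # Timed lookup through node-local-dns (listen address from the Corefile above)
    dig @169.254.20.10 example.com A | grep "Query time"

    # The same lookup directly against the VPC resolver seen in the error log
    dig @10.64.0.2 example.com A | grep "Query time"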

Simulation

I have tried the following methods to reproduce this in a test environment but have not been able to (a rough sketch of the load generation is shown after the list).

  • Tried with increased DNS load on a node-local-dns pod.
  • Tried with increased DNS load after disabling the node-local-dns cache.
  • Tried increasing the DNS load and restarting the node-local-dns pods (graceful as well as force kill).
  • Tried simulating xtables lock contention along with node-local-dns pod restarts (graceful as well as force kill).
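
As a rough illustration of the load generation used above (a sketch only; the query name, concurrency, and duration are placeholders, not the exact values from our tests):

    # Spawn 50 background query loops against the node-local-dns listen address
    # to generate sustained DNS load on the node.
    for i in $(seq 1 50); do
        ( while true; do
            dig @169.254.20.10 kubernetes.default.svc.cluster.local A +timeout=2 +tries=1 >/dev/null
          done ) &
    done
    wait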

tcpdump output

From the tcpdump it was noted that for approximately 3-5 seconds node-local-dns was hung/slow in returning responses to the clients, even though it had already received the responses from the upstream DNS server.
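
For reference, the capture was taken on the node along these lines (interface and output file are illustrative). Because the pod uses host networking, both legs of the exchange, client to node-local-dns and node-local-dns to upstream, are visible in one capture.

    # Capture DNS traffic on all interfaces of the node
    tcpdump -i any -nn port 53 -w /tmp/node-local-dns.pcap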

Here is the behaviour I noticed from the dump:

  • The pod sends a DNS query to the kube-dns service.
  • The DNS query is routed to the node-local-dns pod.
  • node-local-dns forwards the query to the AWS Route 53 resolver.
  • The AWS Route 53 DNS server responds.
  • The node-local-dns pod passes the response to the application container 1-4 seconds after receiving it from the upstream DNS server.

For the queries that end in i/o timeouts, I see that the node-local-dns pod returns a server failure to the clients.

Screenshot observations
  • Some responses are returned successfully 1-4 seconds after the response is received from the upstream DNS server.
  • For the queries where we see the i/o timeout, node-local-dns always returns a server failure to the clients.
  • The DNS queries from all application pods are impacted during this brief 3-5 second window in which node-local-dns takes longer to return responses to the application pods/clients (see the metrics check after this list).
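
One way to corroborate this from the node itself is to scrape the prometheus endpoint configured on :9253 during the affected window. The metric names below are the standard CoreDNS ones and may vary slightly between versions.

    # node-local-dns uses host networking, so the metrics port is reachable from the node
    curl -s http://localhost:9253/metrics | grep -E 'coredns_dns_request_duration_seconds|coredns_dns_responses_total'
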
@shayrybak

We saw the exact same behavior when trying to upgrade from 1.22.20 to 1.23.1.

@rahul656

Thanks @shayrybak for letting us know. Could you tell me if you rolled back to an older version to fix this behaviour?

@shayrybak

Yes, we rolled back to 1.22.20

@rahul656

Thanks @shayrybak for the reply. Unfortunately, our node-local-dns metrics were broken, so we cannot verify whether this issue started after the upgrade to 1.23.1. We will go back to the older version and test whether that helps. I will update here once we have tested on the older version.

@dermasmid

@rahul656 did it help?

@rahul656

@dermasmid We have to soak the changes in the lower environment for a week before we can deploy to prod. I will let you know sometime at the end of next week.


rahul656 commented Nov 4, 2024

@dermasmid Version 1.22.20 has not fixed the issue in our test environment, so we are not moving this change to our prod clusters. It seems the issue we face differs from what @shayrybak experienced.


rahul656 commented Nov 6, 2024

Since the failures happen during the first 3 minutes after node-local-dns startup, we are planning to increase the liveness probe initialDelaySeconds from 60 seconds to 180 seconds and test in the lower environment to see if that helps.
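
The change being tested is essentially the following patch (the daemonset name and container layout are the usual kube-system defaults and may differ in other deployments):

    # Bump the liveness probe initial delay on the node-local-dns daemonset from 60s to 180s.
    # Daemonset name and container index are assumptions based on the default manifest.
    kubectl -n kube-system patch daemonset node-local-dns --type=json \
      -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 180}]'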

@vgrigoruk

Hey @rahul656, I'm curious whether increasing initialDelaySeconds helped in your case, as we are experiencing exactly the same issue during the first few minutes after node-local-dns startup.

@rahul656

@vgrigoruk no, it has not helped. The iptables rules that redirect DNS traffic to the node-local-dns pod are added after a fixed interval of 60 seconds and do not depend on the liveness probe configuration, so increasing the liveness probe initialDelaySeconds made no difference; traffic was still routed to the node-local-dns pod after 60 seconds.
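
For anyone checking this on their own nodes, one way to see when traffic actually starts being redirected is to watch for the NOTRACK rules node-local-dns installs for its listen addresses. This is only a sketch; the exact rule layout depends on whether kube-proxy runs in iptables or ipvs mode.

    # Watch for the rules covering the link-local address and the kube-dns cluster IP
    # (169.254.20.10 and 172.20.0.10 in the Corefile above)
    watch -n 5 'iptables -t raw -S PREROUTING | grep -E "169.254.20.10|172.20.0.10"'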
