node-local-dns-cache DNS i/o timeout errors at startup #648
We saw the exact same behavior when trying to upgrade from 1.22.20 to 1.23.1.
Thanks @shayrybak for letting us know. Could you tell me if you rolled back to an older version to fix this behaviour?
Yes, we rolled back to 1.22.20.
Thanks @shayrybak for the reply. Unfortunately, our node-local-dns metrics were broken, so we cannot verify whether this issue started after the upgrade to 1.23.1. We will go back to the older version and test if that helps. I will update here once we have tested it on an older version.
@rahul656 did it help?
@dermasmid We have to soak the changes in the lower environment for a week before we can deploy to prod. I will let you know sometime at the end of next week.
@dermasmid Version 1.22.20 has not fixed the issue in our test environment, so we are not moving this change to our prod clusters. It seems the issue we face differs from what @shayrybak experienced.
Since the failures happen during the first 3 minutes after node-local-dns startup, we are planning to increase initialDelaySeconds from 60 seconds to 180 seconds and test in the lower environment to see if that helps.
Hey @rahul656. I'm curious whether increasing initialDelaySeconds helped in your case, as we experience exactly the same issue during the first few minutes after node-local-dns startup.
@vgrigoruk no, it has not helped. The iptables rules that redirect DNS traffic to the node-local-dns pods are added after a fixed interval of 60 seconds and do not depend on the liveness probe configuration, so increasing initialDelaySeconds made no difference: traffic was still routed to the node-local-dns pods after 60 seconds.
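For context, here is a sketch of the liveness probe stanza being tuned, modeled on the upstream node-local-dns DaemonSet manifest; the exact values used in any given deployment may differ.

```yaml
# Sketch only: modeled on the upstream node-local-dns DaemonSet manifest,
# not the exact production config discussed in this issue.
livenessProbe:
  httpGet:
    host: 169.254.20.10      # link-local address node-local-dns serves /health on
    path: /health
    port: 8080
  initialDelaySeconds: 180   # raised from the default 60; per the discussion above,
                             # this does not delay the iptables redirect rules, which
                             # are applied on a fixed ~60 s interval independent of
                             # probe settings
  timeoutSeconds: 5
```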
We are seeing DNS i/o timeout failures for DNS lookups on a newly spun-up EKS node, starting 1-2 minutes after the node-local-dns DaemonSet pod has started and begun serving traffic.
The issue is more prominent in our large clusters.
Image version - 1.23.1
EKS version - 1.26.15
node-local-dns config:
CoreDNS config:
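For reference, below is a trimmed sketch of a node-local-dns Corefile following the upstream ConfigMap template. It is illustrative only, not the exact config from this cluster; the `__PILLAR__` placeholders are substituted at deploy time or rewritten by node-cache at startup.

```yaml
# Illustrative sketch based on the upstream node-local-dns ConfigMap template;
# this is not the actual config from the affected cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind __PILLAR__LOCAL__DNS__ __PILLAR__DNS__SERVER__
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
        }
        prometheus :9253
        health __PILLAR__LOCAL__DNS__:8080
    }
    .:53 {
        errors
        cache 30
        reload
        loop
        bind __PILLAR__LOCAL__DNS__ __PILLAR__DNS__SERVER__
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }
```

With a layout like this, the `i/o timeout` errors shown below are logged when the forward plugin does not see a reply from the configured upstream within its read timeout, even if the upstream itself answered quickly.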
node-local-dns error message:
[ERROR] plugin/errors: 2 **********. A: read udp 10.176.117.198:37607->10.64.0.2:53: i/o timeout
The upstream server here is the AWS Route 53 DNS resolver, and I have confirmed through the VPC DNS query logs that the Route 53 resolver returned the response fine within a couple of milliseconds.
I have also confirmed it is not the AWS VPC resolver limit of 1024 packets per second from a single ENI; as mentioned, the Route 53 resolver returns the response fine.
Simulation -
I have tried the following methods to reproduce this in a test environment but have not been able to do so.
tcpdump output
From the tcpdump it was noted that, for approximately 3-5 seconds, node-local-dns was hung/slow in returning responses to the clients, even though it had already received the responses from the upstream DNS server.
Here is the behaviour I noticed in the dump:
For the i/o failures, node-local-dns returns a server failure (SERVFAIL) to the clients.