etcd database space exceeded due to many old retinaendpoints.retina.sh objects #1132

Open
tomglaza opened this issue Dec 12, 2024 · 0 comments
Labels
area/operator type/bug Something isn't working

Comments

@tomglaza

Describe the bug
Retina sometimes fails to remove retinaendpoints.retina.sh objects. They accumulate until etcd returns ‘etcdserver: mvcc: database space exceeded’ and the cluster stops operating.

To Reproduce
It is difficult to give clear reproduction steps because the problem (at least since the last update) occurs periodically. Most of the stale objects appear in namespaces where jobs are started by spark-operator. Many of the pods in these namespaces end up with the status Error, ContainerStatusUnknown or OOMKilled.
The last time I deleted all retinaendpoints.retina.sh objects (two weeks ago, when there were 10 times as many of them as pods), all was well for a while. Now the problem appears to have occurred again. Below is an etcd database summary:

[root@master-3 ~]# etcdctl get /registry --prefix --keys-only | grep -v ^$ | awk -F '/'  '{ h[$3]++ } END {for (k in h) print h[k], k}' | sort -nr | head
20811 events
8927 retina.sh
3785 cilium.io
3500 kyverno.io
2895 argoproj.io
1929 pods
1169 configmaps
1032 services
1010 replicasets
633 secrets

As you can see, the number of retina.sh objects (8927) is more than four times the number of pods (1929) and more than twice the number of cilium.io objects (3785), which in my opinion is incorrect.
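To narrow down which namespaces hold most of the retina.sh keys, the same key listing can be grouped by namespace instead of by API group. The function below is only a sketch: `count_keys_per_namespace` is a hypothetical helper name, and it assumes custom resources are stored under keys of the form `/registry/<group>/<resource>/<namespace>/<name>`, so the namespace is the fifth `/`-separated field.

```shell
# Count etcd keys per namespace for one resource prefix.
# Reads key names on stdin, one per line, as printed by
# `etcdctl get ... --prefix --keys-only`.
# Assumption: keys look like /registry/<group>/<resource>/<namespace>/<name>.
count_keys_per_namespace() {
  prefix="$1"   # e.g. /registry/retina.sh/retinaendpoints/
  awk -F '/' -v p="$prefix" '
    # Keep only keys that start with the prefix; blank lines never match.
    index($0, p) == 1 { n[$5]++ }
    END { for (ns in n) print n[ns], ns }
  ' | sort -k1,1nr -k2,2   # highest count first, then namespace name
}
```

It could be fed with `etcdctl get /registry/retina.sh/retinaendpoints/ --prefix --keys-only | count_keys_per_namespace /registry/retina.sh/retinaendpoints/`, which should show whether the spark-operator namespaces dominate the count.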

Expected behavior
The number of retina.sh objects in the etcd database should not significantly exceed the number of pod objects.
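One way to check this expectation is to list the RetinaEndpoint objects that no longer have a matching pod. The sketch below rests on an assumption not confirmed in this issue: that each RetinaEndpoint shares its namespace and name with the pod it describes. `find_orphaned_endpoints` is a hypothetical helper name, and both inputs are plain files with one `namespace/name` entry per line.

```shell
# Print RetinaEndpoint entries that have no pod with the same namespace/name.
# $1: file listing pods, $2: file listing RetinaEndpoints,
# both with one "namespace/name" entry per line.
# Assumption: a RetinaEndpoint has the same namespace and name as its pod.
find_orphaned_endpoints() {
  tmp_pods=$(mktemp)
  tmp_eps=$(mktemp)
  # comm needs both inputs sorted in the same collation order.
  LC_ALL=C sort -u "$1" > "$tmp_pods"
  LC_ALL=C sort -u "$2" > "$tmp_eps"
  # -13 suppresses lines unique to the pod list and lines common to both,
  # leaving only entries that exist solely in the endpoint list.
  LC_ALL=C comm -13 "$tmp_pods" "$tmp_eps"
  rm -f "$tmp_pods" "$tmp_eps"
}
```

The input files could be produced with `kubectl get pods -A --no-headers -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name | awk '{print $1 "/" $2}'` and the same command with `retinaendpoints.retina.sh` in place of `pods`. If the naming assumption holds, every line printed is a candidate stale object.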

Platform (please complete the following information):

  • OS: Alma Linux 8
  • Kubernetes Version: 1.30.6
  • Host: self-host
  • Retina Version: v0.0.19

@nddq nddq added type/bug Something isn't working area/operator labels Dec 12, 2024
@nddq nddq moved this to Accepted in Retina Triage Board Dec 12, 2024