Errors Encountered During nim-operator Startup #219

Open
wqlparallel opened this issue Nov 7, 2024 · 2 comments

@wqlparallel

1. Quick Debug Information

  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • NIM Operator Version: main

2. Issue or feature description

I have encountered a couple of issues while using the nim-operator, and I wanted to share them along with the solutions I found.

  1. Initial Startup Error:

I received the following error during the startup of nim-operator:

E1107 03:29:01.158923 1 reflector.go:158] "Unhandled Error" err="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v2.HorizontalPodAutoscaler: failed to list *v2.HorizontalPodAutoscaler: horizontalpodautoscalers.autoscaling is forbidden: User "system:serviceaccount:k8s-nim-operator-system:k8s-nim-operator-controller-manager" cannot list resource "horizontalpodautoscalers" in API group "autoscaling" at the cluster scope" logger="UnhandledError"

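Upon investigation, I discovered that the resource name is misspelled in the config/rbac/role.yaml file: horizontalpodautoscalars should be horizontalpodautoscalers. Because of the typo, the service account is never granted list/watch access to the real resource.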
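
For illustration, the corrected rule in config/rbac/role.yaml might look roughly like the following. This is only a sketch: the verbs list is an assumption (the error above only shows that list and watch are needed) and should match whatever the generated role already contains. Only the resource name spelling comes from the error message.

```yaml
# Sketch of the corrected ClusterRole rule; the verbs are assumed, not copied from the repo.
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers   # previously misspelled as "horizontalpodautoscalars"
  verbs:
  - get
  - list
  - watch
```

If the operator also creates or updates HPAs, the rule would presumably need the corresponding write verbs as well.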

  2. After correcting the spelling mistake, the nim-operator reported another error:

2024-11-07T04:24:46Z INFO Starting Controller {"controller": "nimservice", "controllerGroup": "apps.nvidia.com", "controllerKind": "NIMService"}
2024-11-07T04:24:46Z ERROR controller-runtime.source.EventHandler if kind is a CRD, it should be installed before calling Start {"kind": "NemoGuardrail.apps.nvidia.com", "error": "no matches for kind "NemoGuardrail" in version "apps.nvidia.com/v1alpha1""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:71
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:64

This issue was resolved by adding bases/apps.nvidia.com_nemoguardrails.yaml to the resources in config/crd/kustomization.yaml, so that the NemoGuardrail CRD is installed before the controller starts.
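
As a rough sketch, the change to config/crd/kustomization.yaml would look something like this. The other entries already in the file are omitted here rather than guessed. Only the nemoguardrails line is the addition described above.

```yaml
# config/crd/kustomization.yaml (sketch; existing entries are left out, not removed)
resources:
# ... existing CRD bases stay as they are ...
- bases/apps.nvidia.com_nemoguardrails.yaml   # added so the NemoGuardrail CRD gets installed
```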

With these changes, nim-operator is now running correctly:

kubectl get po -n k8s-nim-operator-system
NAME READY STATUS RESTARTS AGE
k8s-nim-operator-controller-manager-68d67d4b69-rr2rn 1/1 Running 0 177m

Additionally, I have made the relevant code changes. Would it be possible for me to submit these modifications to the community?

Thank you!

@shivamerla
Collaborator

@wqlparallel Thanks for pointing this out. Yes, please feel free to raise a PR with your suggested changes. With the Helm chart these are set up correctly, but installing directly from the generated manifests will fail. You need to update the RBAC here and generate the manifests again.

@Devin-Yue

How can this issue be reproduced? We have not seen it before.
