
"Cannot connect to the Docker daemon" errors appear more frequently as more runners we deploy #3828

Open · snavarro-factorial opened this issue on Nov 29, 2024 · 2 comments

Labels: bug, gha-runner-scale-set, needs triage

Comments

snavarro-factorial commented on Nov 29, 2024


Controller Version

0.9.3

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

  1. Set various variables on the runner to try to prevent the error from appearing:

     - name: STARTUP_DELAY_IN_SECONDS
       value: "10"
     - name: DISABLE_WAIT_FOR_DOCKER
       value: "false"
     - name: DOCKER_ENABLED
       value: "true"
     - name: WAIT_FOR_DOCKER_SECONDS
       value: "180"

  2. Scale the total number of runners above roughly 200.
  3. Run any pipeline that executes any kind of "docker pull/build/etc." command (see the example workflow below).
     - With fewer than ~200 runners, only 1 or 2 runners fail per week because of Docker.
     - With more than ~200 runners, around a third of the runners fail and their jobs have to be re-run.
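
For reference, a minimal workflow of the kind that triggers the failure could look like this (the workflow, job, and runs-on names here are placeholders, not our actual pipeline):

name: docker-repro
on: workflow_dispatch
jobs:
  pull:
    runs-on: build-s   # placeholder for the runner scale set name
    steps:
      # any docker CLI usage talks to the dind daemon and can fail with
      # "Cannot connect to the Docker daemon"
      - run: docker pull alpine:3.20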

Describe the bug

The pipeline step complains that Docker is not running:

[screenshot: timestamped "Cannot connect to the Docker daemon" error]

Describe the expected behavior

One of these two things should happen:

  1. The job runs without issues.
  2. The runner is automatically killed by the DISABLE_WAIT_FOR_DOCKER check (see the sketch after this list).
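
For reference, what we expect DISABLE_WAIT_FOR_DOCKER=false to give us is a fail-fast wait before the runner picks up jobs. A minimal sketch of those semantics, written as an explicit wrapper around run.sh (an illustration only, not the image's actual entrypoint):

containers:
  - name: runner
    command: ["/bin/bash", "-c"]
    args:
      - |
        # Fail fast if the Docker daemon is unreachable within the window,
        # instead of letting a job start and then hit
        # "Cannot connect to the Docker daemon".
        timeout "${WAIT_FOR_DOCKER_SECONDS:-180}" bash -c \
          'until docker info >/dev/null 2>&1; do sleep 1; done' || exit 1
        exec /home/runner/run.sh

With restartPolicy: Never, a runner that exits here would simply be replaced by the scale set instead of taking a job onto a broken daemon.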

Additional Context

apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: gha-runner-scale-set-${name}
  namespace: flux-system
  labels:
    app.kubernetes.io/component: runner-scale-set
spec:
  targetNamespace: ${namespace}
  releaseName: ${name}
  chart:
    spec:
      chart: gha-runner-scale-set
      version: ${arc_version:=0.9.3}
      sourceRef:
        kind: HelmRepository
        name: gha-runner-scale-set
        namespace: flux-system
  interval: 30m
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  values:
    minRunners: ${min_runners:=1}
    maxRunners: ${max_runners:=2}
    githubConfigUrl: https://github.com/factorialco
    githubConfigSecret: actions-runner-secrets
    runnerGroup: ${runner_group}
    containerMode:
      type: dind
    listenerTemplate:
      metadata:
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "8080"
          prometheus.io/path: "/metrics"
      spec:
        containers:
          - name: listener
    template:
      spec:
        shareProcessNamespace: true
        releaseName: runner-scale-set-${name}
        restartPolicy: Never
        initContainers:
          - name: clone-factorial-repository
            image: ${image:=mirror.gcr.io/factorialdx/actions-runner:2.320.0-runner-setv2}
            volumeMounts:
              - mountPath: /home/runner/_work
                name: work
              - mountPath: /home/runner/cache
                name: cache
              - mountPath: /scripts
                name: clone-factorial-repository-script
            command: ["/scripts/clone-factorial-repository.sh"]
            envFrom:
              - secretRef:
                  name: actions-runner-secrets
        containers:
          - name: runner
            securityContext:
              privileged: true
            imagePullPolicy: IfNotPresent
            image: ${image:=mirror.gcr.io/factorialdx/actions-runner:2.320.0-runner-setv2}
            command: ["/home/runner/run.sh"]
            env:
              - name: STARTUP_DELAY_IN_SECONDS
                value: "10"
              - name: DISABLE_WAIT_FOR_DOCKER
                value: "false"
              - name: DOCKER_ENABLED
                value: "true"
              - name: WAIT_FOR_DOCKER_SECONDS
                value: "180"
            resources:
              requests:
                cpu: "1"
                memory: "8Gi"
              limits:
                memory: "16Gi"
            volumeMounts:
              - name: work
                mountPath: /home/runner/_work
              - mountPath: /tmp
                name: tmp
              - mountPath: /home/runner/cache
                name: cache
        volumes:
          - name: work
            emptyDir: {}
          - name: tmp
            emptyDir:
              medium: Memory
          - name: cache
            hostPath:
              path: /cache
          - name: clone-factorial-repository-script
            configMap:
              name: clone-factorial-repository
              defaultMode: 0777
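
For context on how Docker reaches these pods: containerMode: dind makes the chart inject a Docker-in-Docker sidecar next to the runner container, sharing the daemon's socket. A simplified sketch of the resulting pod shape (not the chart's exact rendered output, and details vary between chart versions):

containers:
  - name: runner
    env:
      # the runner reaches the sidecar's daemon through the shared socket
      - name: DOCKER_HOST
        value: unix:///var/run/docker.sock
  - name: dind
    image: docker:dind
    securityContext:
      privileged: true

With ~200 pods starting at once, every one of these sidecar daemons has to come up before its socket answers, which is roughly where we suspect the race lives.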

Controller Logs

Runner name to search for in the controller logs: build-s-99pst-runner-bnhsl
https://gist.github.com/snavarro-factorial/ee965f37114d0ac4589169012cc098a6

Runner Pod Logs

Logs exported in CSV format:
https://gist.github.com/snavarro-factorial/796fba24ba5c7f854d3b95f04b636021
Contributor commented:
Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

snavarro-factorial (Author) commented:

Added logs of the failed pod.
