ARC in Kubernetes mode issue with workflow nodes scheduling #3376
Replies: 3 comments 4 replies
-
One of the ideas to address this issue and get rid of PVs is to push required data from
-
I am also running into this same issue when using Kubernetes mode. There are some alternative options I thought of (some are not great, though):
If I had to choose an option, 3 would be ideal.
-
@DenisPalnitsky thanks for the nice description of the issue. I think most people solved this issue with ReadWriteMany volumes. Yet it's not optimal: ReadWriteMany is either slow or expensive. So actions/runner-container-hooks#160 makes a lot of sense.
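For reference, the ReadWriteMany workaround typically means backing the runner's work directory with a shared PVC that any node can mount, so the workflow pod no longer has to co-locate with the runner. This is a sketch, not ARC's shipped configuration: the claim name is hypothetical and the storage class depends on your cluster's CSI driver (e.g. an NFS or CephFS provisioner):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-work-dir        # hypothetical name
spec:
  accessModes:
    - ReadWriteMany            # mountable from any node, so no node-affinity trap
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client # assumption: an RWX-capable class exists in the cluster
```

With RWX storage the scheduler is free to place the workflow pod on a different node than the runner, which sidesteps the failure mode described below at the cost of slower or pricier storage.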
-
I'm testing ARC in Kubernetes mode, and I can't figure out how it could work reliably without failing jobs. Here's the problem I'm facing.
Input:
Imagine that we have small nodes that can each accommodate only three pods. In that scenario, if ARC needs to schedule two jobs, it will place "Runner Pod 1" and "Workflow Pod 1" on the first node. Then, to run the second job, it will place "Runner Pod 2" on the first node, exhausting its capacity. "Workflow Pod 2" therefore cannot be scheduled on Node1 (no resources left) and cannot be scheduled on Node2 because its PV is attached to Node1. This causes the job to fail.
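The deadlock above can be reproduced with a toy scheduler. This is only an illustration of the described scenario, assuming each node fits at most three pods and that a workflow pod must land on the same node as its runner's node-affine PV:

```python
# Toy reproduction of the scheduling deadlock described above.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    capacity: int = 3               # assumption: each node fits 3 pods
    pods: list = field(default_factory=list)

    def fits(self) -> bool:
        return len(self.pods) < self.capacity

def schedule(pod: str, nodes: list, required_node=None):
    """Place a pod on a node; honor PV node affinity when required_node is set."""
    candidates = [required_node] if required_node else nodes
    for node in candidates:
        if node is not None and node.fits():
            node.pods.append(pod)
            return node
    return None  # unschedulable

nodes = [Node("node1"), Node("node2")]

# Job 1: runner and workflow pod both land on node1 (the PV binds there).
n_runner1 = schedule("runner-1", nodes)
n_workflow1 = schedule("workflow-1", nodes, required_node=n_runner1)

# Job 2: runner-2 takes node1's last free slot...
n_runner2 = schedule("runner-2", nodes)

# ...so workflow-2 neither fits on node1 nor may move to node2 (PV affinity).
n_workflow2 = schedule("workflow-2", nodes, required_node=n_runner2)
print(n_workflow2)  # None -> the job fails
```

The point of the sketch is that node2 sits completely idle while workflow-2 is unschedulable, because the PV affinity pins it to the full node1.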
With larger nodes, the situation may get even worse: Kubernetes can pack multiple Runner Pods onto one node, leaving no capacity to schedule the corresponding Workflow Pods there.
The fundamental problem is that when a job is scheduled, Kubernetes needs to know in advance the resources that will be used by both pods (runner and workflow), but there is no way to tell the scheduler that ahead of time, because the second pod is created by the first one.
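A partial mitigation, which declares the workflow pod's footprint but does not solve the up-front reservation problem, is to give the workflow pod explicit resource requests through the runner container hook's pod template (ARC's Kubernetes mode reads it from the file pointed to by the `ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE` environment variable on the runner). The names and values below are illustrative assumptions:

```yaml
# Hypothetical ConfigMap holding a pod template that the container hook
# merges into the workflow pod spec; mount it into the runner pod and set
# ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE to the mounted file's path.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension        # hypothetical name
data:
  content: |
    spec:
      containers:
        - name: $job          # the hook matches the job container by this name
          resources:
            requests:
              cpu: "1"        # assumed values; size to your workloads
              memory: 1Gi
```

Even with requests declared, the scheduler only sees the workflow pod once the runner creates it, so the race for node capacity described above remains.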
Is there anything I'm missing that could address this issue?