Mastering Workload-Aware Scheduling in Kubernetes v1.36: A Step-by-Step Guide
Introduction
Kubernetes v1.36 introduces groundbreaking improvements for scheduling AI/ML and batch workloads, which require more than simple per-Pod decision-making. This release separates concerns between a static Workload template and a runtime PodGroup object, enabling atomic scheduling, topology awareness, and advanced preemption. In this guide, you will learn how to set up and use these new features to manage complex workloads efficiently.
What You Need
- A Kubernetes cluster running v1.36 (or later) with the scheduling.k8s.io/v1alpha2 API enabled.
- kubectl configured to access your cluster.
- Basic understanding of Kubernetes Pods, controllers (e.g., Job), and scheduling concepts.
- Optional: A working knowledge of Dynamic Resource Allocation (DRA) if using resource claims.
Step-by-Step Instructions
Step 1: Understand the API Separation
In v1.36, the Workload API acts only as a static template for Pod groups. The runtime state—such as scheduling policy and individual Pod conditions—moves to the new PodGroup API (scheduling.k8s.io/v1alpha2). This clean split improves scalability by allowing per-replica sharding of status updates and simplifies the scheduler’s logic. Familiarize yourself with these two resources before proceeding.
Step 2: Define a Workload Template
Create a YAML file for your Workload object. Inside the spec.podGroupTemplates section, define one or more templates—each representing a group of Pods that must be scheduled together. For example, a gang scheduling policy requires a minimum number of Pods (minCount) before the group becomes schedulable.
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        minCount: 4

The Workload does not contain runtime state; it only describes the desired group structure.
Step 3: Create the Workload Object
Apply the YAML to your cluster:
kubectl apply -f workload.yaml

This registers the template. A controller (such as the built-in Job controller or a custom one) will later generate PodGroup instances based on these templates.
Step 4: Let the Controller Generate PodGroup Instances
In v1.36, the Workload controller automatically stamps out runtime PodGroup objects from the templates you defined. Each PodGroup carries the effective scheduling policy and a reference to its parent template. It also includes a status.conditions array that mirrors the scheduling states of individual Pods. To inspect a PodGroup:
kubectl get podgroup -n some-ns

The scheduler now reads only the PodGroup, not the Workload, for faster decision-making.
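The PodGroup schema is alpha and subject to change, so treat the following as an illustrative sketch only: the field names (workloadRef, podGroupTemplateName) and condition types are assumptions derived from the description above, not the published v1alpha2 schema.

```yaml
# Illustrative sketch only: workloadRef, podGroupTemplateName, and the
# condition type/reason below are assumed names, not the published schema.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workload-workers-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload   # reference to the parent Workload
  podGroupTemplateName: workers   # template this group was stamped from
  schedulingPolicy:
    gang:
      minCount: 4                 # effective policy copied from the template
status:
  conditions:
  - type: PodGroupScheduled
    status: "False"
    reason: WaitingForMinCount
```

Compare the live object against this sketch with kubectl get podgroup -n some-ns -o yaml to see the actual field names in your cluster.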
Step 5: Configure the Scheduler for PodGroup Scheduling
Kube-scheduler in v1.36 includes a new PodGroup scheduling cycle. No additional configuration is required if you are using the default scheduler—it automatically recognizes PodGroups. However, if you have a custom scheduler, ensure it implements the podgroup scheduling plugin. This cycle enables atomic workload processing: all Pods in a group are considered together, which is essential for gang scheduling and batch jobs.
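For a custom scheduler, the plugin would typically be enabled through a KubeSchedulerConfiguration profile. The sketch below uses the real kubescheduler.config.k8s.io/v1 format, but the plugin name "PodGroup" is a hypothetical placeholder; check the v1.36 scheduler documentation for the actual registered name.

```yaml
# Sketch of a custom scheduler profile; the plugin name "PodGroup" is a
# hypothetical placeholder, not a confirmed registered plugin name.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-batch-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: PodGroup
```

Pods that set spec.schedulerName: my-batch-scheduler would then be handled by this profile.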
Step 6: Leverage Topology-Aware Scheduling and Preemption
v1.36 introduces initial support for topology-aware scheduling for PodGroups, which tries to place the entire group within a defined topology domain (e.g., same rack or GPU node). Also, workload-aware preemption improves fairness by considering PodGroups when preempting lower-priority Pods. To enable these, set appropriate topology keys and priority classes in your PodGroup templates or via the Workload’s spec.podGroupTemplates[].schedulingPolicy. For example:
schedulingPolicy:
  topology:
    topologyKey: kubernetes.io/hostname

Step 7: Use ResourceClaim for Dynamic Resource Allocation (DRA)
If your workload requires specialized hardware (e.g., GPUs, FPGAs), attach a ResourceClaim to the Workload or PodGroup. This unlocks Dynamic Resource Allocation, allowing Pods within a group to share resources efficiently. Define a ResourceClaim template under the Workload’s spec.podGroupTemplates[].resourceClaims or use an existing claim. Example:
resourceClaims:
- name: gpu-claim
  source:
    resourceClaimTemplateName: gpu-claim-template

Then reference the claim in the Pod template (usually via a container’s resources.claims). The scheduler will account for these resources when making group-level decisions.
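In a standard Pod spec, a claim declared under spec.resourceClaims is referenced by name from a container. A minimal sketch, assuming the flattened DRA pod fields from recent releases (the image and claim-template names are placeholders):

```yaml
# Minimal DRA usage sketch; image and claim-template names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: gpu-claim-template  # placeholder template name
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest     # placeholder image
    resources:
      claims:
      - name: gpu-claim                            # matches spec.resourceClaims[].name
```

The resources.claims entry ties the container to the pod-level claim, so the scheduler can account for the allocated device when placing the group.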
Step 8: Integrate with the Job Controller (First Phase)
v1.36 ships the first phase of integration between the Job controller and the new Workload/PodGroup API. When you create a Job using the standard batch/v1 API, the Job controller can automatically generate a Workload object with appropriate pod group templates. To use this, enable the JobWorkload feature gate (if not already on by default) and set the annotation scheduling.k8s.io/workload-name on the Job’s Pod template. The controller will then create a PodGroup per replica set or per index, depending on your Job configuration.
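Assuming the JobWorkload feature gate is enabled, a Job opting into Workload generation might look like the following. The annotation key is the one named above; the rest is a generic Indexed Job with a placeholder image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: some-ns
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    metadata:
      annotations:
        scheduling.k8s.io/workload-name: training-job-workload
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest  # placeholder image
```

With completionMode: Indexed, the controller can map PodGroups per index, matching the behavior described above.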
Tips for Success
- Start small: Test with a simple gang of 2-4 Pods before scaling to large batch jobs.
- Monitor PodGroup status: Use kubectl describe podgroup to see conditions like PodGroupScheduled or PodGroupFailed.
- Combine with priority classes: Set different priority levels for PodGroups to control preemption behavior.
- Use namespaces: Isolate Workload and PodGroup resources in separate namespaces to avoid conflicts.
- Keep security in mind: RBAC roles should allow creation of workloads and podgroups under scheduling.k8s.io.
- Performance considerations: Because PodGroup status shards updates per replica, large groups scale better than in v1.35. However, keep the number of templates manageable.
- Stay updated: Future releases will expand topology-aware features and Job integration—check the changelog regularly.
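The RBAC tip above can be sketched as a namespaced Role; the role name is a placeholder, and you should scope it down further for production (for example, with resourceNames or read-only verbs where possible):

```yaml
# Minimal RBAC sketch for managing Workload and PodGroup objects;
# the role name is a placeholder.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workload-manager
  namespace: some-ns
rules:
- apiGroups: ["scheduling.k8s.io"]
  resources: ["workloads", "podgroups"]
  verbs: ["create", "get", "list", "watch"]
```

Bind it to the controller's ServiceAccount with a matching RoleBinding in the same namespace.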