Cluster Autoscaler for Azure

Configure automatic node scaling in Azure with Talos Linux and VMSS.

This guide explains how to configure cluster-autoscaler to automatically scale Talos Linux worker nodes running in an Azure Virtual Machine Scale Set (VMSS).

Prerequisites

  • Azure subscription and a Service Principal with the Contributor role
  • az CLI installed
  • Existing Talos Kubernetes cluster with a Kilo WireGuard mesh
  • A base Talos worker machine config (adapted for Azure in Step 3)

Step 1: Create Azure Infrastructure

1.1 Login with Service Principal

az login --service-principal \
  --username "<APP_ID>" \
  --password "<PASSWORD>" \
  --tenant "<TENANT_ID>"
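
If the Service Principal has access to more than one subscription, pin the one this guide should operate on:

az account set --subscription "<SUBSCRIPTION_ID>"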

1.2 Create Resource Group

az group create \
  --name <resource-group> \
  --location <location>

1.3 Create VNet and Subnet

az network vnet create \
  --resource-group <resource-group> \
  --name cozystack-vnet \
  --address-prefix 10.2.0.0/16 \
  --subnet-name workers \
  --subnet-prefix 10.2.0.0/24 \
  --location <location>

1.4 Create Network Security Group

az network nsg create \
  --resource-group <resource-group> \
  --name cozystack-nsg \
  --location <location>

# Allow WireGuard
az network nsg rule create \
  --resource-group <resource-group> \
  --nsg-name cozystack-nsg \
  --name AllowWireGuard \
  --priority 100 \
  --direction Inbound \
  --access Allow \
  --protocol Udp \
  --destination-port-ranges 51820

# Allow Talos API
az network nsg rule create \
  --resource-group <resource-group> \
  --nsg-name cozystack-nsg \
  --name AllowTalosAPI \
  --priority 110 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 50000

# Associate NSG with subnet
az network vnet subnet update \
  --resource-group <resource-group> \
  --vnet-name cozystack-vnet \
  --name workers \
  --network-security-group cozystack-nsg
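
As a quick check, list the rules to confirm both were created and the association took effect:

az network nsg rule list \
  --resource-group <resource-group> \
  --nsg-name cozystack-nsg \
  --output table

No explicit outbound rules are needed: Azure's default AllowInternetOutBound rule already permits outbound traffic, e.g. to the control plane endpoint on TCP 6443.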

Step 2: Create Talos Image

2.1 Generate Schematic ID

Create a schematic at factory.talos.dev with the required system extensions:

curl -s -X POST https://factory.talos.dev/schematics \
  -H "Content-Type: application/json" \
  -d '{
    "customization": {
      "systemExtensions": {
        "officialExtensions": [
          "siderolabs/amd-ucode",
          "siderolabs/amdgpu-firmware",
          "siderolabs/bnx2-bnx2x",
          "siderolabs/drbd",
          "siderolabs/i915-ucode",
          "siderolabs/intel-ice-firmware",
          "siderolabs/intel-ucode",
          "siderolabs/qlogic-firmware",
          "siderolabs/zfs"
        ]
      }
    }
  }'

Save the returned id as SCHEMATIC_ID.
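
If jq is available, the ID can be captured in one step. A minimal sketch, assuming the JSON body above is saved as schematic.json:

# POST the schematic and extract the "id" field from the response
SCHEMATIC_ID=$(curl -s -X POST https://factory.talos.dev/schematics \
  -H "Content-Type: application/json" \
  -d @schematic.json | jq -r '.id')
echo "$SCHEMATIC_ID"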

2.2 Create Managed Image from VHD

# Download Talos Azure image
curl -L -o azure-amd64.raw.xz \
  "https://factory.talos.dev/image/${SCHEMATIC_ID}/<talos-version>/azure-amd64.raw.xz"

# Decompress
xz -d azure-amd64.raw.xz

# Convert to VHD
qemu-img convert -f raw -o subformat=fixed,force_size -O vpc \
  azure-amd64.raw azure-amd64.vhd

# Get VHD size
VHD_SIZE=$(stat -f%z azure-amd64.vhd)  # macOS
# VHD_SIZE=$(stat -c%s azure-amd64.vhd)  # Linux

# Create managed disk for upload
az disk create \
  --resource-group <resource-group> \
  --name talos-<talos-version> \
  --location <location> \
  --upload-type Upload \
  --upload-size-bytes $VHD_SIZE \
  --sku Standard_LRS \
  --os-type Linux \
  --hyper-v-generation V2

# Get SAS URL for upload
SAS_URL=$(az disk grant-access \
  --resource-group <resource-group> \
  --name talos-<talos-version> \
  --access-level Write \
  --duration-in-seconds 3600 \
  --query accessSAS --output tsv)

# Upload VHD
azcopy copy azure-amd64.vhd "$SAS_URL" --blob-type PageBlob

# Revoke access
az disk revoke-access \
  --resource-group <resource-group> \
  --name talos-<talos-version>

# Create managed image from disk
az image create \
  --resource-group <resource-group> \
  --name talos-<talos-version> \
  --location <location> \
  --os-type Linux \
  --hyper-v-generation V2 \
  --source $(az disk show --resource-group <resource-group> \
    --name talos-<talos-version> --query id --output tsv)

Step 3: Create Talos Machine Config for Azure

Create a machine config similar to the Hetzner one, with these Azure-specific changes:

machine:
  # Kilo annotations for WireGuard mesh (applied automatically on join)
  nodeAnnotations:
    kilo.squat.ai/location: azure
    kilo.squat.ai/persistent-keepalive: "20"
  nodeLabels:
    topology.kubernetes.io/zone: azure
  kubelet:
    nodeIP:
      validSubnets:
        - 10.2.0.0/24                  # Azure VNet subnet
    # Required for external cloud provider (ProviderID assignment)
    extraArgs:
      cloud-provider: external

All other settings (cluster tokens, control plane endpoint, extensions, etc.) remain the same as the Hetzner config.
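
Before attaching the config to the scale set, it is worth validating it locally. A quick check, assuming the file is saved as machineconfig-azure.yaml (the name used with --custom-data in Step 4):

# Validate the machine config for a cloud platform
talosctl validate \
  --config machineconfig-azure.yaml \
  --mode cloud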

Step 4: Create VMSS (Virtual Machine Scale Set)

IMAGE_ID=$(az image show \
  --resource-group <resource-group> \
  --name talos-<talos-version> \
  --query id --output tsv)

az vmss create \
  --resource-group <resource-group> \
  --name workers \
  --location <location> \
  --orchestration-mode Uniform \
  --image "$IMAGE_ID" \
  --vm-sku Standard_D2s_v3 \
  --instance-count 0 \
  --vnet-name cozystack-vnet \
  --subnet workers \
  --public-ip-per-vm \
  --custom-data machineconfig-azure.yaml \
  --security-type Standard \
  --admin-username talos \
  --authentication-type ssh \
  --generate-ssh-keys \
  --upgrade-policy-mode Manual
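
To confirm the scale set came up as intended (correct SKU, zero instances, manual upgrade policy), a quick query:

az vmss show \
  --resource-group <resource-group> \
  --name workers \
  --query "{sku:sku.name, capacity:sku.capacity, upgradePolicy:upgradePolicy.mode, state:provisioningState}" \
  --output json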

Step 5: Deploy Cluster Autoscaler

Create the Package resource:

apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: cozystack.cluster-autoscaler-azure
spec:
  variant: default
  components:
    cluster-autoscaler-azure:
      values:
        cluster-autoscaler:
          azureClientID: "<APP_ID>"
          azureClientSecret: "<PASSWORD>"
          azureTenantID: "<TENANT_ID>"
          azureSubscriptionID: "<SUBSCRIPTION_ID>"
          azureResourceGroup: "<RESOURCE_GROUP>"
          azureVMType: "vmss"
          autoscalingGroups:
            - name: workers
              minSize: 0
              maxSize: 10

Apply:

kubectl apply -f package.yaml
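
The namespace the autoscaler lands in depends on how the package is installed, so a namespace-agnostic way to confirm it is up and running:

# Locate the cluster-autoscaler pod
kubectl get pods --all-namespaces | grep cluster-autoscaler

# Tail its logs; it should discover the "workers" VMSS as a node group
kubectl logs -n <namespace> <cluster-autoscaler-pod> --tail=50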

Step 6: Kilo WireGuard Connectivity

Azure nodes are behind NAT, so their initial WireGuard endpoint will be a private IP. Kilo handles this automatically through WireGuard’s built-in NAT traversal when persistent-keepalive is configured (already included in the machine config from Step 3).

The flow works as follows:

  1. The Azure node initiates a WireGuard handshake to the on-premises leader (which has a public IP)
  2. persistent-keepalive sends periodic keepalive packets, maintaining the NAT mapping
  3. The on-premises Kilo leader discovers the real public endpoint of the Azure node through WireGuard
  4. Kilo stores the discovered endpoint and uses it for subsequent connections
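
To observe this from the Azure side, you can inspect the WireGuard interface inside the Kilo pod on that node. A sketch, assuming the Kilo image ships wireguard-tools and uses the default kilo0 interface name:

kubectl exec -n cozy-kilo <kilo-pod-on-azure-node> -- wg show kilo0

A healthy peer shows a recent "latest handshake" and "persistent keepalive: every 20 seconds".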

Testing

Manual scale test

# Scale up
az vmss scale --resource-group <resource-group> --name workers --new-capacity 1

# Check node joined
kubectl get nodes -o wide

# Check WireGuard tunnel
kubectl logs -n cozy-kilo <kilo-pod-on-azure-node>

# Scale down
az vmss scale --resource-group <resource-group> --name workers --new-capacity 0

Autoscaler test

Deploy a workload to trigger autoscaling:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-azure-autoscale
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-azure
  template:
    metadata:
      labels:
        app: test-azure
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: azure
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
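
Once the Deployment is applied, the pending pods should trigger a scale-up within a minute or two. Things worth watching (the status ConfigMap is named cluster-autoscaler-status and lives in kube-system by default; the namespace may differ depending on chart values):

# Watch for a new Azure node to join
kubectl get nodes -w

# Pending pods with the azure nodeSelector are the scale-up trigger
kubectl get pods -l app=test-azure -o wide

# Cluster-autoscaler publishes its view of node groups in a status ConfigMap
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml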

Troubleshooting

Node doesn’t join cluster

  • Check that the Talos machine config control plane endpoint is reachable from Azure
  • Verify NSG rules allow outbound traffic to port 6443
  • Check VMSS instance provisioning state: az vmss list-instances --resource-group <resource-group> --name workers (see below for inspecting a stuck instance)
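
If an instance is stuck, one way to get eyes on it, assuming the per-instance public IPs are reachable and your talosconfig carries the cluster credentials:

# Get per-instance public IPs (enabled by --public-ip-per-vm in Step 4)
az vmss list-instance-public-ips \
  --resource-group <resource-group> \
  --name workers \
  --query "[].ipAddress" --output tsv

# Inspect the node's kernel log over the Talos API (TCP 50000, opened in Step 1)
talosctl -n <instance-public-ip> dmesg | tail -n 50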

WireGuard tunnel not established

  • Verify the node has the kilo.squat.ai/persistent-keepalive: "20" annotation
  • Verify the node has the kilo.squat.ai/location: azure annotation (not just the label)
  • Check NSG allows inbound UDP 51820
  • Inspect kilo logs: kubectl logs -n cozy-kilo <kilo-pod>
  • Check for “WireGuard configurations are different” messages repeating every 30 seconds; this indicates the persistent-keepalive annotation is missing

VM quota errors

  • Check quota: az vm list-usage --location <location>
  • Request quota increase via Azure portal
  • Try a different VM family that has available quota

SkuNotAvailable errors

  • Some VM sizes may have capacity restrictions in certain regions
  • List what is actually available: az vm list-skus --location <location> --size <prefix>, then pick an unrestricted size (see the sketch below)
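
A sketch for surfacing restricted SKUs in a family directly, so you can pick one without trial and error (a JMESPath filter over the restrictions list returned by az vm list-skus):

az vm list-skus \
  --location <location> \
  --size Standard_D \
  --all \
  --query '[?length(restrictions) > `0`].{name:name, reason:restrictions[0].reasonCode}' \
  --output table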