Managing Clusters

This guide covers common management tasks applicable to both Cloud Init and Talos Kubernetes clusters on Xelon HQ, including node operations, access configuration, monitoring, and production best practices.

Node Pool Operations

Resizing a Node Pool

To change the compute resources (CPU, RAM, disk) for nodes in a pool, navigate to the cluster details page, select the target node pool, and click Edit. Adjust the resource values and confirm. Nodes are updated via a rolling process to minimize downtime.

Adding a Node Pool

Click Add Node Pool from the cluster details page to create a new pool with different resource specifications. This is useful for running workloads with distinct resource requirements on the same cluster.

Removing a Node Pool

Select the node pool to remove and click Delete. Workloads running on nodes in the pool are evicted and rescheduled to other available nodes before the pool is removed.

Capacity Planning

Before removing a node pool, verify that remaining pools have sufficient capacity to absorb the rescheduled workloads. Otherwise, pods may remain in a pending state.
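
The capacity check described above can be approximated from the command line. The following is a minimal sketch, not a Xelon HQ feature: the helper names node_allocations and pending_pods are hypothetical, while the kubectl subcommands they wrap are standard.

```shell
# Hypothetical helper names; the kubectl subcommands are standard.

# Show the CPU and memory already requested on each node.
# "Allocated resources" is a section of `kubectl describe nodes` output.
node_allocations() {
  kubectl describe nodes | grep -A 8 "Allocated resources"
}

# List pods that the scheduler could not place on any node.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}
```

Run node_allocations before deleting the pool to see how much headroom the remaining nodes have, and pending_pods afterwards to confirm that nothing is stuck waiting for capacity.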

Recycling Nodes

Cloud Init clusters only

Recycling replaces an existing node with a new one of the same specification. This is useful when a node is in a degraded state or you want to apply underlying infrastructure changes. To recycle a node:

  1. Navigate to the node pool in the cluster details view.
  2. Click Recycle next to the node you want to replace.
  3. The node is cordoned and drained. Workloads are rescheduled to other nodes.
  4. The old node is deleted and a new node is provisioned in its place.
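
To follow a recycle from the command line, it can help to see which pods will be moved and to wait until all nodes report Ready again. This is a sketch under assumptions: the helper names are hypothetical, and the replacement node may come up under a different name than the old one, which is why the second helper waits on all nodes rather than a specific one.

```shell
# Hypothetical helper: list the pods currently scheduled on one node,
# i.e. the workloads that will be rescheduled when the node is drained.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide \
    --field-selector spec.nodeName="$1"
}

# Hypothetical helper: block until every node reports the Ready
# condition, or give up after 10 minutes.
wait_nodes_ready() {
  kubectl wait --for=condition=Ready nodes --all --timeout=600s
}
```

For example, run pods_on_node with the node name before clicking Recycle, then wait_nodes_ready once the new node has been provisioned.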

Kubeconfig

Downloading the Kubeconfig

The kubeconfig file provides the credentials and endpoint information needed to connect to your cluster using kubectl. Download it from the cluster details page by clicking Download config.

Using the Kubeconfig

Set the KUBECONFIG environment variable to point to the downloaded file:

# Point kubectl to your cluster
export KUBECONFIG=~/Downloads/my-cluster-kubeconfig.yaml

# Verify cluster access
kubectl cluster-info

# List nodes
kubectl get nodes -o wide

Alternatively, merge the kubeconfig into your default configuration:

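# Back up the current default kubeconfig before merging
cp ~/.kube/config ~/.kube/config.bak

# Merge into default kubeconfig
export KUBECONFIG=~/.kube/config:~/Downloads/my-cluster-kubeconfig.yaml
kubectl config view --merge --flatten > ~/.kube/config.merged
mv ~/.kube/config.merged ~/.kube/config
chmod 600 ~/.kube/config

# List the available contexts, then switch (the context name in your file may differ)
kubectl config get-contexts
kubectl config use-context my-cluster

Security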

Treat kubeconfig files as sensitive credentials. Do not commit them to version control or share them over insecure channels.
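
One simple safeguard is to store the downloaded file with owner-only permissions. The sketch below assumes a Unix-like system with the standard install utility; the helper name store_kubeconfig and the destination path in the usage note are illustrative.

```shell
# Hypothetical helper: copy a downloaded kubeconfig to a destination path
# with mode 600, so other local users cannot read the credentials.
store_kubeconfig() {
  src="$1"
  dest="$2"
  mkdir -p "$(dirname "$dest")"
  install -m 600 "$src" "$dest"
}
```

For example, store_kubeconfig ~/Downloads/my-cluster-kubeconfig.yaml ~/.kube/my-cluster.yaml places a protected copy under ~/.kube, after which the original in Downloads can be deleted.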

Service Accounts

Xelon HQ provides a built-in interface for managing Kubernetes service accounts directly from the cluster details page. Click Add Service to create a new service account with a specified name, namespace, and permission level (All permissions or Read-only). You can download the kubeconfig or copy the token for each service account.

For more advanced RBAC configuration, use kubectl. Start by creating a namespace and a dedicated service account, to which custom roles and role bindings can then be attached:

# Create a namespace for your application
kubectl create namespace my-app

# Create a service account
kubectl create serviceaccount my-app-sa -n my-app

Use Kubernetes RBAC (Role-Based Access Control) to assign least-privilege permissions to each service account. This limits the blast radius if a credential is compromised.
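
As an illustration of least privilege, the sketch below grants the my-app-sa service account read-only access to pods in the my-app namespace and then checks the result. The role name pod-reader, the binding name my-app-sa-pod-reader, and the helper names are hypothetical; the choice of pods as the only resource is an assumption made for the example.

```shell
# Hypothetical helper: create a namespaced Role that can only read pods,
# then bind it to the my-app-sa service account.
grant_pod_reader() {
  kubectl create role pod-reader \
    --verb=get,list,watch --resource=pods -n my-app
  kubectl create rolebinding my-app-sa-pod-reader \
    --role=pod-reader --serviceaccount=my-app:my-app-sa -n my-app
}

# Hypothetical helper: verify the permission by impersonating the
# service account. kubectl prints "yes" or "no".
check_pod_reader() {
  kubectl auth can-i list pods -n my-app \
    --as=system:serviceaccount:my-app:my-app-sa
}
```

A Role and RoleBinding apply only within their namespace, so the service account gains no access elsewhere in the cluster.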

Monitoring Cluster Health

Xelon HQ provides cluster health indicators on the cluster details page. Key metrics to watch include:

Metric              Description                                        Action Threshold
------------------  -------------------------------------------------  -----------------------------------
Node Status         Ready, NotReady, or Unknown status for each node.  Any node in NotReady for >5 minutes
CPU Utilization     Aggregate CPU usage across all nodes.              Sustained >80% usage
Memory Utilization  Aggregate memory usage across all nodes.           Sustained >85% usage
Disk Usage          Storage consumption per node.                      >90% capacity
Pod Count           Running and pending pods in the cluster.           Pending pods for >5 minutes
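
Some of these thresholds can also be spot-checked with kubectl. The sketch below is an approximation, not how Xelon HQ computes its indicators: the helper names are hypothetical, kubectl top nodes assumes a metrics source such as metrics-server is installed in the cluster, and the check is per node and instantaneous rather than aggregate and sustained.

```shell
# Hypothetical helper: print nodes whose STATUS is anything other than
# exactly "Ready". Cordoned nodes ("Ready,SchedulingDisabled") are
# listed as well.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}

# Hypothetical helper: print nodes above 80% CPU or 85% memory.
# Columns of `kubectl top nodes`: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
busy_nodes() {
  kubectl top nodes --no-headers |
    awk '{ cpu = $3; mem = $5; sub("%", "", cpu); sub("%", "", mem)
           if (cpu + 0 > 80 || mem + 0 > 85) print $1, $3, $5 }'
}
```

Because these are point-in-time readings, a single hit does not by itself mean a threshold in the table has been crossed; repeat the check over several minutes before acting.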

Best Practices for Production Clusters

Production Checklist

Follow these recommendations when running production workloads on Xelon HQ Kubernetes clusters.

  • High availability: Run at least 3 control plane nodes (Talos clusters) and distribute worker nodes across multiple node pools.
  • Resource requests and limits: Define CPU and memory requests and limits for all pods to ensure fair scheduling and prevent resource contention.
  • Namespaces: Isolate workloads into namespaces by team, environment, or application.
  • RBAC: Use Kubernetes RBAC to enforce least-privilege access. Avoid using cluster-admin for application workloads.
  • Network policies: Implement Kubernetes network policies to control traffic between pods and namespaces.
  • Regular upgrades: Keep your Kubernetes version current to receive security patches and feature improvements.
  • Backups: Back up critical workload configurations and persistent data. Use a backup solution such as Velero for cluster-level backups.
  • Monitoring and alerting: Deploy a monitoring stack (Prometheus, Grafana) or integrate with your existing observability platform.
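
Two of the checklist items, resource requests and limits and network policies, can be sketched with kubectl as follows. The helper names, the Deployment and namespace arguments, and the CPU and memory values are placeholders. Network policies only take effect if the cluster's network plugin enforces them, which this guide does not state for Xelon HQ clusters, so treat that part as an assumption to verify.

```shell
# Hypothetical helper: set requests and limits on an existing Deployment.
# Arguments: deployment name, namespace. Values are placeholders; size
# them from observed usage.
set_app_resources() {
  kubectl set resources deployment "$1" -n "$2" \
    --requests=cpu=100m,memory=128Mi \
    --limits=cpu=500m,memory=256Mi
}

# Hypothetical helper: deny all ingress traffic to pods in a namespace
# unless another NetworkPolicy explicitly allows it.
deny_ingress_by_default() {
  kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
}
```

For example, set_app_resources my-app my-app followed by deny_ingress_by_default my-app gives a namespace a baseline, after which more specific allow policies can be added per workload.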