---
1) Keep Kubernetes versions up to date
What will happen if you don’t: Security vulnerabilities, incompatibilities, unsupported behavior and lack of bug fixes.
How: Follow Kubernetes upgrade policy; test upgrades in staging; upgrade control plane first, then worker nodes.
Key commands:
kubectl version   # (--short is deprecated in newer kubectl releases)
kubeadm upgrade plan
kubeadm upgrade apply v1.27.6
Example: (the control-plane upgrade is CLI-driven, no YAML; a worker-node upgrade sketch follows)
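A minimal sketch of the per-worker-node upgrade flow with kubeadm; the node name, package manager, and version strings are illustrative and depend on your distro and package repository:
# On the admin machine: move workloads off the node
kubectl drain node01 --ignore-daemonsets --delete-emptydir-data
# On node01: upgrade kubeadm, apply the node upgrade, then update kubelet/kubectl
sudo apt-get install -y kubeadm='1.27.6-*'
sudo kubeadm upgrade node
sudo apt-get install -y kubelet='1.27.6-*' kubectl='1.27.6-*'
sudo systemctl restart kubelet
# Back on the admin machine: return the node to service
kubectl uncordon node01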
---
2) Use multiple master/control-plane nodes for HA
What will happen if you don’t: Single control-plane node becomes a single point of failure; cluster control operations may stop.
How: Deploy at least 3 control-plane nodes with etcd quorum; use external load balancer in front of API servers.
Key commands: (control plane bootstrap via kubeadm / cloud provider)
# Example: check control plane endpoints
kubectl get endpoints kubernetes -n default
YAML/snippet: (LB config is infra-specific; example kubeadm init with control-plane endpoint)
kubeadm init --control-plane-endpoint "api.mycluster.example:6443" --upload-certs
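Additional control-plane nodes are then joined with a command of this shape (the token, CA cert hash, and certificate key are placeholders printed by kubeadm init --upload-certs):
kubeadm join api.mycluster.example:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <certificate-key>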
---
3) Label and taint nodes for workload segregation
What will happen if you don’t: Critical pods may co-locate with noisy or untrusted workloads; scheduling may place wrong apps on wrong hardware.
How: Use kubectl label and kubectl taint to dedicate nodes (e.g., GPU, high-memory).
Key commands:
kubectl label node node01 node-role.kubernetes.io/highmem=true
kubectl taint nodes node01 dedicated=highmem:NoSchedule
YAML (Pod using nodeSelector / toleration):
spec:
  nodeSelector:
    node-role.kubernetes.io/highmem: "true"
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "highmem"
    effect: "NoSchedule"
---
4) Use Cluster Autoscaler (node autoscaling)
What will happen if you don’t: Under-provisioning during spikes causes pending pods; over-provisioning wastes cost.
How: Install Cluster Autoscaler configured per cloud provider; tune scale-up/down policies and node groups.
Key commands:
# check CA deployment
kubectl get deployment cluster-autoscaler -n kube-system
kubectl logs -f deploy/cluster-autoscaler -n kube-system
YAML (typical Deployment args excerpt):
spec:
  containers:
  - name: cluster-autoscaler
    args:
    - --cloud-provider=aws
    - --nodes=1:10:node-group-name
---
5) Reserve system resources on nodes (system-reserved/kube-reserved)
What will happen if you don’t: Kubelet and system daemons can be starved of CPU/memory causing node instability.
How: Configure kubelet flags or kubelet config to reserve CPU/memory for system and kube components.
Key commands: (edit kubelet config or systemd args, then restart kubelet)
# example check
kubectl describe node node01 | grep -A6 -E "Capacity|Allocatable"
kubelet config snippet:
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
systemReserved:
  cpu: "500m"
  memory: "512Mi"
kubeReserved:
  cpu: "500m"
  memory: "512Mi"
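To confirm the reservation took effect, compare the node's capacity with its allocatable resources (allocatable is roughly capacity minus system-reserved, kube-reserved, and eviction thresholds):
kubectl get node node01 -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'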
---
6) Monitor node health continuously
What will happen if you don’t: Node failures go unnoticed until apps fail; slow detection prolongs incidents.
How: Integrate Prometheus node-exporter and kube-state-metrics; alert on node CPU/memory saturation (e.g., node_cpu_seconds_total, node_memory_MemAvailable_bytes) and on kube_node_status_condition.
Key commands:
kubectl get nodes
kubectl describe node <node>
PromQL example alert:
kube_node_status_condition{condition="Ready",status="true"} == 0
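A minimal Prometheus alerting-rule sketch built around that expression (the group name, alert name, and timing are illustrative):
groups:
- name: node-health
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"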
---
7) Spread workloads across zones/regions
What will happen if you don’t: AZ failure brings down many pods; reduced resilience and higher blast radius.
How: Use topologySpreadConstraints and pod anti-affinity (examples below), plus multiple node pools across AZs.
Key commands: (inspect topology)
kubectl get nodes -o wide
kubectl get pods -o wide --field-selector=status.phase=Running
YAML (topologySpreadConstraints sample):
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp
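For the pod anti-affinity mentioned above, a sketch that keeps replicas of the same app off a single node (swap the topologyKey to topology.kubernetes.io/zone to spread across zones instead):
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname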
---
8) Avoid overcommitting node resources
What will happen if you don’t: Nodes become CPU/memory-saturated causing OOMKill, node pressure, eviction cascades.
How: Enforce resource requests/limits, and use ResourceQuotas & LimitRanges in namespaces (examples below).
Key commands:
kubectl get resourcequota -n <ns>
kubectl describe limitrange -n <ns>
YAML (limitrange example):
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "200m"
      memory: "256Mi"
    type: Container
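A companion ResourceQuota sketch for the same namespace (the hard limits are illustrative; size them to your capacity planning):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"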
---
9) Tune OS & kernel for container workloads
What will happen if you don’t: Suboptimal networking, CPU scheduling, and disk performance; noisy neighbor issues.
How: Tune kernel sysctls (e.g., net.ipv4.ip_forward=1, conntrack limits), use the overlay2/overlayfs storage driver, size ephemeral storage appropriately, and raise file-descriptor and inotify limits.
Key commands: (example sysctl inspect/apply)
sysctl net.ipv4.ip_forward
sudo sysctl -w net.ipv4.ip_forward=1
Example (pod-level sysctls under spec.securityContext; note that most net.* sysctls are treated as unsafe and must be explicitly allowed by the kubelet via --allowed-unsafe-sysctls):
securityContext:
  sysctls:
  - name: net.ipv4.ip_forward
    value: "1"
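Node-level sysctls should also be persisted so they survive reboots; a common approach is a drop-in under /etc/sysctl.d (the file name and the inotify value are illustrative; the net.bridge.* setting assumes the br_netfilter module is loaded):
# /etc/sysctl.d/90-kubernetes.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
fs.inotify.max_user_watches = 524288
# apply without a reboot
sudo sysctl --system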
---
10) Apply OS-level security updates regularly
What will happen if you don’t: Nodes become vulnerable to exploits; container runtimes and kernel exploits risk cluster compromise.
How: Patch OS in a rolling manner (cordon → drain → update → uncordon), use immutable images for hosts or managed node pools.
Key commands:
kubectl cordon node01
kubectl drain node01 --ignore-daemonsets --delete-emptydir-data
# perform OS update on node
kubectl uncordon node01
YAML: (no YAML — this is an operational workflow; a shell sketch follows)
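A minimal shell sketch of the rolling patch loop (Debian/Ubuntu package commands and node names are illustrative; add health checks and pauses between nodes in production):
for node in node01 node02 node03; do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # the ssh command may exit when the reboot drops the connection
  ssh "$node" 'sudo apt-get update && sudo apt-get -y upgrade && sudo reboot'
  # wait for the node to come back and report Ready before continuing
  kubectl wait --for=condition=Ready "node/$node" --timeout=10m
  kubectl uncordon "$node"
done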
The scenarios below go deeper into common pod-failure situations, covering for each:
What will happen (impact & symptoms)
Why it happens (root cause)
How to fix it (step-by-step)
Commands to diagnose & resolve
YAML examples where applicable
---
Scenario 1 – Pods Stuck in CrashLoopBackOff
What will happen
Pod keeps restarting in a loop instead of running normally.
Status shows CrashLoopBackOff in kubectl get pods.
Application inside the container starts, fails quickly, and Kubernetes retries indefinitely (with backoff delay increasing each time).
Example:
$ kubectl get pods
NAME        READY   STATUS             RESTARTS   AGE
app-pod-1   0/1     CrashLoopBackOff   5          2m
---
Why it happens
Common causes:
1. Application code crashes (exception, missing file, bad config).
2. Wrong environment variables (DB host, credentials missing).
3. Port conflict (two processes binding same port).
4. Readiness/Liveness probes failing → Kubernetes kills and restarts container.
---
How to fix it
Step 1 – Check pod logs:
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # previous container run
Step 2 – Describe pod for events:
kubectl describe pod <pod-name> -n <namespace>
Look for:
Probe failures
ImagePullBackOff
OOMKilled
Step 3 – If it’s a config/env issue:
Update the ConfigMap or Secret (a sample ConfigMap is shown after the Deployment example below), then restart the rollout:
kubectl edit configmap <configmap-name> -n <namespace>
kubectl rollout restart deployment <deployment-name> -n <namespace>
Step 4 – If probe is too aggressive:
Relax initialDelaySeconds or timeoutSeconds.
---
YAML Example – Fixing a Liveness Probe Failure
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: sample-container
        image: myregistry.com/sample:1.0
        ports:
        - containerPort: 8080
        envFrom:
        - configMapRef:
            name: app-config
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          timeoutSeconds: 5
          failureThreshold: 5
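The app-config ConfigMap referenced by envFrom above might look like this (keys and values are illustrative for a typical DB-backed app):
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: demo
data:
  DB_HOST: "postgres.demo.svc.cluster.local"
  DB_PORT: "5432"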
---
✅ Quick Tip: If you just want to debug without probe restarts:
kubectl edit deployment sample-app -n demo
# Remove the livenessProbe temporarily
---
Scenario 2 – Pods Stuck in Pending
What will happen
Pods stay in Pending state, never starting containers.
Seen in:
$ kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
app-pod-2   0/1     Pending   0          5m
---
Why it happens
1. No matching nodes (NodeSelector, Affinity, Taints prevent scheduling).
2. Insufficient resources (CPU/memory requests too high).
3. Storage issues (PVC cannot be bound to a PV).
4. Cluster Autoscaler not scaling up nodes.
---
How to fix it
Step 1 – Describe pod:
kubectl describe pod <pod-name> -n <namespace>
Look for:
0/3 nodes are available: insufficient memory
0/3 nodes are available: node(s) didn't match node selector
persistentvolumeclaim is not bound
Step 2 – If resource request is too high:
kubectl edit deployment <deployment-name> -n <namespace>
# Reduce spec.template.spec.containers[].resources.requests
Step 3 – If PVC not bound:
Check:
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
Create matching PV if needed:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv1
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /data/pv1
Step 4 – If affinity/taint issue:
Remove or adjust affinity/taint rules in YAML.
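If a taint is blocking placement, either remove it from the node or tolerate it in the pod template; a sketch assuming the dedicated=highmem:NoSchedule taint used earlier:
# Option A: remove the taint from the node
kubectl taint nodes worker-1 dedicated=highmem:NoSchedule-
# Option B: add a matching toleration to the pod spec
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "highmem"
  effect: "NoSchedule"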
---
YAML Example – Adjusting NodeSelector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-1   # Remove if blocking scheduling
      containers:
      - name: sample-container
        image: myregistry.com/sample:1.0
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
---
The next two scenarios revisit CrashLoopBackOff from a production standpoint and add ImagePullBackOff / ErrImagePull, with:
What will happen (impact)
How it happens (root cause)
Troubleshooting & fix steps
Commands (for diagnosis and fix)
YAML examples (where applicable)
---
Scenario 3 – Pods Stuck in CrashLoopBackOff (Production View)
What Will Happen
Pod continuously restarts after failing to start successfully.
Application downtime until the issue is fixed.
CPU/memory usage spikes due to repeated container restarts.
In production, this may cause cascading failures if dependent services rely on this pod.
---
How It Happens
Application process exits with a non-zero status code.
Missing or incorrect environment variables.
Dependencies (DB, API) not reachable.
Readiness/liveness probes failing repeatedly.
ConfigMap/Secret values missing or wrong.
---
Troubleshooting Steps
1. Check pod status and events
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
2. Check logs of the container
kubectl logs <pod-name> -n <namespace> --previous
3. Verify configuration files and environment variables
kubectl exec -it <pod-name> -n <namespace> -- env
4. Check readiness/liveness probes
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 readinessProbe
5. Check dependent services
kubectl run tmp-shell --rm -it --image=busybox -- sh
# ping DB, API, etc.
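Inside that shell, a couple of quick checks (the service names and port are placeholders; nslookup and wget ship with the busybox image):
nslookup my-db.default.svc.cluster.local
wget -qO- http://my-api.default.svc.cluster.local:8080/health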
---
Example Fix YAML
If liveness/readiness probes are too strict:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
---
Production Tip
If issue persists but you need to stop restart loops temporarily:
kubectl scale deployment <deployment-name> --replicas=0 -n <namespace>
---
Scenario 4 – ImagePullBackOff / ErrImagePull
What Will Happen
Pod never starts because the container image cannot be pulled.
Affected application remains unavailable.
If multiple pods use the same image, a large part of the service might be down.
---
How It Happens
Wrong image name, tag, or registry URL.
Image deleted or tag changed in registry.
Private registry requires credentials but none provided.
Network/DNS issues blocking access to registry.
---
Troubleshooting Steps
1. Check events for detailed pull error
kubectl describe pod <pod-name> -n <namespace>
2. Verify image name and tag
kubectl get pod <pod-name> -n <namespace> -o yaml | grep image:
3. Test registry access from inside the cluster
kubectl run tmp-shell --rm -it --image=busybox -- sh
# Inside shell
wget <registry-url>
4. Check imagePullSecrets
kubectl get secrets -n <namespace>
kubectl describe secret <secret-name> -n <namespace>
---
Example Fix YAML
If using a private registry:
apiVersion: v1
kind: Pod
metadata:
  name: my-private-pod
spec:
  containers:
  - name: my-app
    image: myprivateregistry.com/app:1.0
  imagePullSecrets:
  - name: myregistry-secret
Create secret:
kubectl create secret docker-registry myregistry-secret \
--docker-server=myprivateregistry.com \
--docker-username=myuser \
--docker-password=mypass \
--docker-email=myemail@example.com \
-n <namespace>
---
Scenario 5 (Pending) and Scenario 6 (OOMKilled) continue in the same deep-dive style, expanding on the earlier Pending discussion with impact, causes, step-by-step troubleshooting, commands, YAML snippets, and quick production tips.
---
Scenario 5 – Pod Stuck in Pending
What will happen
Pod remains in Pending and never transitions to Running.
No containers start; the workload is unavailable.
Often a sign of scheduling/resource or storage problems.
How it happens (common root causes)
Insufficient cluster resources (CPU / memory) to satisfy requests.
NodeSelector / NodeAffinity / taints block scheduling.
PVC is not bound (no matching PV).
Pod topology constraints or quota limits preventing placement.
Cluster Autoscaler not configured or unable to scale.
Troubleshooting & Fix Steps
1. Describe the pod to see scheduler events
kubectl describe pod <pod-name> -n <ns>
# Look for messages like: "0/5 nodes are available: insufficient memory" or "node(s) didn't match node selector"
2. Check node capacity and available resources
kubectl get nodes -o wide
kubectl top nodes
kubectl describe node <node-name>
3. Check resource requests/limits of the pod
kubectl get pod <pod-name> -n <ns> -o yaml | yq '.spec.containers[].resources'
# or
kubectl describe pod <pod-name> -n <ns> | grep -A5 "Requests"
If requests too high → edit Deployment to lower requests.
4. Check node selectors / affinity / taints
kubectl get pod <pod-name> -n <ns> -o yaml | yq '.spec | {nodeSelector: .nodeSelector, affinity: .affinity, tolerations: .tolerations}'
kubectl get nodes --show-labels
kubectl describe node <node> | grep Taints -A2
Remove or relax overly strict selectors/affinities or add matching node labels.
5. If PVC is pending, inspect PVC/PV
kubectl get pvc -n <ns>
kubectl describe pvc <pvc-name> -n <ns>
kubectl get pv
Create a matching PV or adjust StorageClass.
6. If cluster autoscaler should add nodes, check CA logs
kubectl logs deploy/cluster-autoscaler -n kube-system
Adjust CA node-group min/max or node group configuration.
Commands to remediate (examples)
Reduce resource requests:
kubectl set resources deployment/<deploy> -n <ns> --requests=cpu=200m,memory=256Mi
Remove a nodeSelector (edit deployment):
kubectl edit deploy <deploy> -n <ns>
# remove spec.template.spec.nodeSelector section
Create a simple PV for PVC binding:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-small
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data/pv-small
kubectl apply -f pv-small.yaml
Quick Production Tip
Enforce default requests via LimitRange and use ResourceQuotas to prevent runaway requests that keep pods pending.
---
Scenario 6 – OOMKilled (Container Killed Due to Out Of Memory)
What will happen
Container process is killed by the kernel (OOM Killer).
Pod restarts; repeated OOMs lead to CrashLoopBackOff or degraded service.
Memory pressure can affect co-located pods and node stability.
How it happens (common root causes)
Container memory limit too low for the workload.
Memory leak in the application.
Bursty workload without proper resource provisioning.
No limits set → node exhaustion leading to multiple pod evictions.
Troubleshooting & Fix Steps
1. Describe the pod to confirm OOMKilled
kubectl describe pod <pod-name> -n <ns> | grep -i -A5 "State"
# Look for 'Reason: OOMKilled' in container status
2. Check container logs & previous logs
kubectl logs <pod-name> -n <ns>
kubectl logs <pod-name> -n <ns> --previous
3. Check resource usage
kubectl top pod <pod-name> -n <ns>
kubectl top node <node-name>
4. Inspect kubelet and syslogs on the node (if you have node access)
journalctl -u kubelet -n 200
dmesg | grep -i -E "oom|killed process"
5. If a memory leak is suspected: attach a profiler, capture a heap dump, or increase logging to track allocations over time.
Commands & Remediations
Increase memory limit (imperative)
kubectl set resources deployment/<deploy> -n <ns> --limits=memory=1Gi --requests=memory=512Mi
Edit deployment (declarative)
spec:
  containers:
  - name: app
    image: myapp:1.2
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "1"
Temporarily reduce load by scaling down replicas or rate-limiting traffic via Ingress:
kubectl scale deploy <deploy> --replicas=1 -n <ns>
If node OOM is observed: cordon & drain node, investigate other pods:
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
Quick Production Tip
Set both requests and limits. Requests ensure the scheduler places the pod properly; limits prevent a single pod from starving others. Use monitoring (Prometheus) to alert when a container's memory working set approaches its limit (e.g., above 80%), as sketched below.
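A PromQL sketch for that alert, assuming cAdvisor and kube-state-metrics metrics are scraped (the label matching may need adjusting to your exporter versions):
max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / on (namespace, pod, container)
  kube_pod_container_resource_limits{resource="memory"} > 0.8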
---