---
Kubernetes Production Support – 50 Best Practices (Interview Edition)
---
A. Cluster & Node Management
1. Keep Kubernetes version up to date
❌ Old versions miss security patches → vulnerable cluster.
✅ Upgrade with kubeadm upgrade, one minor version at a time, during planned maintenance windows.
kubectl version
kubeadm upgrade plan
2. Use multiple master nodes (HA)
❌ Single master = control plane outage if node fails.
✅ Deploy at least 3 control-plane (master) nodes so etcd keeps quorum.
(YAML: kubeadm config with stacked etcd)
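A minimal sketch of that config, assuming a load balancer in front of the API servers (endpoint, version, and hostnames are illustrative):
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0                    # illustrative
controlPlaneEndpoint: "lb.example.com:6443"   # LB fronting all control-plane nodes
etcd:
  local:                 # stacked etcd: one member per control-plane node
    dataDir: /var/lib/etcd
Additional control-plane nodes then join with kubeadm join ... --control-plane.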
3. Label & taint nodes
❌ Workloads run on wrong nodes → performance/security risk.
✅ Use labels for scheduling, taints to block unwanted pods.
kubectl label node node1 role=db
kubectl taint nodes node1 dedicated=db:NoSchedule
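For a pod to land on that dedicated node, its spec needs a matching toleration plus a nodeSelector — a sketch:
tolerations:
- key: dedicated
  operator: Equal
  value: db
  effect: NoSchedule
nodeSelector:
  role: db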
4. Enable Cluster Autoscaler
❌ Manual scaling → delays & outages under load.
✅ Deploy autoscaler with cloud provider integration.
kubectl apply -f cluster-autoscaler.yaml
5. Reserve system resources
❌ Kubelet starved → node unstable.
✅ Set systemReserved (and kubeReserved) in the kubelet config.
(kubelet config YAML)
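A sketch of that kubelet config (reservation sizes are illustrative; size them to the node):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:          # for OS daemons (sshd, journald, ...)
  cpu: 500m
  memory: 512Mi
kubeReserved:            # for kubelet and container runtime
  cpu: 500m
  memory: 512Mi
evictionHard:
  memory.available: "200Mi"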
6. Monitor node health
❌ Node failures unnoticed → pod downtime.
✅ Use kubectl get nodes + Prometheus alerts.
kubectl get nodes -o wide
7. Spread workloads across zones
❌ Zone outage takes all workloads down.
✅ Use topology spread constraints or node labels.
topologySpreadConstraints:
- maxSkew: 1
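The fragment above, expanded into a complete constraint (assuming pods labeled app: web):
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone   # spread across zones
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web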
8. Avoid overcommitting resources
❌ Pods evicted due to memory pressure.
✅ Monitor requests/limits ratio in Grafana.
kubectl top nodes
9. Tune OS/kernel for containers
❌ Network & disk latency issues.
✅ Enable cgroup v2 and tune sysctl params.
sysctl -w net.ipv4.ip_forward=1
10. Apply OS security updates
❌ Vulnerable kernel exploited.
✅ Automate patching with maintenance windows.
apt update && apt upgrade -y
---
B. Pod & Workload Management
11. Set resource requests/limits
❌ Pods hog resources → others throttled.
✅ Define CPU/memory in manifests.
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
12. Configure PodDisruptionBudgets
❌ All pods evicted during maintenance.
✅ Set minAvailable or maxUnavailable.
minAvailable: 2
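A full PodDisruptionBudget sketch, assuming pods labeled app: web:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # voluntary evictions never drop below 2 pods
  selector:
    matchLabels:
      app: web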
13. Readiness/Liveness probes
❌ Unhealthy pods still receive traffic.
✅ HTTP/TCP probes in manifest.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
14. Pod anti-affinity for critical apps
❌ Critical pods on same node → single point failure.
✅ Set requiredDuringSchedulingIgnoredDuringExecution.
podAntiAffinity: ...
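A sketch of the hard rule, assuming replicas labeled app: critical spread one-per-node:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: critical
      topologyKey: kubernetes.io/hostname   # no two such pods on one node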
15. Init containers for dependencies
❌ Main app starts before DB ready.
✅ Init container checks service availability.
initContainers: ...
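A sketch of such an init container (image, service name, and check are illustrative):
initContainers:
- name: wait-for-db
  image: busybox:1.36
  command: ['sh', '-c', 'until nslookup db; do echo waiting for db; sleep 2; done']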
16. Use correct controller type
❌ Stateful apps lose data with Deployments.
✅ Use StatefulSet for stateful workloads.
17. Lightweight, scanned images
❌ Large images slow deploys and may ship vulnerabilities.
✅ Use trivy/grype for scans.
18. No root containers
❌ Privilege escalation risk.
✅ securityContext.runAsNonRoot: true.
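A container-level securityContext sketch (the UID is illustrative; the extra flags are common hardening):
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true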
19. Use imagePullPolicy=IfNotPresent
❌ Unnecessary image pulls → deploy delays.
✅ Set in manifests.
20. Version-tag images
❌ The :latest tag causes inconsistent rollouts.
✅ Use semantic version tags.
---
C. Networking & Service Management
21. Right service type
❌ Exposing internal services publicly.
✅ ClusterIP internal, LoadBalancer/Ingress for external.
22. Secure Ingress with TLS
❌ Plaintext traffic vulnerable to sniffing.
✅ TLS cert in Ingress manifest.
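An Ingress TLS sketch, assuming the cert lives in a Secret named web-tls (host and names are illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  tls:
  - hosts:
    - app.example.com
    secretName: web-tls    # kubernetes.io/tls Secret holding cert + key
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80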
23. NetworkPolicies
❌ Pods can talk to everything.
✅ Allow only required traffic.
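A sketch allowing only frontend pods to reach db pods on 5432 (labels and port are illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 5432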
24. No public API server
❌ Cluster takeover risk.
✅ Restrict via firewall/security groups.
25. Stable DNS via CoreDNS monitoring
❌ Service resolution failures.
✅ Alerts on CoreDNS pod health.
26. Headless services for Stateful workloads
❌ Stateful pods fail to discover peers.
✅ clusterIP: None in Service.
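A headless Service sketch for a StatefulSet's peers (names and port are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None          # headless: DNS returns individual pod IPs
  selector:
    app: db
  ports:
  - port: 5432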
27. Connection timeouts/retries
❌ Hanging requests block clients.
✅ App-level configs + Istio retries.
28. externalTrafficPolicy=Local
❌ Client IP lost for logging.
✅ Set in Service manifest.
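A Service sketch (applies to NodePort/LoadBalancer types; names are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client IP; route only to node-local endpoints
  selector:
    app: web
  ports:
  - port: 80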
29. Limit public access
❌ Attackers exploit open services.
✅ Security groups + firewall rules.
30. Load-test before go-live
❌ Crashes under real traffic.
✅ Use k6/locust.
---
D. Observability & Troubleshooting
31. Prometheus + Grafana
❌ No performance visibility.
✅ Deploy kube-prometheus-stack.
32. Centralized logs (ELK/Loki)
❌ No log correlation during incidents.
✅ Fluentd/FluentBit collectors.
33. Enable audit logging
❌ No trace of API actions.
✅ API server --audit-log-path.
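A minimal audit policy plus the apiserver flags that wire it up (paths are illustrative):
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata        # record who did what, without request bodies
Then on kube-apiserver:
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log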
34. Alerts for restarts/resource issues
❌ Issues unnoticed until outage.
✅ Prometheus rules.
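A rule-file sketch for crash-looping pods, assuming kube-state-metrics is scraped (threshold is illustrative):
groups:
- name: kubernetes-pods
  rules:
  - alert: PodRestartingOften
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is restarting frequently"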
35. kubectl describe/logs
❌ Slow troubleshooting.
✅ Standard first step.
36. Runbooks
❌ Inconsistent incident handling.
✅ Confluence/Docs with steps.
37. kubectl top for bottlenecks
❌ Capacity issues unidentified.
✅ Resource tuning.
38. Distributed tracing
❌ Slow services hard to debug.
✅ Jaeger/OpenTelemetry.
39. Historical metrics
❌ No capacity planning data.
✅ Long-term storage in Thanos.
40. DR playbook testing
❌ Failover fails during disaster.
✅ Quarterly drills.
---
E. Security & Compliance
41. RBAC
❌ Users have excessive permissions.
✅ Role/RoleBinding per namespace.
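A namespaced read-only Role and its binding, as a sketch (namespace, names, and user are illustrative):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: team-a
  name: pod-reader-binding
subjects:
- kind: User
  name: dev-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io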
42. Namespaces for isolation
❌ Cross-application interference.
✅ One namespace per app/team.
43. Image scanning
❌ Deploying vulnerable images.
✅ trivy CI scan.
44. Secrets management
❌ Credentials exposed in plain text.
✅ Store credentials as Kubernetes Secrets (kubectl create secret), never inline in manifests.
45. Rotate secrets
❌ Stolen creds remain valid.
✅ Automate with Vault/KMS.
46. API auth & authorization
❌ Unauthorized cluster actions.
✅ Certs, tokens, OIDC.
47. Restrict kubectl exec
❌ Attackers run commands inside pods.
✅ Don't grant the pods/exec subresource in RBAC (PSP was removed in v1.25; use Pod Security admission instead).
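RBAC is allow-only, so exec is blocked by never granting it. This is the rule that would enable kubectl exec — omit it from production roles:
rules:
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]        # granting this permits kubectl exec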
48. CIS Benchmark
❌ Non-compliant cluster.
✅ kube-bench checks.
49. Admission controllers
❌ Bad manifests deployed.
✅ PodSecurity/ValidatingWebhook.
50. Periodic security audits
❌ Vulnerabilities stay unnoticed.
✅ kubescape scans.
---