Sunday, August 10, 2025

50 Best Practices in Kubernetes



---

Kubernetes Production Support – 50 Best Practices (Interview Edition)


---

A. Cluster & Node Management

1. Keep Kubernetes version up to date
❌ Old versions miss security patches → vulnerable cluster.
✅ Upgrade one minor version at a time with kubeadm upgrade: control plane first, then worker nodes in stages (drain, upgrade, uncordon).

kubectl version
kubeadm upgrade plan


2. Use multiple control plane nodes (HA)
❌ Single control plane node = control plane outage if it fails.
✅ Deploy at least 3 control plane nodes (an odd number, so etcd keeps quorum).
(YAML: kubeadm config with stacked etcd)
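
A minimal sketch of such a kubeadm config, assuming a load balancer already fronts the API servers. The endpoint name, version, and file name are placeholders, not values from this post.

```yaml
# kubeadm-config.yaml (hypothetical file name)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.30.0        # assumption: pin to the version you actually run
# Assumed DNS name of a load balancer in front of all API servers
controlPlaneEndpoint: "k8s-api.example.internal:6443"
etcd:
  local:                          # stacked etcd: one member per control plane node
    dataDir: /var/lib/etcd
```

The first node would typically be initialised with kubeadm init --config kubeadm-config.yaml --upload-certs, and the other control plane nodes joined with kubeadm join ... --control-plane.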


3. Label & taint nodes
❌ Workloads run on wrong nodes → performance/security risk.
✅ Use labels for scheduling, taints to block unwanted pods.

kubectl label node node1 role=db
kubectl taint nodes node1 dedicated=db:NoSchedule


4. Enable Cluster Autoscaler
❌ Manual scaling → delays & outages under load.
✅ Deploy autoscaler with cloud provider integration.

kubectl apply -f cluster-autoscaler.yaml


5. Reserve system resources
❌ Kubelet starved → node unstable.
✅ Set systemReserved and kubeReserved in the kubelet config (--system-reserved and --kube-reserved are the older flag equivalents).
(kubelet config YAML)
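
A minimal sketch of the kubelet config. The reservation sizes are illustrative assumptions and should be sized per node type.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:            # held back for OS daemons (sshd, journald, ...)
  cpu: 500m
  memory: 1Gi
kubeReserved:              # held back for kubelet and container runtime
  cpu: 500m
  memory: 1Gi
evictionHard:              # evict pods before the node itself runs out of memory
  memory.available: "500Mi"
```

Node allocatable then becomes capacity minus these reservations and the eviction threshold, which is what the scheduler uses.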


6. Monitor node health
❌ Node failures unnoticed → pod downtime.
✅ Use kubectl get nodes + Prometheus alerts.

kubectl get nodes -o wide


7. Spread workloads across zones
❌ Zone outage takes all workloads down.
✅ Use topology spread constraints or node labels.

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule


8. Avoid overcommitting resources
❌ Pods evicted due to memory pressure.
✅ Monitor requests/limits ratio in Grafana.

kubectl top nodes


9. Tune OS/kernel for containers
❌ Network & disk latency issues.
✅ Enable cgroupv2, adjust sysctl params.

sysctl -w net.ipv4.ip_forward=1


10. Apply OS security updates
❌ Vulnerable kernel exploited.
✅ Automate patching in maintenance windows; drain and reboot nodes one at a time.

apt update && apt upgrade -y




---

B. Pod & Workload Management

11. Set resource requests/limits
❌ Pods hog resources → others throttled.
✅ Define CPU/memory in manifests.

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi


12. Configure PodDisruptionBudgets
❌ All pods evicted during maintenance.
✅ Set minAvailable or maxUnavailable.

minAvailable: 2
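
For context, a minimal sketch of the full object around that field. The name and the app: web label are assumptions for illustration.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # voluntary evictions (e.g. kubectl drain) must leave 2 pods running
  selector:
    matchLabels:
      app: web             # assumed label on the protected pods
```

With 3 replicas this lets a drain evict one pod at a time; with only 2 replicas it would block drains entirely.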


13. Readiness/Liveness probes
❌ Unhealthy pods still receive traffic.
✅ Define HTTP/TCP readiness and liveness probes in the manifest.

readinessProbe:
  httpGet:
    path: /health
    port: 8080
livenessProbe:
  httpGet:
    path: /health
    port: 8080


14. Pod anti-affinity for critical apps
❌ Critical pods on same node → single point failure.
✅ Set requiredDuringSchedulingIgnoredDuringExecution.

podAntiAffinity: ...
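
One way the elided block could look, as a sketch. The app: web label is an assumption.

```yaml
# Goes under spec.template.spec of a Deployment or StatefulSet
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web                         # assumed label of the critical app
        topologyKey: kubernetes.io/hostname  # at most one matching pod per node
```

Because the rule is "required", replicas beyond the number of eligible nodes stay Pending; preferredDuringSchedulingIgnoredDuringExecution is the softer variant.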


15. Init containers for dependencies
❌ Main app starts before DB ready.
✅ Init container checks service availability.

initContainers: ...
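
A sketch of such an init container, assuming a PostgreSQL database behind a Service named db (both are assumptions; swap in the readiness check for your own dependency).

```yaml
# Goes under spec.template.spec, next to containers
initContainers:
  - name: wait-for-db
    image: postgres:16         # used only for its pg_isready client
    command:
      - sh
      - -c
      # Loop until the server accepts connections; app containers start only after this exits 0
      - until pg_isready -h db -p 5432; do echo "waiting for db"; sleep 2; done
```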


16. Use correct controller type
❌ Stateful apps in Deployments get no stable identity or per-replica storage → data loss risk.
✅ Use StatefulSet for stateful workloads.
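
A minimal StatefulSet sketch. The names, image, and storage size are assumptions, and database replication itself is app-specific and not shown.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db              # headless Service that gives pods stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:        # one PersistentVolumeClaim per replica (data-db-0, data-db-1, ...)
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Each pod keeps its name and its claim across restarts and rescheduling, which a Deployment does not guarantee.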


17. Lightweight, scanned images
❌ Large images slow deploys and carry more vulnerabilities.
✅ Use minimal base images and scan them with trivy/grype.


18. No root containers
❌ Privilege escalation risk.
✅ securityContext.runAsNonRoot: true.
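
A sketch of a hardened pod spec fragment. The UID and image name are assumptions, and the extra container-level settings go beyond runAsNonRoot.

```yaml
# Goes under spec.template.spec
securityContext:               # pod-level defaults
  runAsNonRoot: true           # kubelet refuses to start containers running as UID 0
  runAsUser: 10001             # assumed non-root UID
containers:
  - name: app
    image: registry.example.com/app:1.4.2   # hypothetical image
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```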


19. Use imagePullPolicy=IfNotPresent
❌ Unnecessary image pulls → deploy delays.
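✅ Set in manifests, and pair with immutable version tags: a tag that is re-pushed is not re-pulled on nodes that already have it.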


20. Version-tag images
❌ The latest tag causes inconsistent rollouts and unclear rollbacks.
✅ Use semantic version tags (or image digests).
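
Items 19 and 20 together, as a short sketch. The registry and version are placeholders.

```yaml
containers:
  - name: app
    image: registry.example.com/app:1.4.2   # pinned version tag, never :latest
    imagePullPolicy: IfNotPresent           # reuse the node's cached copy of this exact tag
```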




---

C. Networking & Service Management

21. Right service type
❌ Exposing internal services publicly.
✅ ClusterIP for internal services; LoadBalancer or Ingress for external ones.


22. Secure Ingress with TLS
❌ Plaintext traffic vulnerable to sniffing.
✅ Reference a TLS certificate Secret in the Ingress manifest (spec.tls).
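
A minimal sketch, assuming an nginx ingress class and a kubernetes.io/tls Secret named app-example-tls. The host, Secret, and Service names are all placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx          # assumption: depends on the installed controller
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-tls  # Secret holding tls.crt and tls.key
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```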


23. NetworkPolicies
❌ Pods can talk to everything.
✅ Default-deny, then allow only required traffic (needs a CNI plugin that enforces NetworkPolicy).
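
A sketch of the usual pattern: deny all ingress in a namespace, then open one path. The shop namespace, labels, and port are assumptions.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: shop
spec:
  podSelector: {}          # empty selector = every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so all ingress is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api     # only api pods may reach the db pods
      ports:
        - protocol: TCP
          port: 5432
```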


24. No public API server
❌ Cluster takeover risk.
✅ Restrict via firewall/security groups.


25. Stable DNS via CoreDNS monitoring
❌ Service resolution failures.
✅ Alerts on CoreDNS pod health.


26. Headless services for Stateful workloads
❌ Stateful pods fail to discover peers.
✅ clusterIP: None in Service.
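
A minimal headless Service sketch. The db name and port are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None          # headless: DNS returns pod IPs instead of one virtual IP
  selector:
    app: db
  ports:
    - name: postgres
      port: 5432
```

When a StatefulSet sets serviceName: db, each pod becomes resolvable as db-0.db.<namespace>.svc.cluster.local, which is how peers find each other.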


27. Connection timeouts/retries
❌ Hanging requests block clients.
✅ App-level configs + Istio retries.
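
A sketch of the Istio side, assuming Istio is installed and a Service named orders exists. The timings are illustrative.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders
  http:
    - route:
        - destination:
            host: orders
      timeout: 10s             # overall budget for the request, retries included
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: 5xx,connect-failure
```

Retries are only safe for idempotent requests, so the app-level settings still matter.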


28. externalTrafficPolicy=Local
❌ Client IP lost for logging.
✅ Set in the Service manifest (NodePort/LoadBalancer); traffic then goes only to nodes that run a ready pod.
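
A minimal sketch. The names and ports are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # skip the extra node hop and its SNAT, so the client IP survives
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```

The trade-off is possibly uneven load when pods are not spread evenly across nodes.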


29. Limit public access
❌ Attackers exploit open services.
✅ Security groups + firewall rules.


30. Load-test before go-live
❌ Crashes under real traffic.
✅ Use k6/locust.




---

D. Observability & Troubleshooting

31. Prometheus + Grafana
❌ No performance visibility.
✅ Deploy kube-prometheus-stack.


32. Centralized logs (ELK/Loki)
❌ No log correlation during incidents.
✅ Fluentd/Fluent Bit collectors.


33. Enable audit logging
❌ No trace of API actions.
✅ API server --audit-policy-file plus --audit-log-path (without a policy, no events are logged).
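
A sketch of a small audit policy file. The levels chosen per resource are assumptions and should follow your compliance needs.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived              # skip the noisy "request received" stage
rules:
  # Never log Secret/ConfigMap bodies, only who touched them
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Full request and response for RBAC changes
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
  # Everything else: metadata only (first matching rule wins)
  - level: Metadata
```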


34. Alerts for restarts/resource issues
❌ Issues unnoticed until outage.
✅ Prometheus rules.
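
A sketch of a restart alert, assuming the Prometheus Operator and kube-state-metrics from kube-prometheus-stack. The threshold and the release label are assumptions; the label must match your Prometheus ruleSelector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restarts
  labels:
    release: kube-prometheus-stack   # assumed selector label
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodRestartingFrequently
          # More than 3 container restarts within 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```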


35. kubectl describe/logs
❌ Slow troubleshooting.
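✅ Make kubectl describe pod, kubectl logs (with --previous for crashed containers), and kubectl get events the standard first step.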


36. Runbooks
❌ Inconsistent incident handling.
✅ Confluence/Docs with steps.


37. kubectl top for bottlenecks
❌ Capacity issues unidentified.
✅ Run kubectl top nodes/pods (needs metrics-server) and tune requests/limits from the results.


38. Distributed tracing
❌ Slow services hard to debug.
✅ Jaeger/OpenTelemetry.


39. Historical metrics
❌ No capacity planning data.
✅ Long-term storage in Thanos.


40. DR playbook testing
❌ Failover fails during disaster.
✅ Quarterly drills.




---

E. Security & Compliance

41. RBAC
❌ Users have excessive permissions.
✅ Role/RoleBinding per namespace.
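
A least-privilege sketch. The shop namespace and the shop-developers group are assumptions.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: shop
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]   # pods/exec is deliberately left out
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: shop
  name: pod-reader-binding
subjects:
  - kind: Group
    name: shop-developers             # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

The result can be checked with kubectl auth can-i list pods -n shop --as-group=shop-developers --as=test-user.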


42. Namespaces for isolation
❌ Cross-application interference.
✅ One namespace per app/team.


43. Image scanning
❌ Deploying vulnerable images.
✅ trivy CI scan.


44. Secrets management
❌ Credentials exposed in plain text.
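✅ Keep credentials in Secrets (kubectl create secret), not in manifests or images; enable encryption at rest, since Secrets are only base64-encoded by default.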
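
A sketch of consuming a Secret from a pod, assuming it was created with something like kubectl create secret generic db-credentials --from-file=password=./password.txt. The Secret, key, and image names are placeholders.

```yaml
# Goes under spec.template.spec
containers:
  - name: app
    image: registry.example.com/app:1.4.2   # hypothetical image
    env:
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: db-credentials            # assumed Secret name
            key: password                   # assumed key inside the Secret
```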


45. Rotate secrets
❌ Stolen creds remain valid.
✅ Automate with Vault/KMS.


46. API auth & authorization
❌ Unauthorized cluster actions.
✅ Certs, tokens, OIDC.


47. Restrict kubectl exec
❌ Attackers run commands inside pods.
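✅ Grant the pods/exec subresource only to roles that need it. RBAC is allow-only, so leave it out rather than "deny" it; PodSecurityPolicy was removed in Kubernetes 1.25.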


48. CIS Benchmark
❌ Non-compliant cluster.
✅ kube-bench checks.


49. Admission controllers
❌ Bad manifests deployed.
✅ Pod Security Admission / validating admission webhooks.
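
A sketch of enabling the built-in Pod Security Admission controller per namespace. The namespace name and the choice of the restricted level are assumptions.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: shop
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted      # warn on kubectl apply
    pod-security.kubernetes.io/audit: restricted     # annotate audit log entries
```

Starting with warn and audit only, then adding enforce once workloads pass, avoids breaking existing deployments.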


50. Periodic security audits
❌ Vulnerabilities stay unnoticed.
✅ kubescape scans.




---
