---
Kubernetes Production Support – 50 Best Practices (Interview Edition)
---
A. Cluster & Node Management
1. Keep Kubernetes version up to date
❌ Old versions miss security patches → vulnerable cluster.
✅ Upgrade with kubeadm upgrade, one minor version at a time, during planned maintenance windows.
kubectl version
kubeadm upgrade plan
2. Use multiple master nodes (HA)
❌ Single master = control plane outage if node fails.
✅ Deploy at least 3 control-plane (master) nodes so etcd keeps quorum.
(YAML: kubeadm config with stacked etcd)
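A minimal sketch of that config, assuming a load balancer in front of the API servers (endpoint, version, and hostnames are illustrative):
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0                    # illustrative
controlPlaneEndpoint: "lb.example.com:6443"   # LB fronting all control-plane nodes
etcd:
  local:                 # stacked etcd: one member per control-plane node
    dataDir: /var/lib/etcd
Additional control-plane nodes then join with kubeadm join ... --control-plane.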
3. Label & taint nodes
❌ Workloads run on wrong nodes → performance/security risk.
✅ Use labels for scheduling, taints to block unwanted pods.
kubectl label node node1 role=db
kubectl taint nodes node1 dedicated=db:NoSchedule
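For a pod to land on that dedicated node, its spec needs a matching toleration plus a nodeSelector — a sketch:
tolerations:
- key: dedicated
  operator: Equal
  value: db
  effect: NoSchedule
nodeSelector:
  role: db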
4. Enable Cluster Autoscaler
❌ Manual scaling → delays & outages under load.
✅ Deploy autoscaler with cloud provider integration.
kubectl apply -f cluster-autoscaler.yaml
5. Reserve system resources
❌ Kubelet starved → node unstable.
✅ Set systemReserved (and kubeReserved) in the kubelet config.
(kubelet config YAML)
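A sketch of that kubelet config (reservation sizes are illustrative; size them to the node):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:          # for OS daemons (sshd, journald, ...)
  cpu: 500m
  memory: 512Mi
kubeReserved:            # for kubelet and container runtime
  cpu: 500m
  memory: 512Mi
evictionHard:
  memory.available: "200Mi"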
6. Monitor node health
❌ Node failures unnoticed → pod downtime.
✅ Use kubectl get nodes + Prometheus alerts.
kubectl get nodes -o wide
7. Spread workloads across zones
❌ Zone outage takes all workloads down.
✅ Use topology spread constraints or node labels.
topologySpreadConstraints:
- maxSkew: 1
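The fragment above, expanded into a complete constraint (assuming pods labeled app: web):
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone   # spread across zones
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web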
8. Avoid overcommitting resources
❌ Pods evicted due to memory pressure.
✅ Monitor requests/limits ratio in Grafana.
kubectl top nodes
9. Tune OS/kernel for containers
❌ Network & disk latency issues.
✅ Enable cgroup v2 and tune sysctl params.
sysctl -w net.ipv4.ip_forward=1
10. Apply OS security updates
❌ Vulnerable kernel exploited.
✅ Automate patching with maintenance windows.
apt update && apt upgrade -y
---
B. Pod & Workload Management
11. Set resource requests/limits
❌ Pods hog resources → others throttled.
✅ Define CPU/memory in manifests.
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
12. Configure PodDisruptionBudgets
❌ All pods evicted during maintenance.
✅ Set minAvailable or maxUnavailable.
minAvailable: 2
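A full PodDisruptionBudget sketch, assuming pods labeled app: web:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # voluntary evictions never drop below 2 pods
  selector:
    matchLabels:
      app: web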
13. Readiness/Liveness probes
❌ Unhealthy pods still receive traffic.
✅ HTTP/TCP probes in manifest.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
14. Pod anti-affinity for critical apps
❌ Critical pods on same node → single point failure.
✅ Set requiredDuringSchedulingIgnoredDuringExecution.
podAntiAffinity: ...
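A sketch of the hard rule, assuming replicas labeled app: critical spread one-per-node:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: critical
      topologyKey: kubernetes.io/hostname   # no two such pods on one node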
15. Init containers for dependencies
❌ Main app starts before DB ready.
✅ Init container checks service availability.
initContainers: ...
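A sketch of such an init container (image, service name, and check are illustrative):
initContainers:
- name: wait-for-db
  image: busybox:1.36
  command: ['sh', '-c', 'until nslookup db; do echo waiting for db; sleep 2; done']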
16. Use correct controller type
❌ Stateful apps lose data with Deployments.
✅ Use StatefulSet for stateful workloads.
17. Lightweight, scanned images
❌ Large images slow deploys and may ship vulnerabilities.
✅ Use trivy/grype for scans.
18. No root containers
❌ Privilege escalation risk.
✅ securityContext.runAsNonRoot: true.
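A container-level securityContext sketch (the UID is illustrative; the extra flags are common hardening):
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true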
19. Use imagePullPolicy=IfNotPresent
❌ Unnecessary image pulls → deploy delays.
✅ Set in manifests.
20. Version-tag images
❌ The :latest tag causes inconsistent rollouts.
✅ Use semantic version tags.
---
C. Networking & Service Management
21. Right service type
❌ Exposing internal services publicly.
✅ ClusterIP internal, LoadBalancer/Ingress for external.
22. Secure Ingress with TLS
❌ Plaintext traffic vulnerable to sniffing.
✅ TLS cert in Ingress manifest.
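An Ingress TLS sketch, assuming the cert lives in a Secret named web-tls (host and names are illustrative):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  tls:
  - hosts:
    - app.example.com
    secretName: web-tls    # kubernetes.io/tls Secret holding cert + key
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80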
23. NetworkPolicies
❌ Pods can talk to everything.
✅ Allow only required traffic.
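A sketch allowing only frontend pods to reach db pods on 5432 (labels and port are illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 5432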
24. No public API server
❌ Cluster takeover risk.
✅ Restrict via firewall/security groups.
25. Stable DNS via CoreDNS monitoring
❌ Service resolution failures.
✅ Alerts on CoreDNS pod health.
26. Headless services for Stateful workloads
❌ Stateful pods fail to discover peers.
✅ clusterIP: None in Service.
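A headless Service sketch for a StatefulSet's peers (names and port are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None          # headless: DNS returns individual pod IPs
  selector:
    app: db
  ports:
  - port: 5432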
27. Connection timeouts/retries
❌ Hanging requests block clients.
✅ App-level configs + Istio retries.
28. externalTrafficPolicy=Local
❌ Client IP lost for logging.
✅ Set in Service manifest.
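A Service sketch (applies to NodePort/LoadBalancer types; names are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client IP; route only to node-local endpoints
  selector:
    app: web
  ports:
  - port: 80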
29. Limit public access
❌ Attackers exploit open services.
✅ Security groups + firewall rules.
30. Load-test before go-live
❌ Crashes under real traffic.
✅ Use k6/locust.
---
D. Observability & Troubleshooting
31. Prometheus + Grafana
❌ No performance visibility.
✅ Deploy kube-prometheus-stack.
32. Centralized logs (ELK/Loki)
❌ No log correlation during incidents.
✅ Fluentd/FluentBit collectors.
33. Enable audit logging
❌ No trace of API actions.
✅ API server --audit-log-path.
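A minimal audit policy plus the apiserver flags that wire it up (paths are illustrative):
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata        # record who did what, without request bodies
Then on kube-apiserver:
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log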
34. Alerts for restarts/resource issues
❌ Issues unnoticed until outage.
✅ Prometheus rules.
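A rule-file sketch for crash-looping pods, assuming kube-state-metrics is scraped (threshold is illustrative):
groups:
- name: kubernetes-pods
  rules:
  - alert: PodRestartingOften
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is restarting frequently"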
35. kubectl describe/logs
❌ Slow troubleshooting.
✅ Standard first step.
36. Runbooks
❌ Inconsistent incident handling.
✅ Confluence/Docs with steps.
37. kubectl top for bottlenecks
❌ Capacity issues unidentified.
✅ Resource tuning.
38. Distributed tracing
❌ Slow services hard to debug.
✅ Jaeger/OpenTelemetry.
39. Historical metrics
❌ No capacity planning data.
✅ Long-term storage in Thanos.
40. DR playbook testing
❌ Failover fails during disaster.
✅ Quarterly drills.
---
E. Security & Compliance
41. RBAC
❌ Users have excessive permissions.
✅ Role/RoleBinding per namespace.
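A namespaced read-only Role and its binding, as a sketch (namespace, names, and user are illustrative):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: team-a
  name: pod-reader-binding
subjects:
- kind: User
  name: dev-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io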
42. Namespaces for isolation
❌ Cross-application interference.
✅ One namespace per app/team.
43. Image scanning
❌ Deploying vulnerable images.
✅ trivy CI scan.
44. Secrets management
❌ Credentials exposed in plain text.
✅ Store credentials as Kubernetes Secrets (kubectl create secret), never inline in manifests.
45. Rotate secrets
❌ Stolen creds remain valid.
✅ Automate with Vault/KMS.
46. API auth & authorization
❌ Unauthorized cluster actions.
✅ Certs, tokens, OIDC.
47. Restrict kubectl exec
❌ Attackers run commands inside pods.
✅ Don't grant the pods/exec subresource in RBAC (PSP was removed in v1.25; use Pod Security admission instead).
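RBAC is allow-only, so exec is blocked by never granting it. This is the rule that would enable kubectl exec — omit it from production roles:
rules:
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]        # granting this permits kubectl exec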
48. CIS Benchmark
❌ Non-compliant cluster.
✅ kube-bench checks.
49. Admission controllers
❌ Bad manifests deployed.
✅ PodSecurity/ValidatingWebhook.
50. Periodic security audits
❌ Vulnerabilities stay unnoticed.
✅ kubescape scans.
---