Monday, August 11, 2025
kube
Sunday, August 10, 2025
50 best practices in kubernetes
Here’s a 50-point crisp checklist of Kubernetes production support best practices you can use in interviews — short, sharp, and easy to remember.
Kubernetes Production Support – 50 Best Practices
A. Cluster & Node Management
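- Keep the Kubernetes version within the supported release window (upstream maintains the three most recent minor releases) and apply security patches promptly.
- Use multiple control plane (master) nodes for HA (High Availability).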
- Label and taint nodes for workload segregation.
- Use autoscaling for nodes (Cluster Autoscaler).
- Reserve system resources on nodes using --system-reserved.
- Regularly monitor node health via kubectl get nodes.
- Spread workloads across zones/regions for resilience.
- Avoid overcommitting node resources beyond safe limits.
- Ensure OS and kernel are tuned for container workloads.
- Apply OS-level security updates on nodes regularly.
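As a minimal sketch of the resource-reservation point above, the script below writes a kubelet configuration fragment that sets aside CPU, memory, and ephemeral storage for OS daemons and Kubernetes components. The file name and the reservation sizes are illustrative assumptions, not recommendations; size them from your own node profiles.

```shell
# Write a KubeletConfiguration fragment (file name and sizes are assumptions).
cat > kubelet-reserved.yaml <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserved for OS daemons; the config-file form of --system-reserved.
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
# Reserved for Kubernetes daemons (kubelet, container runtime); the config-file form of --kube-reserved.
kubeReserved:
  cpu: 250m
  memory: 512Mi
EOF

# Print the fragment so it can be reviewed before merging into the node's kubelet config.
cat kubelet-reserved.yaml
```

Once the kubelet runs with settings like these, kubectl describe node <node> should report Allocatable below Capacity by roughly the reserved amounts plus the eviction thresholds.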
B. Pod & Workload Management
- Use resource requests and limits for all pods.
- Configure PodDisruptionBudgets to avoid downtime during maintenance.
- Use Readiness and Liveness probes for health checks.
- Implement pod anti-affinity to avoid co-locating critical workloads.
- Use init containers for dependency checks before main app starts.
- Deploy workloads via Deployment, StatefulSet, or DaemonSet as per use case.
- Keep images lightweight and scan for vulnerabilities.
- Avoid running pods as root.
- Use imagePullPolicy=IfNotPresent for stable deployments.
- Tag images with version, not latest.
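Several of the workload points above can be seen together in one manifest. The sketch below writes a Deployment with requests and limits, both probes, a non-root security context, and a pinned image tag, plus a PodDisruptionBudget. The app name web, the registry URL, the /healthz path, the port, and every number are hypothetical placeholders.

```shell
# Write an example Deployment and PodDisruptionBudget (all names and values are placeholders).
cat > web.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      securityContext:
        runAsNonRoot: true        # refuse to start containers that would run as root
        runAsUser: 10001
      containers:
      - name: web
        image: registry.example.com/web:1.4.2   # pinned version tag, not latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        readinessProbe:           # gates traffic; failures do not restart the container
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
        livenessProbe:            # failures restart the container
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2                 # voluntary disruptions may take down one of three replicas at a time
  selector:
    matchLabels:
      app: web
EOF

# Against a real cluster, apply with: kubectl apply -f web.yaml
echo "wrote web.yaml"
```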
C. Networking & Service Management
- Use ClusterIP for internal services, LoadBalancer/Ingress for external.
- Secure Ingress with TLS (Let’s Encrypt or custom certs).
- Use NetworkPolicies to control pod-to-pod communication.
- Avoid exposing the API server publicly.
- Keep DNS resolution stable via CoreDNS monitoring.
- Use headless services for Stateful workloads.
- Implement connection timeouts and retries in services.
- Configure externalTrafficPolicy=Local for preserving client IP.
- Limit public access to services via firewalls or security groups.
- Load-test services before going live.
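To illustrate the NetworkPolicy and client-IP points above, the sketch below writes a default-deny ingress policy for a namespace and a LoadBalancer Service that preserves the client source IP. The namespace prod, the app label, and the ports are assumptions made for the example.

```shell
# Write a default-deny NetworkPolicy and a Service with externalTrafficPolicy: Local
# (namespace, labels, and ports are placeholders).
cat > network.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress                # no ingress rules listed, so all inbound traffic is denied
---
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: prod
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # route only to pods on the receiving node, keeping the client IP
  selector:
    app: web
  ports:
  - port: 443
    targetPort: 8443
EOF

# Against a real cluster, apply with: kubectl apply -f network.yaml
echo "wrote network.yaml"
```

A default-deny policy only takes effect when the cluster's network plugin enforces NetworkPolicy, and it needs additional allow policies for the traffic each workload legitimately receives.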
D. Observability & Troubleshooting
- Integrate Prometheus & Grafana for metrics monitoring.
- Centralize logs via ELK or Loki.
- Enable Kubernetes audit logging for API server.
- Set up alerts for pod restarts, CPU/memory saturation.
- Use kubectl describe and kubectl logs for quick debugging.
- Maintain runbooks for common incident scenarios.
- Use kubectl top to identify resource bottlenecks.
- Set up traceability with OpenTelemetry/Jaeger.
- Store historical metrics for capacity planning.
- Regularly test disaster recovery (DR) playbooks.
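The alerting point above could look roughly like the following Prometheus rule file. This is a sketch that assumes kube-state-metrics is installed (it exposes kube_pod_container_status_restarts_total); the group name, alert name, threshold, and time windows are illustrative choices.

```shell
# Write a Prometheus alerting rule for frequently restarting containers
# (assumes kube-state-metrics; threshold and windows are placeholders).
cat > pod-alerts.yaml <<'EOF'
groups:
- name: pod-health
  rules:
  - alert: PodRestartingFrequently
    # More than 3 container restarts within the last 15 minutes.
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting frequently"
EOF

echo "wrote pod-alerts.yaml"
```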
E. Security & Compliance
- Enable Role-Based Access Control (RBAC).
- Use namespaces for workload isolation.
- Scan container images before deployment.
- Store sensitive values in Kubernetes Secret objects rather than hard-coding them as plain environment variables in manifests.
- Rotate secrets and credentials periodically.
- Enable API authentication and authorization.
- Restrict kubectl exec access in production.
- Use CIS Kubernetes Benchmark for compliance checks.
- Enable admission controllers (PodSecurity, ValidatingAdmissionWebhook).
- Perform periodic security audits with tools like kube-bench or Kubescape.
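One way to restrict kubectl exec with RBAC is to grant read-only access to pods and their logs while leaving out the pods/exec subresource. The sketch below writes such a Role; the role name pod-viewer and the namespace prod are hypothetical.

```shell
# Write a read-only Role that omits pods/exec (name and namespace are placeholders).
cat > pod-viewer-role.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-viewer
  namespace: prod
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]   # pods/exec is deliberately not listed
  verbs: ["get", "list", "watch"]
EOF

# Against a real cluster, after binding the Role to a user, verify with:
#   kubectl auth can-i create pods --subresource=exec --as=<user> -n prod
echo "wrote pod-viewer-role.yaml"
```

RBAC is additive, so this only restricts exec if the user holds no other role that grants it.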
kubernetes crisp
Got it — you want crisp, 30–45 second interview-ready answers for each Kubernetes real-time scenario, so you can speak confidently without sounding like you’re reading a script.
I’ll make them in this format:
Scenario → Concise Answer (Why + How + Example in bullets, with imperative command if relevant)
1. Pods Stuck in CrashLoopBackOff
- Why: Container is repeatedly failing — bad config, missing env vars, wrong image.
- How: Check logs and events, then fix the root cause.
- Example:
kubectl describe pod <pod>
kubectl logs <pod> -c <container>
- Interview line: “First, I’d check kubectl describe and logs to see the failure reason, fix the image/config, then redeploy.”
2. Pod Pending (Scheduling Issue)
- Why: No matching node (resource shortage, node selector mismatch, taints).
- How: Check events, resource requests, node taints.
- Example:
kubectl describe pod <pod>
kubectl get nodes --show-labels
- Interview line: “I’d inspect scheduling events, verify resource requests, and adjust node selectors or scale nodes.”
3. High CPU/Memory Usage
- Why: Container limits too low, memory leak, heavy workload.
- How: Use metrics, adjust resources, scale pods.
- Example:
kubectl top pod
kubectl top node
kubectl scale deploy <name> --replicas=5
- Interview line: “I’d use kubectl top to confirm high usage, then tune requests/limits or scale the deployment.”
4. Service Not Accessible
- Why: Service type mismatch, wrong selector, missing endpoints.
- How: Check service endpoints and pod labels.
- Example:
kubectl get svc
kubectl describe svc <svc>
kubectl get endpoints <svc>
- Interview line: “I’d verify the service has correct selectors and endpoints, then test connectivity from inside the cluster.”
5. Image Pull BackOff
- Why: Wrong image name, private repo without creds.
- How: Fix image reference or add imagePullSecret.
- Example:
kubectl create secret docker-registry mysecret --docker-username=... --docker-password=...
- Interview line: “I’d confirm the image name, check registry access, and add an imagePullSecret if needed.”
Alright, here’s a Kubernetes Troubleshooting Interview Quick-Answer Sheet — each scenario condensed into 30–45 seconds with key imperative commands you can rattle off in an interview without over-talking.
---
1. Pod Stuck in CrashLoopBackOff
Short Answer:
"Usually due to application crashes or bad configs. First, check logs and events to find the root cause, fix config, and redeploy."
Commands:
kubectl get pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns>
kubectl rollout restart deploy <deploy> -n <ns>
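When many pods are involved, a small filter can shortlist the ones to inspect first. The helper below is a sketch: crashlooping_pods is a hypothetical name, and it reads the default table output of kubectl get pods on stdin. The sample input stands in for a live cluster.

```shell
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Column positions are taken from the header row, so -A output works too.
crashlooping_pods() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") n = i
        if ($i == "STATUS") s = i
      }
      next
    }
    $s == "CrashLoopBackOff" { print $n }
  '
}

# Demo with sample output; on a cluster use: kubectl get pods -n <ns> | crashlooping_pods
crashlooping_pods <<'EOF'
NAME    READY   STATUS             RESTARTS   AGE
api-1   0/1     CrashLoopBackOff   7          10m
web-1   1/1     Running            0          2d
EOF
# prints: api-1
```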
---
2. Pod Pending
Short Answer:
"Pending means the scheduler can’t place the pod — usually no matching nodes, insufficient resources, or affinity rules blocking scheduling."
Commands:
kubectl describe pod <pod> -n <ns>
kubectl get nodes -o wide
kubectl describe node <node>
---
3. ImagePullBackOff
Short Answer:
"Registry auth issues, wrong image name, or tag mismatch. Check events and fix image config or secret."
Commands:
kubectl describe pod <pod> -n <ns>
kubectl create secret docker-registry <name> -n <ns> \
--docker-server=<server> --docker-username=<user> \
--docker-password=<pass> --docker-email=<email>
kubectl patch serviceaccount default -n <ns> \
-p '{"imagePullSecrets":[{"name":"<name>"}]}'
---
4. Node Not Ready
Short Answer:
"Can be due to kubelet failure, disk pressure, network issue, or cloud node health problem. Check node status and kubelet logs."
Commands:
kubectl get nodes
kubectl describe node <node>
journalctl -u kubelet -f
---
5. High Pod Restarts
Short Answer:
"Often caused by memory limits (OOM kills), liveness probe failures, or app bugs. Analyze logs, events, and resource usage."
Commands:
kubectl logs <pod> -n <ns>
kubectl top pod <pod> -n <ns>
kubectl describe pod <pod> -n <ns>
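To find which pods are restarting most, the RESTARTS column can be filtered against a threshold. This is a sketch: high_restart_pods is a hypothetical helper, the default threshold of 5 is arbitrary, and the sample input stands in for real kubectl get pods output.

```shell
# Print "<pod> <restarts>" for pods whose restart count exceeds a threshold (default 5).
# Reads the default table output of `kubectl get pods` on stdin.
high_restart_pods() {
  awk -v t="${1:-5}" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") n = i
        if ($i == "RESTARTS") r = i
      }
      next
    }
    # Newer kubectl prints e.g. "7 (3m ago)"; the count is still the first token of the column.
    ($r + 0) > t { print $n, $r }
  '
}

# Demo with sample output; on a cluster use: kubectl get pods -n <ns> | high_restart_pods 5
high_restart_pods <<'EOF'
NAME    READY   STATUS             RESTARTS       AGE
api-1   0/1     CrashLoopBackOff   7 (3m ago)     10m
web-1   1/1     Running            0              2d
EOF
# prints: api-1 7
```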
---
6. OOMKilled
Short Answer:
"Container exceeded memory limit. Increase limits or optimize memory usage."
Commands:
kubectl describe pod <pod> -n <ns>
kubectl edit deploy <deploy> -n <ns>
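Before raising limits, it helps to confirm that the last termination really was an OOM kill. The sketch below checks kubectl describe pod output for that reason; was_oom_killed is a hypothetical helper, and the sample text imitates the relevant part of the describe output.

```shell
# Succeed if the describe output on stdin reports a container terminated with reason OOMKilled.
was_oom_killed() {
  grep -Eq 'Reason:[[:space:]]+OOMKilled'
}

# Sample excerpt standing in for: kubectl describe pod <pod> -n <ns>
sample='    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137'

if printf '%s\n' "$sample" | was_oom_killed; then
  echo "OOMKilled detected: raise the memory limit or reduce memory usage"
fi

# A non-interactive alternative to kubectl edit (512Mi is an illustrative value):
#   kubectl set resources deployment <deploy> -n <ns> --limits=memory=512Mi
```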
---
7. Service Not Accessible
Short Answer:
"Check if service type and selector match pods, and ensure endpoints are created."
Commands:
kubectl get svc -n <ns>
kubectl describe svc <svc> -n <ns>
kubectl get endpoints <svc> -n <ns>
---
8. DNS Resolution Failing
Short Answer:
"Usually CoreDNS crash or misconfig. Check CoreDNS pods and logs."
Commands:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system <coredns-pod>
---
9. PVC Pending
Short Answer:
"StorageClass mismatch or no available PV. Check PVC events and storage backend."
Commands:
kubectl describe pvc <pvc> -n <ns>
kubectl get sc
kubectl get pv
---
10. Deployment Not Updating
Short Answer:
"Wrong selector, failing readiness probe, or paused rollout. Check rollout status."
Commands:
kubectl rollout status deploy <deploy> -n <ns>
kubectl rollout resume deploy <deploy> -n <ns>
kubectl describe deploy <deploy> -n <ns>
---