Saturday, August 9, 2025

kubernetes

I have 20+ years of experience and have been working at Wells Fargo since Nov 2021 as a Systems Operations Manager on the financial products team in the Wealth & Investment Management department. I now have two interviews: PepsiCo (Senior Manager, D&AI SRE, Hyper Care/Transition Engagement (DataOps); they use Azure, Kubernetes, and SRE observability) and Lloyds (they are looking for an SRE Manager: GCP, Kubernetes, and pure SRE).


Now prepare me for the interviews with all the common real-time scenarios, detailed step by step in the format: problem statement, why it occurred, and how to resolve it with commands. They may also ask me to write YAML files for common scenarios along with kubectl commands.


Generate an Excel-style format with clear references so it is easy to remember and prepare from, and include all common real-time scenarios for Kubernetes and SRE with full YAML files and commands.




Got it — you want 20 real-time Kubernetes production scenarios that could come up in an SRE/Support/Manager-level interview for Lloyds, with detailed answers showing:


What happened (scenario)


Why it happened (root cause)


How to troubleshoot & fix (steps & commands)


Prevention measures (long-term fix)


Real-world analogy (so you can explain clearly in interviews)



I’ll structure them so you can talk confidently at a manager level (impact, business risk) and also at a hands-on level (commands, configs).



---


20 Real-Time Kubernetes Scenarios – Detailed



---


1. Pods Stuck in CrashLoopBackOff


Scenario: Application pods keep restarting and never stay up.

Root Cause:


Application misconfiguration (wrong DB URL, missing env vars).


Liveness probe failing repeatedly.


Missing dependency service.



Troubleshooting & Fix:


kubectl get pods -n <namespace>

kubectl describe pod <pod-name> -n <namespace>

kubectl logs <pod-name> -n <namespace> --previous


Check container logs for stack trace.


Verify ConfigMap/Secret values.


If liveness probe issue: temporarily disable probe → fix app → re-enable.



Prevention:


Automated smoke tests before deployment.


Use readiness probe to prevent sending traffic to unhealthy pods.



Analogy: Like a shopkeeper who keeps opening the shop but immediately closing because the electricity is out.
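
If you are asked to write the YAML, here is a minimal probe sketch for this scenario (the app name, image, and /healthz path are assumptions, not taken from any real system):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                 # assumed name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
      - name: payments-api
        image: registry.example.com/payments-api:1.4.2   # placeholder image
        envFrom:
        - configMapRef:
            name: payments-api-config                    # verify these values exist first
        livenessProbe:                 # restarts the container only when it is truly dead
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30      # give the app time to start before the first check
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:                # keeps traffic away until dependencies are reachable
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5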



---


2. Pods in ImagePullBackOff


Scenario: Deployment fails because the container image can’t be pulled.

Root Cause:


Wrong image name or tag.


Private registry without credentials.


Registry outage.



Troubleshooting & Fix:


kubectl describe pod <pod-name>

kubectl create secret docker-registry regcred \

  --docker-server=<registry> --docker-username=<user> \

  --docker-password=<pass> --docker-email=<email>


Update deployment to use imagePullSecrets.



Prevention:


Use image digest instead of latest.


Set up image pull secret at namespace level.



Analogy: Like ordering a product online but typing the wrong product ID or forgetting your password for the site.
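
A hedged sketch of wiring the registry secret into the pod spec (the deployment name, registry, and digest are placeholders; regcred matches the secret created above):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend                 # assumed name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      imagePullSecrets:
      - name: regcred                # secret created with kubectl create secret docker-registry
      containers:
      - name: web-frontend
        # pin a digest rather than a mutable tag such as latest
        image: myregistry.example.com/web-frontend@sha256:<digest>   # placeholder digest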



---


3. Node Out of Memory (OOMKill)


Scenario: Pods keep getting terminated with OOMKilled.

Root Cause:


Container memory limit too low.


Memory leak in application.



Troubleshooting & Fix:


kubectl describe pod <pod-name> | grep -i OOM

kubectl top pod <pod-name>


Increase memory requests/limits in YAML.


Fix code memory leaks.



Prevention:


Use resource requests & limits properly.


Enable Prometheus alerts for memory usage > 80%.



Analogy: Like having too many people in a small lift — eventually, someone gets pushed out.
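
A resources sketch you could write on a whiteboard; the numbers are illustrative and should come from real profiling, not guesswork:

apiVersion: v1
kind: Pod
metadata:
  name: orders-worker                # assumed name
spec:
  containers:
  - name: orders-worker
    image: registry.example.com/orders-worker:2.1.0   # placeholder image
    resources:
      requests:
        memory: "512Mi"              # what the scheduler reserves for the pod
        cpu: "250m"
      limits:
        memory: "1Gi"                # the container is OOMKilled above this
        cpu: "500m"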



---


4. High CPU Usage Slowing Down Cluster


Scenario: Cluster becomes sluggish, API calls take longer.

Root Cause:


High CPU load from pods doing heavy computation.


Misconfigured resource limits.



Troubleshooting & Fix:


kubectl top nodes

kubectl top pod --sort-by=cpu


Identify heavy pods → scale out or optimize.



Prevention:


Set CPU limits and requests per service.


Use Horizontal Pod Autoscaler (HPA).



Analogy: Like too many people talking loudly in a small meeting room, making it hard to focus.
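
For the scale-out option, a minimal HPA sketch (the target deployment and the 70% threshold are assumptions to be replaced by load-test numbers):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa             # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api               # deployment being scaled
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70       # add replicas once average CPU crosses 70%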



---


5. PersistentVolumeClaim Stuck in Pending


Scenario: PVC never gets bound.

Root Cause:


No matching StorageClass.


Insufficient storage on nodes.



Troubleshooting & Fix:


kubectl get pvc

kubectl describe pvc <pvc-name>

kubectl get sc


Ensure storage class matches PVC request.


Add storage or adjust claim size.



Prevention:


Define default StorageClass.


Monitor storage capacity with alerts.



Analogy: Like booking a hotel room size that the hotel doesn’t have.
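
A sketch of a default StorageClass plus a PVC that binds to it (the provisioner shown is the Azure Disk CSI driver purely as an example; use whatever your cluster actually runs):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-ssd                 # assumed name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: disk.csi.azure.com      # example CSI provisioner, cluster-dependent
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: reports-data                 # assumed name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard-ssd     # must match an existing StorageClass
  resources:
    requests:
      storage: 20Gi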



---


6. Service Not Accessible Inside Cluster


Scenario: Another pod can’t reach a service using DNS name.

Root Cause:


Service selector mismatch.


Wrong port in Service definition.


CoreDNS crash.



Troubleshooting & Fix:


kubectl get svc

kubectl describe svc <service-name>

kubectl get endpoints <service-name>


Fix label selector.


Restart CoreDNS if required.



Prevention:


Use readiness probes to register healthy pods only.



Analogy: Like a phonebook entry pointing to the wrong phone number.
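
A sketch of the selector/label pairing the endpoints check depends on (names are assumptions); if kubectl get endpoints returns <none>, this selector does not match any pod labels:

apiVersion: v1
kind: Service
metadata:
  name: orders-svc                   # assumed name
spec:
  selector:
    app: orders                      # must exactly match the pod template labels
  ports:
  - name: http
    port: 80                         # port callers use
    targetPort: 8080                 # port the container actually listens on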



---


7. External Access Not Working


Scenario: Application is not reachable from outside cluster.

Root Cause:


Ingress misconfiguration.


Firewall rules blocking traffic.



Troubleshooting & Fix:


kubectl describe ingress <name>

kubectl get svc <svc-name>


Check ingress rules & DNS mapping.



Prevention:


Use Ingress controller health checks.



Analogy: Like giving someone your house number but forgetting to give them the gate key.
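
A minimal Ingress sketch (host, class name, and backend service are placeholders; external DNS and firewall rules still have to point at the controller's public IP):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress                  # assumed name
spec:
  ingressClassName: nginx            # must match an installed ingress controller
  rules:
  - host: app.example.com            # placeholder DNS name
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-frontend       # existing Service name
            port:
              number: 80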



---


8. HPA Not Scaling Pods


Scenario: Load increases but pod count remains the same.

Root Cause:


Metrics server not working.


HPA target metrics not matching real load.



Troubleshooting & Fix:


kubectl get hpa

kubectl describe hpa

kubectl get pods -n kube-system | grep metrics-server


Restart metrics server.


Adjust CPU/memory thresholds.



Prevention:


Set realistic thresholds based on load tests.



Analogy: Like a waiter ignoring more customers entering the restaurant because he can’t see them.
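
A few verification commands worth having ready, assuming metrics-server runs in kube-system (adjust names to your cluster):

kubectl get apiservice v1beta1.metrics.k8s.io         # confirm the metrics API is registered and Available
kubectl top pods -n <namespace>                       # returns usage only if metrics actually flow
kubectl describe hpa <hpa-name> -n <namespace>        # check Conditions/Events for "unable to get metrics"
kubectl rollout restart deployment metrics-server -n kube-system   # restart metrics-server if it is wedged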



---


9. ConfigMap Changes Not Reflecting


Scenario: You update a ConfigMap, but the pod still uses old values.

Root Cause:


Env-based config and subPath volume mounts are only read at pod start, so pods need a restart to pick up changes; plain ConfigMap volume mounts are refreshed on the node, but the app still has to re-read the file.



Troubleshooting & Fix:


kubectl rollout restart deployment <deployment-name>


Prevention:


Use hot-reload mechanisms in the app.



Analogy: Like changing a recipe in the kitchen but the chef keeps using the old one until you hand him the new copy.
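
A sketch showing both consumption styles side by side (names are assumptions); only the plain volume mount below refreshes without a restart, and the app still has to re-read the file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config                   # assumed name
data:
  LOG_LEVEL: "info"
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    envFrom:
    - configMapRef:
        name: app-config             # env vars: frozen until the pod restarts
    volumeMounts:
    - name: config-volume
      mountPath: /etc/app            # file contents refresh automatically (no subPath)
  volumes:
  - name: config-volume
    configMap:
      name: app-config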



---


10. Secret Leaked in Logs


Scenario: Sensitive data appears in logs.

Root Cause:


Application logging full environment variables.



Troubleshooting & Fix:


Mask sensitive info in code.


Use Kubernetes Secrets with volume mounts instead of env vars.



Prevention:


Security reviews in CI/CD.



Analogy: Like accidentally shouting your ATM PIN in a crowded room.
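
A sketch of mounting the Secret as a file instead of exposing it through the environment (names and keys are assumptions; real pipelines would inject the value from a vault):

apiVersion: v1
kind: Secret
metadata:
  name: db-credentials               # assumed name
type: Opaque
stringData:
  password: "change-me"              # placeholder value
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  containers:
  - name: payments-api
    image: registry.example.com/payments-api:1.4.2   # placeholder image
    volumeMounts:
    - name: db-secret
      mountPath: /etc/secrets        # app reads /etc/secrets/password; nothing lands in env dumps
      readOnly: true
  volumes:
  - name: db-secret
    secret:
      secretName: db-credentials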



---


11. Pod Scheduling Failure


Scenario: Pod stuck in Pending due to scheduling issues.

Root Cause:


Resource requests exceed node capacity.


Node taints preventing scheduling.



Troubleshooting & Fix:


kubectl describe pod <pod-name>

kubectl describe node <node-name>


Reduce requests or add more nodes.


Use tolerations if needed.



Analogy: Like booking a lorry for transport but your items don’t fit inside.
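
A sketch of a toleration plus node selector for the taint case (the taint key/value and node label are assumptions; check kubectl describe node for the real ones):

apiVersion: v1
kind: Pod
metadata:
  name: batch-job                    # assumed name
spec:
  nodeSelector:
    workload: batch                  # assumed node label
  tolerations:
  - key: "dedicated"                 # must match the taint key on the node
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
  containers:
  - name: batch-job
    image: registry.example.com/batch-job:1.0   # placeholder image
    resources:
      requests:
        cpu: "500m"                  # keep requests within node allocatable capacity
        memory: "256Mi"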



---


12. Readiness Probe Failing After Deployment


Scenario: Pods are running but not ready.

Root Cause:


Probe endpoint not responding within timeout.



Troubleshooting & Fix:


Increase initialDelaySeconds or timeoutSeconds.


Fix app startup slowness.



Prevention:


Tune probe settings after load testing.



Analogy: Like opening a shop but keeping the “Closed” sign on for too long.
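
Instead of only stretching initialDelaySeconds, a startupProbe is often the cleaner answer for slow starters; a sketch with assumed names and illustrative numbers:

apiVersion: v1
kind: Pod
metadata:
  name: reports-api                  # assumed name
spec:
  containers:
  - name: reports-api
    image: registry.example.com/reports-api:3.0   # placeholder image
    startupProbe:                    # holds off the other probes until startup completes
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 10
      failureThreshold: 30           # allows up to ~300s of startup time
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 3              # raise if the endpoint is slow under load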



---


13. Stale DNS Entries


Scenario: Service DNS resolves to old pod IPs.

Root Cause:


CoreDNS cache delay.



Troubleshooting & Fix:


kubectl rollout restart deployment coredns -n kube-system


Prevention:


Lower DNS TTL.



Analogy: Like using an old address to send a letter to someone who moved last week.



---


14. NetworkPolicy Blocking Traffic


Scenario: Services can’t talk to each other.

Root Cause:


Strict NetworkPolicy rules.



Troubleshooting & Fix:


kubectl get netpol

kubectl describe netpol <name>


Adjust rules to allow required namespaces and labels.



Prevention:


Maintain NetworkPolicy documentation.



Analogy: Like putting a security guard at your building who doesn’t let even your friends in.
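
A sketch of a policy that allows only the intended callers (namespace, labels, and port are assumptions):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-orders     # assumed name
  namespace: orders                  # assumed namespace
spec:
  podSelector:
    matchLabels:
      app: orders
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend   # built-in namespace name label
      podSelector:
        matchLabels:
          app: web-frontend
    ports:
    - protocol: TCP
      port: 8080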



---


15. Node NotReady


Scenario: Node marked NotReady in cluster.

Root Cause:


Kubelet crash.


Network partition.



Troubleshooting & Fix:


kubectl get nodes

journalctl -u kubelet


Restart kubelet service.



Prevention:


Node health monitoring with alerts.



Analogy: Like a delivery truck driver not answering his phone because the battery died.
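
Typical commands for this scenario (node name is a placeholder; drain flags vary slightly between kubectl versions):

kubectl describe node <node-name>    # check Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure
kubectl cordon <node-name>           # stop new pods landing while you investigate
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data   # only if the node must be rebuilt

# on the node itself
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 min ago"
sudo systemctl restart kubelet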



---


16. DaemonSet Pods Missing


Scenario: Some nodes don’t have expected DaemonSet pods.

Root Cause:


Node selector mismatch.


Taints without tolerations.



Troubleshooting & Fix:


Update DaemonSet YAML with correct tolerations.



Prevention:


Use node affinity rules correctly.



Analogy: Like a security guard assigned to each gate but missing at one gate because the assignment list was wrong.
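
A DaemonSet sketch that tolerates all taints so the agent lands on every node (agent name and image are assumptions):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent                    # assumed name
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      tolerations:
      - operator: "Exists"           # tolerate every taint, including control-plane nodes
      containers:
      - name: log-agent
        image: registry.example.com/log-agent:1.0   # placeholder image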



---


17. Application Rollback Needed


Scenario: New deployment causing failures, need rollback.

Troubleshooting & Fix:


kubectl rollout undo deployment <deployment-name>


Prevention:


Blue-green or canary deployments.



Analogy: Like serving a new dish to customers and switching back to the old menu after complaints.
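
The rollback commands worth reciting end to end (deployment name is a placeholder):

kubectl rollout history deployment <deployment-name>                 # list revisions
kubectl rollout undo deployment <deployment-name>                    # back to the previous revision
kubectl rollout undo deployment <deployment-name> --to-revision=3    # or to a specific known-good one
kubectl rollout status deployment <deployment-name>                  # watch it converge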



---


18. Logs Missing After Pod Restart


Scenario: Need old logs but pod restarted.

Root Cause:


Logs stored only in container filesystem.



Troubleshooting & Fix:


Use centralized logging (ELK, Splunk).



Analogy: Like writing notes on a whiteboard and losing them when someone wipes it.



---


19. PVC Data Lost After Pod Restart


Scenario: App loses data after restart.

Root Cause:


Using emptyDir instead of PersistentVolume.



Fix:


Change to PVC-backed volume.



Analogy: Like keeping money in your pocket instead of a bank account.
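
A sketch of swapping emptyDir for a PVC-backed volume (claim name, mount path, and image are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: document-store               # assumed name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: document-store
  template:
    metadata:
      labels:
        app: document-store
    spec:
      containers:
      - name: document-store
        image: registry.example.com/document-store:1.0   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/documents
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: documents-pvc   # survives pod restarts, unlike emptyDir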



---


20. Image Vulnerabilities Found


Scenario: Security scan flags critical CVEs in running image.

Root Cause:


Old base image.



Fix:


Build image from updated base.


Redeploy.



Prevention:


Integrate image scanning in CI/CD.



Analogy: Like discovering your house locks are outdated and easy to break.
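
Example commands, assuming Trivy is the scanner in use (flags are Trivy's; other tools differ) and an assumed image name:

trivy image --severity HIGH,CRITICAL --exit-code 1 registry.example.com/payments-api:1.4.2   # fail the CI stage on serious findings

# after rebuilding on a patched base image, roll the fix out
kubectl set image deployment/payments-api payments-api=registry.example.com/payments-api:1.4.3
kubectl rollout status deployment/payments-api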



---


If you want, I can now make this into a Lloyds-specific "K8s Interview Battle Card" PDF with commands + diagrams so you can use it like a quick reference during prep.

Do you want me to do that next?


Alright — I’ll finish scenarios 16–20 now so your “Lloyds Kubernetes Interview Battle Card” is complete, and then I’ll compile everything into a single clean PDF you can download and print.



---


🔸 16. PVC Data Lost After Pod Restart


Root Cause:


Using emptyDir or ephemeral storage instead of PersistentVolume


PVC incorrectly bound or deleted during redeploy



Troubleshooting & Fix:


kubectl get pvc -n <namespace>

kubectl describe pvc <pvc-name> -n <namespace>


Ensure volumeMounts point to persistent PVC


Restore from backup if data is lost



Prevention:


Use PersistentVolumeClaims with appropriate access modes (e.g., ReadWriteOnce)


Automated backup of critical volumes



Business Impact: 🔥 High Risk

Loss of financial transaction records or customer-uploaded documents.


Analogy: Like storing bank records on a whiteboard that gets wiped every night.



---


🔸 17. High CPU Usage Slowing Cluster


Root Cause:


Pods requesting too much CPU


No CPU limits set → noisy neighbor effect



Troubleshooting & Fix:


kubectl top nodes

kubectl top pods --all-namespaces


Identify top CPU consumers


Apply resource limits in deployment YAML



Prevention:


Enforce ResourceQuotas


Auto-scale based on CPU thresholds



Business Impact: ⚠️ Medium Risk

Core banking APIs slow down, increasing transaction times.


Analogy: Like one person in an office hogging all the internet bandwidth for video streaming.
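
A ResourceQuota sketch for the enforcement point above (namespace and numbers are assumptions):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota                   # assumed name
  namespace: payments                # assumed namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"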



---


🔸 18. OOMKilled Pods


Root Cause:


Memory limits too low


Memory leaks in application code



Troubleshooting & Fix:


kubectl describe pod <pod-name> -n <namespace>


Increase resources.limits.memory


Profile and fix memory leaks in code



Prevention:


Right-size limits in staging before prod deploys


Enable memory usage alerts



Business Impact: 🔥 High Risk

Applications crash under peak loads, disrupting payment gateways.


Analogy: Like overfilling a balloon until it bursts.



---


🔸 19. Service Not Accessible Internally


Root Cause:


Service selectors not matching pod labels


DNS entry missing in CoreDNS



Troubleshooting & Fix:


kubectl get svc -n <namespace>

kubectl describe svc <service-name> -n <namespace>


Correct selector labels


Restart CoreDNS if DNS is stale



Prevention:


Label verification in CI/CD pipelines


Service connectivity tests



Business Impact: ⚠️ Medium Risk

Internal APIs fail, breaking microservice chains.


Analogy: Like a receptionist not knowing which room a guest is in.



---


🔸 20. Image Vulnerabilities Found


Root Cause:


Outdated base images


Lack of container image scanning



Troubleshooting & Fix:


Run security scan (e.g., Trivy, Aqua, Anchore)


Patch vulnerable packages


Rebuild and redeploy



Prevention:


Integrate scanning into CI/CD


Maintain approved base image registry



Business Impact: 🔥 High Risk

Regulatory compliance breach, risk of cyber-attack.


Analogy: Like serving food from expired stock.



---


✅ Now that all 20 scenarios are complete, I’ll combine them with the intro, table of contents, and formatting into a single PDF so you can download and use it as your Lloyds interview battle card.


Do you want me to style the PDF with headings, bullet points, and code blocks exactly like this chat so it looks polished and easy to read? That way, you can directly take it into the interview.


