Wednesday, November 26, 2025

Lloyds cyber

How do you integrate a pipeline?
What type of challenges did you face while building the pipeline?
How did you improve the pipelines?
What are the steps you follow when you want to decommission any asset or application?
Explain 2-3 complex scenarios where your involvement contributed.
What decisions did you take, and did any give long-term benefits?
Lloyds has 5 values: we put customers first, be bold, inclusive, we are trust driven, we put community together. Explain each value with a real-time scenario.


Storytelling
===========

Below is a storytelling script for each question, written in a conversational, confident, leadership-oriented tone.
You can speak this exactly in the interview — it sounds natural, senior, and impressive.


---

🎤 1. “How do you integrate a CI/CD pipeline?” — Storytelling Version

Story Script:

“When I integrate a CI/CD pipeline, I treat it like onboarding a new team member — it should be reliable, predictable, and add value from Day 1.

In one of my previous projects, we were modernising a legacy Wealth Management application. Developers were pushing code manually, deployments took hours, and production had frequent issues.

So I integrated a full CI/CD pipeline:

When developers committed code, the build automatically started

Unit tests, code quality, and security scans were triggered

Container images were packaged and stored in an artifact registry

Deployments to OpenShift were handled through GitOps using ArgoCD

Production deployments went via canary rollout based on error rates and latency


The entire process became automated end-to-end.
Earlier, a release used to take 4 hours. After CI/CD, it took 15 minutes.

So my approach is simple:
Build → Scan → Test → Package → Deploy → Observe → Rollback
All backed with automation and quality gates.”


---

🎤 2. “What are the stages of a CI/CD pipeline?” — Storytelling Version

Story Script:

“I explain pipeline stages just like a quality-controlled assembly line in a car factory.

1. Code Stage – Engineers write and review the code.


2. Integration Stage – Code is built, unit tested, and checked for quality.


3. Security Stage – SAST, SCA, container scans.


4. Delivery Stage – App is deployed to DEV/UAT automatically.


5. Governance Stage – Approvals, change management, compliance checks.


6. Production Stage – Canary or blue–green with observability gates.


7. Feedback Stage – Metrics, SLOs, DORA metrics.



Just like each step in a factory ensures no faulty part moves forward, each pipeline stage ensures no faulty code goes to production.”


---

🎤 3. “What challenges did you face in pipelines?” — Storytelling Version

Story Script:

“One major challenge I faced was inconsistent environments.
The pipeline passed in Jenkins, but failed in OpenShift due to different versions of runtime libraries.

Another challenge was long execution times — our test suite took 45 minutes, and developers were waiting idle.

Security scans also slowed things down; they used to run only at the end, leading to last-minute security blockers.

Also, we faced configuration drift — what was defined in Git didn’t match what was running in production.

These real-world issues taught me that pipelines need as much care as production systems.”


---

🎤 4. “How did you improve your pipelines?” — Storytelling Version

Story Script:

“I improved our pipelines by introducing three transformative changes.

First, I created reusable pipeline templates — any new team could adopt CI/CD in hours instead of days.

Second, I parallelized tasks. Earlier, unit tests, linting, and security scans were sequential; I made them run together. Build time reduced by more than half.

Third, I introduced GitOps so deployments were traceable, auditable, and rollback-ready.

As a result, our deployment frequency increased, failures dropped, and teams became more confident.”


---

🎤 5. “What steps do you follow to decommission an application?” — Storytelling Version

Story Script:

“I follow a structured 7-step decommission framework.

Recently, we had to retire a legacy mutual funds reporting system. First, I validated ownership and mapped dependencies — we discovered several hidden AutoSys jobs and two BI dashboards relying on it.

Second, we performed a risk assessment and agreed on a cutover plan with business, security, and architecture.

Third, we archived all historical data as per retention policy.

Fourth, we disabled traffic, removed DNS, deleted OpenShift deployments, databases, pipelines, and rotated certificates.

Finally, we updated CMDB and documented cost savings.

This structured approach ensured zero customer impact and full compliance.”


---

🎤 6. “Explain 2–3 complex scenarios where your contribution mattered.” — Storytelling Version


---

Scenario 1 — Kubernetes Probe Misconfiguration Causing DB Outage

“Once, we had an outage because a liveness probe was triggering a heavy DB call every 30 seconds.
Pods kept restarting… the DB connection pool got exhausted… customers couldn’t log in.

When I joined the bridge:

I identified the probe design as the root cause

Created a lightweight /healthz endpoint

Tuned readiness and liveness probes

Added connection pooling limits

Documented the best practice for all teams


Post fix — zero incidents. This became a bank-wide learning.”


---

Scenario 2 — Kafka MTLS Certificate Chain Issue

“In another case, after a certificate rotation, Kafka producers couldn’t talk to brokers. TLS handshake failed and message lag shot up.

I detected that the intermediate CA was missing in the truststore.

I automated certificate chain validation in CI and set up early expiry alerts.
What was a recurring problem turned into a zero-incident domain for the entire year.”


---

Scenario 3 — Ansible Zero-Downtime Patching

“We had 2 hours of downtime every month during patching across 150 apps.

I built a rolling-patching pipeline using Ansible:

Cordon a node

Drain workloads

Patch

Health-check

Move to next node


We achieved 100% zero downtime.
Teams kept their SLAs intact and business loved it.”


---

🎤 7. “What decisions did you take that gave long-term benefits?” — Storytelling Version

Story Script:

“One long-term decision was moving the whole organisation to GitOps.
Before GitOps, configurations drifted and debugging took forever.
After GitOps, everything was traceable and consistent.

Another decision was automating certificate lifecycle — earlier, certificate expiry was a repeating Sev1 issue. After automation, we never had another TLS outage.

Finally, I enforced an SRE Production Readiness Checklist — covering probes, autoscaling, monitoring, dashboards, SLOs, and alerting.
This single framework reduced Sev1 incidents by 40%.”


---

🎤 8. Lloyds 5 Values — Storytelling Version


---

1. We Put Customers First — Storytelling

“One evening, a high-risk release was planned.
My team was ready, but monitoring showed latency spikes.
If we deployed, customers might face login failures.

I took the call to postpone the release, even though everything was prepared.

Yes, it delayed our timeline, but it protected customer experience.
This is what putting customers first means to me.”


---

2. Be Bold — Storytelling

“When teams hesitated to adopt GitOps, I sensed fear of the unknown.

I created a small pilot with one team, proved deployment time reduction, and showed success with real data.

Taking that bold step triggered a cultural shift — within three months, four squads adopted GitOps.”


---

3. Inclusive — Storytelling

“I noticed junior engineers struggling with Kubernetes YAMLs.

Instead of leaving them behind, I made reusable templates and conducted hands-on sessions.

Suddenly everyone — junior, senior, QA — could contribute to deployments.

That’s inclusion in action: making everyone part of the journey.”


---

4. Trust Driven — Storytelling

“We once had a production outage because of a manual config change.

Instead of blaming the engineer, I conducted a blameless postmortem — focusing on the process, not the person.

We improved automation, added guardrails, and the engineer felt supported, not punished.

Trust grows when people feel safe.”


---

5. Put Community Together — Storytelling

“In my previous organisation, platform teams worked in silos — Kafka, DB, SRE, OpenShift.

During incidents, alignment was painful.

I created a weekly ‘Production Readiness Community’ where all teams shared learnings and best practices.

Within weeks, coordination became smooth, incidents reduced, and we acted like one community — not isolated islands.”


---

Below is a complete, polished, interview-ready script for all the questions Lloyds asked you today.
These answers are structured, leadership-oriented, SRE-centric, and aligned to Lloyds’ culture and values.


---

✅ 1. How do you integrate a CI/CD pipeline?

Answer (Manager-Level):

“I integrate a CI/CD pipeline by treating it as a product—with clear stages, automated governance, security gates, and measurable outcomes. My approach is:

1. Requirements & Architecture

Understand branching strategy (GitFlow / trunk-based).

Define environments (dev → test → UAT → prod).

Identify required gates (security scans, approvals, change management).


2. Build Pipeline Integration (CI)

Source code triggers build (Git webhook).

Run:

Unit tests

Static code analysis (SonarQube)

Dependency scanning (Snyk)

Container scanning (Trivy / Aqua)



3. Packaging & Artifact Management

Build container images

Store in artifact registry (ECR, JFrog, Nexus)


4. Deployment Pipeline Integration (CD)

Deploy to lower environments using:

ArgoCD / Tekton / Jenkins / Azure DevOps


Apply GitOps methodology for auditability (a minimal Application manifest sketch follows at the end of this answer).


5. Observability & Rollback

Implement logging, metrics, traces

Canary / blue–green deployments

Automated rollback if SLO degradation


6. Governance

Align with change management, security, and SRE production readiness.
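
A minimal sketch of the GitOps step (item 4), assuming ArgoCD with hypothetical repo and app names: the Application object points ArgoCD at a Git path and keeps the cluster in sync with it.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/payments-api-config.git   # hypothetical config repo
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true            # any drift in the cluster is reverted to what Git declares

Because selfHeal reverts manual drift, every production change has to flow through Git, which is the auditability point made above.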



---

✅ 2. What are different stages of a CI/CD pipeline?

Use this crisp framework:

1. Code Stage

Branching

Peer review

Pre-commit checks


2. Continuous Integration Stage

Build

Unit tests

Linting

Static code analysis


3. Security Stage

SAST

SCA / license scanning

Container scanning

Secrets detection


4. Continuous Delivery Stage

Deploy to DEV / QA

Integration testing

E2E validation


5. Approvals & Governance

Change management

Security sign-off

SRE production readiness checklist


6. Production Deployment

Canary / blue-green (a canary rollout sketch follows at the end of this answer)

Observability validation

Automated rollback


7. Continuous Feedback

Metrics

DORA KPIs

Post-deployment review
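
To illustrate stage 6: the notes mention canary/blue-green with automated rollback but do not name a tool, so the sketch below uses Argo Rollouts purely as one example, with hypothetical names and timings.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
      - name: payments-api
        image: registry.example.com/payments-api:1.4.2   # hypothetical image
  strategy:
    canary:
      steps:
      - setWeight: 10              # shift 10% of traffic to the new version
      - pause: {duration: 10m}     # watch error rate and latency before continuing
      - setWeight: 50
      - pause: {duration: 10m}

Aborting the rollout (manually or via an analysis step) returns all traffic to the stable version, which is the automated-rollback behaviour described above.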



---

✅ 3. What challenges have you faced in pipelines?

Give real-world SRE challenges:

Challenge 1: Pipeline failures due to inconsistent environments

Example: Jenkins agent versions differed → builds succeeded in some nodes but failed in others.


Challenge 2: Long pipeline execution times

Testing stage took 45 minutes → delayed deployments.


Challenge 3: Security delays

SAST scans took 20 minutes → developers bypassed security gates.


Challenge 4: Drift between Git manifests & actual cluster state

ArgoCD out of sync → wrong configurations in production.


Challenge 5: Secrets accidentally committed

Sensitive configs leaked → required secret scanning and vault integration.



---

✅ 4. How did you improve the pipelines?

Improvement 1: Standardized reusable pipeline templates

Common YAML templates for all teams

Reduced onboarding time by 40%


Improvement 2: Introduced parallel execution (see the pipeline sketch at the end of this answer)

Unit tests + static analysis + scans ran in parallel

Reduced build time from 45 → 18 minutes


Improvement 3: Shift-left security

Pre-commit SAST scans

Secrets scanning hooks

Prevented security delays later


Improvement 4: Introduced GitOps

ArgoCD-based automated deployments

No manual interventions

Eliminated configuration drift


Improvement 5: End-to-end observability gates

Added quality gates based on:

Latency

Error rate

Resource usage


Automated rollback enabled
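
A minimal sketch of improvement 2, assuming Tekton (one of the engines mentioned in these notes) and placeholder Task names: tasks that only depend on the build, and not on each other, are scheduled concurrently.

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: ci-parallel
spec:
  tasks:
    - name: build
      taskRef:
        name: build-image          # assumed pre-existing Task
    # The three tasks below all declare runAfter: [build] and nothing else,
    # so Tekton runs them in parallel once the build finishes.
    - name: unit-tests
      runAfter: [build]
      taskRef:
        name: run-unit-tests       # assumed Task
    - name: static-analysis
      runAfter: [build]
      taskRef:
        name: sonar-scan           # assumed Task
    - name: security-scan
      runAfter: [build]
      taskRef:
        name: trivy-scan           # assumed Task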



---

✅ 5. What steps do you follow to decommission any asset/application? (Enterprise-grade answer)

Use a 7-step decommission framework:

Step 1: Identify & Validate

Inventory CMDB

Ownership confirmation

Understand dependencies (API, DB, downstream jobs)


Step 2: Risk Assessment

Impact analysis

Compliance checks

Data retention requirements


Step 3: Communication & Approvals

Notify stakeholders

Get sign-off from:

Product owner

Business

Security

Architecture



Step 4: Freeze & Cutover

Disable traffic

DNS removal

Load balancers update

Stop CRON/AutoSys jobs


Step 5: Data Archival

Backup DB

Move logs to long-term storage


Step 6: Infra Decommission

Delete:

Kubernetes deployments

VMs

Databases

Certificates

Secrets

Pipelines

Storage volumes

Service accounts



Step 7: Audit & CMDB Update

Update CMDB

Cost saving documented

Final audit closure



---

✅ 6. Explain 2–3 complex scenarios where your involvement contributed

Scenario 1: OpenShift – Liveness Probe Misconfiguration Causing Outages

Problem: App restarts every 30 seconds → DB connection pool exhausted → downtime.
Root cause: Liveness probe hitting an expensive DB call.

My Contribution:

Changed probe to a lightweight /healthz endpoint

Increased initialDelaySeconds

Added connection pool limits

Set readinessProbe to prevent serving traffic until warm-up completed


Outcome:

0 outages after fix

Reduced DB CPU by 60%

Became a best practice across all squads
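
A container-spec sketch of the fix, with assumed port, paths, and timings; the point is that liveness stays cheap while readiness gates traffic until warm-up finishes.

containers:
- name: app
  image: registry.example.com/wealth-app:2.1    # hypothetical image
  livenessProbe:
    httpGet:
      path: /healthz          # lightweight endpoint, no DB call
      port: 8080
    initialDelaySeconds: 60   # allow warm-up before the first liveness check
    periodSeconds: 30
    timeoutSeconds: 5
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /ready            # returns 200 only once DB and Kafka connections are initialised
      port: 8080
    initialDelaySeconds: 30
    periodSeconds: 10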



---

Scenario 2: Kafka Cluster MTLS Failure – Production Message Lag

Problem: Producer → Broker TLS handshake failed after certificate rotation.
Impact: Lag grew → order processing delayed by 40 minutes.

My Contribution:

Diagnosed that intermediate CA certificate missing in truststore

Automated certificate chain validation in CI

Implemented cert expiry alerts + rotation pipeline


Outcome:

No TLS incidents for 12 months

Reduced TLS outage risk by 90%
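
One way to wire the chain validation into CI, sketched as a Tekton Task with an assumed openssl image and workspace layout; it fails the pipeline if the rotated broker certificate does not verify against the CA bundle.

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: validate-cert-chain
spec:
  workspaces:
    - name: certs              # workspace holding the rotated cert and CA bundle
  steps:
    - name: verify-chain
      image: alpine/openssl    # assumed image providing the openssl CLI
      script: |
        # A non-zero exit (missing intermediate, wrong CA, expired cert) fails the pipeline
        openssl verify -CAfile $(workspaces.certs.path)/ca-bundle.pem \
          $(workspaces.certs.path)/kafka-broker.pem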



---

Scenario 3: Ansible Pipeline – Zero-Downtime Patching

Problem: Weekly patching caused 2 hours downtime for 150 applications.

My Contribution:

Built an Ansible rolling patching pipeline

Per node: cordon → drain → patch/upgrade → health-check → uncordon

Validated health checks before moving to next node


Outcome:

Achieved 100% zero-downtime patching

Saved ~50 hours/month operational effort

Became enterprise standard for patching
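
A minimal Ansible sketch of the rolling flow, assuming an inventory group k8s_nodes, a control host with kubectl access, and yum/dnf-based nodes; module names are standard ansible.builtin ones.

- hosts: k8s_nodes
  serial: 1                          # patch one node at a time
  tasks:
    - name: Cordon the node so no new pods land on it
      ansible.builtin.command: kubectl cordon {{ inventory_hostname }}
      delegate_to: localhost

    - name: Drain workloads off the node
      ansible.builtin.command: >
        kubectl drain {{ inventory_hostname }}
        --ignore-daemonsets --delete-emptydir-data --timeout=600s
      delegate_to: localhost

    - name: Apply OS patches
      ansible.builtin.package:
        name: '*'
        state: latest

    - name: Reboot and wait for the node to return
      ansible.builtin.reboot:

    - name: Uncordon the node once it is healthy again
      ansible.builtin.command: kubectl uncordon {{ inventory_hostname }}
      delegate_to: localhost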



---

✅ 7. What decisions did you take that gave long-term benefits?

Decision 1: Moving to GitOps

Ensured infra + app config always stays in sync

Eliminated manual deployments

Long-term benefit: More stable, secure releases


Decision 2: Standardizing SRE Production Readiness Checklist

TLS, probes, scaling, autoscaling, logging, dashboards, SLOs

Long-term benefit: Reduced Sev1 incidents by 40%


Decision 3: Implementing Certificate Lifecycle Automation (see the sketch after this list)

Auto-renewal, auto-verification, auto-deployment

Long-term benefit: Zero certificate outages for a full year


Decision 4: Introducing DORA metrics & error budgets

Data-driven reliability

Long-term benefit: Improved change success rate by 30%
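
For decision 3, a hedged sketch of what auto-renewal can look like; the notes do not name a tool, so this uses cert-manager purely as an illustration, with hypothetical names.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kafka-broker-tls
  namespace: kafka
spec:
  secretName: kafka-broker-tls        # cert and key are written (and renewed) into this Secret
  duration: 2160h                     # 90-day certificate
  renewBefore: 360h                   # renew 15 days before expiry
  dnsNames:
    - kafka-broker.example.internal   # hypothetical broker hostname
  issuerRef:
    name: internal-ca-issuer          # assumed ClusterIssuer backed by the internal CA
    kind: ClusterIssuer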



---

✅ 8. Lloyds 5 Values — Explain each with a real-time scenario


---

1. We Put Customers First

Scenario:
During OpenShift migration, a release was planned on a Friday evening.
Risk of outage was high.

My action:
I postponed the release and asked for canary testing on Monday.

Impact:

Prevented weekend customer disruption

Reinforced customer trust



---

2. Be Bold

Scenario:
Two squads resisted GitOps fearing complexity.

My action:
I piloted GitOps with one team, proved deployment time reduction, then rolled it out.

Impact:

Faster deployments

Bold decision → changed culture across teams



---

3. Inclusive

Scenario:
Junior engineers struggled with Kubernetes YAMLs.

My action:
I built low-code templates and conducted 3 hands-on sessions.

Impact:

All skill levels contributed

Improved team velocity

Everyone felt part of the transformation



---

4. Trust Driven

Scenario:
We once had an incident where a manual config change broke production.

My action:
Instead of blaming, I conducted a blameless postmortem, focused on fixing the process.

Impact:

Team trusted leadership

We built automated config guards & audits



---

5. Put Community Together

Scenario:
Different platform teams (Kafka, OpenShift, SRE, DB, Security) operated in silos.

My action:
Created a weekly “Production Readiness Community Call” with cross-team collaboration.

Impact:

Shared best practices

Faster issue resolution

Higher reliability culture



---

⭐ Final Tip

If these points come naturally in your interview, you will immediately stand out as a senior technology leader with strong SRE + cybersecurity alignment, exactly what Lloyds looks for.


---

If you want, I can also prepare:
✔ A 2-minute introduction mixing SRE + cybersecurity + business value
✔ A storytelling script
✔ Mindmaps for easy revision
✔ A condensed one-page cheat sheet for tomorrow’s round


=================================
How do you establish trust and make the team comfortable?





Here is a perfect storytelling leadership answer that will impress Lloyds, especially because one of their core values is “We are trust-driven.”


---

🎤 How do you establish trust and make your team comfortable?

⭐ Storytelling, senior-leader, people-first answer

Story Script:

“I build trust the same way I build reliable systems — with consistency, transparency, and psychological safety.

One example from my previous team:
When I joined as a manager, the team had just come out of a stressful production incident where a junior engineer was blamed publicly for a configuration mistake.
Morale was low, and people were afraid to take ownership.

I knew trust would not come through policies — it had to come through behaviors.

So I started with three simple actions:

1. I made the environment psychologically safe

In the very next incident, instead of asking ‘Who did this?’, I asked:
‘What in the process allowed this issue to happen?’
People immediately felt safe to speak.
I made it clear:
Mistakes are learning points, not weapons.

2. I became fully transparent

Every week, I held a short session where I shared

what decisions were being made

why we were doing certain changes

what risks we were facing

how we could improve reliability
Teams felt included, not dictated to.


3. I showed trust before asking for trust

I started delegating real ownership —
like letting a mid-level engineer lead a production change with me shadowing quietly in the background.
When people see that you trust them, they rise to the occasion.

Slowly, the culture shifted.

Engineers who were silent earlier started speaking confidently.
People began volunteering for complex tasks.
Most importantly, they trusted one another — not just me.

And when the next major incident occurred, the team handled it collaboratively, calmly, and without fear.

To me, that’s how trust is built:
By showing respect, giving transparency, empowering people, and consistently standing by them especially during failures.
That’s how you transform a group of individuals into a strong, comfortable, high-performing team.”


---

⭐ Quick punchline (use in interview if needed)

“Trust is not built during the good days — it’s built during failures.
If the team sees you stand by them in tough moments, they will run with you during transformations.”


---

If you want, I can also give:
✔ A 1-minute crisp version
✔ A values-aligned version for Lloyds
✔ A real incident-based version from SRE / production support


Monday, November 24, 2025

Lloyds

Awesome, let’s lock this in so you can walk into the Lloyds SRE Manager interview with full confidence. I’ll give you:

1. A 30-minute storytelling mock interview script


2. 50 real-time scenarios (SRE, Kubernetes, observability, incidents) in a compact format


3. A downloadable cheat-sheet PDF you can revise quickly




---

1️⃣ 30-Minute Storytelling Mock Interview Script

Use this like a rehearsal. Read questions aloud and answer using the suggested flow.

Q1. “Tell me about yourself” (2–3 minutes)

Your flow:

> “I’m Raghuram, with 20+ years of experience in production support and SRE in banking and wealth management.

Currently I work at Wells Fargo as a Systems Operations Manager, responsible for more than 150+ business-critical applications – including trading platforms and financial products.

My core focus areas are:

Keeping platforms reliable and compliant using SRE practices

Running stable Kubernetes/OpenShift-based platforms

Strong incident/change/problem management

Building observability using Prometheus, Grafana, APM, and logs

Automation with CI/CD and removing manual, risky steps.


I see SRE and security as two sides of the same mission: protecting customer trust. Every outage is also a risk event, and every manual step is a control gap.

At Lloyds, I want to bring this mix of reliability, risk-thinking, and automation to help teams deliver faster, but safely and predictably.”




---

Q2. “Tell me about a major incident you led” (4–5 minutes)

Story to use: Kubernetes probe misconfiguration → DB connection pool exhaustion → DoS-like risk

Structure (STAR):

Situation:
“We had a trading application in wealth management running on Kubernetes. During peak hours, users saw random freezes of 5–7 minutes.”

Task:
“As the SRE/operations lead, I owned incident management, technical coordination, and restoring confidence.”

Action:

Correlated latency spikes in Grafana with APM traces.

Saw Kafka consumer lag increasing and DB connection pool hitting max.

Discovered readiness probe declared pods ‘ready’ before full Kafka/DB initialization.

Fixed probes, tuned DB pool, added rate limiting and PodDisruptionBudget.

Treated it as potential DoS-pattern – added alerts on abnormal resource spikes.


Result:

Issue eliminated; no repeat incidents.

Peak latency reduced by ~30–40%.

We adopted a standard “probe hardening” checklist for all services.


Cyber angle (say this explicitly):
“We classified this not just as a performance bug but as a risk scenario – resource starvation can be abused like a DoS. So we added preventive guardrails and better monitoring around that behavior.”



---

Q3. “How do you manage change risk in production?” (4–5 minutes)

Story to use: Manual DB change → schema drift → GitOps & CI/CD controls

Situation:
“Weekend batch jobs for a wealth management system failed right after a release.”

Task:
“I had to restore the batch, identify what slipped through, and fix the process.”

Action:

Traced failure to schema differences between UAT and PROD.

Found a manual DB patch had been applied directly in PROD, outside the pipeline.

Introduced:

GitOps (ArgoCD) for config and infra

DB schema validation as a pre-deployment step

Mandatory approvals & RBAC-based deployment rights

Audit trail of who changed what, when.



Result:

Eliminated schema-drift incidents in that platform.

Reduced release-related P1s significantly (you can say “by more than half, over the next quarter”).


Cyber angle:
“We treated config drift as both a stability and a control violation. After this, any change not traceable back to Git and CI/CD was simply not allowed.”



---

Q4. “How do you use SLOs, monitoring, and alerts?” (4–5 minutes)

Start with 4 Golden Signals: latency, traffic, errors, saturation.

Say you standardized:

Prometheus + Grafana dashboards for all critical apps

Application APM + logs (e.g., Splunk/ELK)

SLOs with error budgets (e.g., 99.9% availability for trading APIs).



Mini-story:

> “One platform used to raise alerts only on full outages. We changed that:

Defined SLOs for latency and error rates.

Set up warning alerts at 50% error budget consumption.

During one release, we saw early error-rate increase, auto-rollback triggered by pipeline.


This prevented a full-blown incident and protected both customer experience and risk exposure.”



Link back to security:

> “Strong observability also helps detect suspicious or abnormal behavior early. For us, performance anomalies are also potential threat indicators.”
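
A simplified sketch of the warning alert from the mini-story above, written as a standard Prometheus alerting rule; the metric, job, and label names are assumptions, and the 0.1% threshold corresponds to a 99.9% availability SLO expressed as a plain error-rate check rather than a full burn-rate calculation.

groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetBurnWarning
        # More than 0.1% of requests failed over the last hour
        expr: |
          sum(rate(http_requests_total{job="trading-api",code=~"5.."}[1h]))
            /
          sum(rate(http_requests_total{job="trading-api"}[1h])) > 0.001
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "trading-api is burning its error budget"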




---

Q5. “How do you lead war rooms and RCAs?” (4–5 minutes)

Talk through your leadership style:

Keep call structured:

Who’s on bridge? App, infra, DB, network, security if needed.

Timeboxing: 5–10 min data-collection, then decision.


Keep it blameless but accountable:

Facts first, no finger-pointing.

After incident: RCA – what failed in design, process, or controls.


Convert RCAs into:

Runbooks

Automated checks

Standard patterns (for probes, config, capacity, etc.)



Drop a short example:

> “After repetitive JVM outages, instead of blaming dev, we introduced memory leak checks in pre-prod load testing and created standard JVM tuning templates. Incidents reduced drastically.”




---

Q6. “How do you align SRE with security teams?” (3–4 minutes)

Key talking points:

“We treat availability, integrity, and confidentiality together — not separately.”

Examples:

RBAC for Kubernetes & OpenShift – only pipelines can deploy to prod.

No direct DB changes; all through approved scripts in Git.

Regular involvement of security in:

DR drills

Change governance

Arch reviews



Share a mini-story:
“For one platform, we tagged certain alerts as ‘high-risk’ (e.g., sudden CPU/memory spikes, strange traffic patterns) and these were automatically routed to both SRE and security on-call lists.”



---

Q7. “Why Lloyds and this SRE Manager role?” (2–3 minutes)

Tie it back to:

Large, regulated bank → strong focus on risk & controls

Your experience in:

Wealth management

Trading platforms

Complex production ecosystems


Say:


> “I’m excited by the chance to bring my experience handling 150+ critical applications into a place like Lloyds, where reliability and regulatory expectations are high. I enjoy building teams that can handle incidents calmly, automate aggressively, and work hand-in-hand with security to protect customer trust.”




---

2️⃣ 50 Real-Time Scenarios (Compact for Revision)

Each scenario: Problem → Detection → Root Cause → Mitigation

I’ll group them by theme so you can remember better.

A. Kubernetes / OpenShift (1–10)

1. Misconfigured readiness probe causing DB pool exhaustion

Detected via latency and DB connection graphs.

Root cause: probe marked pod ready too early.

Fix: correct probe, tune pool, add PDB and rate limits.



2. Liveness probe killing pods during slow GC

Detected via frequent restarts.

Root: aggressive timeout during full GC.

Fix: increase timeouts, add startupProbe, tune JVM.



3. Node not draining properly before maintenance

Detection: 5xx errors during node patching.

Root: missing PodDisruptionBudget.

Fix: define PDB, use cordon+drain with controlled eviction.



4. Pods scheduled on wrong nodes (no affinity)

Detection: noisy-neighbor performance issues.

Root: no nodeAffinity/resource constraints.

Fix: introduce affinity, taints/tolerations, and resource requests/limits.



5. ConfigMap change not rolled out to pods

Detection: app still using old config.

Root: no trigger to restart deployments on config change.

Fix: checksum annotations in deployment spec, or manual rollout restart.



6. ImagePullBackOff due to private registry auth issues

Detection: pod pending with ImagePullBackOff.

Root: bad imagePullSecret.

Fix: refresh credentials, centralize secret management, monitoring for pull failures.



7. Log volume explosion filling node disk

Detection: node disk alerts, kubelet issues.

Root: app logging debug in prod, no log rotation.

Fix: adjust log level, introduce log rotation, disk usage alerts.



8. CronJobs overlapping causing DB contention

Detection: DB locks / slow queries at specific times.

Root: job schedule overlap.

Fix: ‘concurrencyPolicy: Forbid’ or ‘Replace’, reschedule jobs.



9. Intermittent DNS failures inside cluster

Detection: random “host not found” in logs.

Root: CoreDNS resource limits too low.

Fix: scale CoreDNS, assign dedicated resources, add health checks.



10. OpenShift route misconfig causing SSL handshake failures

Detection: customers report SSL errors, logs show TLS handshake issues.

Root: wrong TLS termination config.

Fix: correct route configuration, standardize TLS policies.





---

B. Observability / Monitoring (11–20)

11. No single pane of glass across 150 apps

Detection: slow incident triage.

Root: fragmented tools and dashboards.

Fix: standard Grafana dashboards, naming conventions, golden signals.



12. Alert fatigue (hundreds of non-actionable alerts)

Detection: team ignores pages.

Root: too many noisy thresholds.

Fix: de-duplicate, introduce SLOs, route only actionable alerts.



13. Missing alerts for partial degradation

Detection: users complaining while monitoring is green.

Root: only up/down monitoring.

Fix: add latency and error-rate alerts, synthetic checks.



14. No correlation between app, infra, and logs

Detection: long RCA cycles.

Root: no unified tracing/correlation IDs.

Fix: standard correlation ID, integrated logs+APM+metrics.



15. Capacity issues only visible at end-of-month

Detection: spikes around EOM causing slowness.

Root: no capacity trend analysis.

Fix: capacity dashboards, predictive planning with business calendar.



16. Silent failures in batch jobs

Detection: business points out missing trades next morning.

Root: no monitoring on job outcome, only infra.

Fix: application-level SLI – records processed, failures, lag.



17. SSL certificate expiry causing outage

Detection: customers unable to connect, TLS errors.

Root: manual cert management.

Fix: centralized cert management, expiry alerts, automation.



18. Slow RCAs due to poor log search performance

Detection: 10–15 minutes to query logs.

Root: badly indexed fields, log retention design.

Fix: optimize indices, structured logging, tiered storage.



19. No business KPI monitoring (only technical)

Detection: issue where technical metrics looked fine, but revenue-impacting bug existed.

Root: missing business metrics.

Fix: add KPIs (transactions, failed orders) alongside infra metrics.



20. Prometheus scraping failures

Detection: gaps in metrics.

Root: wrong scrape configs, target changes.

Fix: service discovery, relabeling, alert on “no data”.





---

C. Incidents / Reliability / Platform (21–35)

21. JVM memory leaks causing trading app crashes

Detection: heap usage trend, GC logs.

Root: bad code path after feature release.

Fix: fix leak, add load test + chaos test, standard JVM baseline.



22. Pod restart loop after a config change

Detection: pod flaps after deployment.

Root: invalid config not validated.

Fix: config validation in pipeline, canary release.



23. DR failover not working during actual outage

Detection: DR failed when needed.

Root: DR never tested end-to-end.

Fix: regular DR drills, automated runbooks.



24. High MTTR due to unclear ownership

Detection: war rooms wasting time identifying teams.

Root: no service catalog.

Fix: build service catalog, rota mapping, clear escalation paths.



25. Repeated incidents from same root cause

Detection: problem tickets show pattern.

Root: RCAs not resulting in real changes.

Fix: problem management with action tracking, no closure without prevention steps.



26. Unpatched OS leading to stability and risk issues

Detection: vendor advisories, infra incidents.

Root: irregular patching.

Fix: patch calendar, maintenance windows, pre-flight checks.



27. Traffic spike after market news causing outage

Detection: sudden traffic surge.

Root: insufficient autoscaling policy.

Fix: HPA tuning, load test for spike scenarios.



28. Slow dependency (third-party API) causing cascading failures

Detection: app latency but infra OK.

Root: external dependency slowness, no timeouts.

Fix: timeouts, circuit breakers, fallbacks.



29. File system fill-up on shared NFS

Detection: app IO errors.

Root: no cleaning of temp files.

Fix: retention policy, monitoring, archive strategy.



30. Manual on-call handovers causing confusion

Detection: missed alerts at shift changes.

Root: unstructured handover.

Fix: standard handover notes, shared dashboards, rota tooling.



31. Configuration mismatch between environments

Detection: only prod failing.

Root: inconsistent config management.

Fix: single source of truth (Git), environment overlays.



32. Slow database queries during peak

Detection: DB CPU high, slow queries.

Root: missing indexes or bad query patterns.

Fix: query optimization, index tuning, caching.



33. Network segmentation change breaking services

Detection: sudden connectivity errors.

Root: firewall/ACL change.

Fix: pre-change testing, network observability, standard change templates.



34. Legacy job scheduler causing missed jobs

Detection: jobs randomly not running.

Root: old scheduler with no HA.

Fix: migrate to AutoSys/modern scheduler, add HA and monitoring.



35. Unexpected rollback causing data inconsistency

Detection: some users see old data.

Root: rollback of app without DB compatibility check.

Fix: backward-compatible DB changes, clear rollback strategy.





---

D. Security-Aligned / Risk / Governance (36–50)

36. Excessive production access for support engineers

Detection: audit findings.

Root: historical “everyone has access” culture.

Fix: RBAC, break-glass access, session recording.



37. Unencrypted secrets in config

Detection: config review.

Root: secrets in plain text.

Fix: secrets manager, sealed-secrets, strict reviews.



38. Shadow changes in production

Detection: change didn’t appear in change logs.

Root: direct edits on servers.

Fix: remove direct access, enforce changes only via pipeline.



39. Missing audit logs for admin operations

Detection: RCA needed proof of actions.

Root: incomplete logging.

Fix: audit logging on all admin operations.



40. Sensitive logs exposed (PII in logs)

Detection: log review.

Root: devs logging too much detail.

Fix: logging standards, PII scrubbing, lint checks in CI.



41. Broken TLS configuration on internal services

Detection: security scan.

Root: outdated cipher suites.

Fix: central TLS policy, automated config generation.



42. Third-party library vulnerability (Log4j-style)

Detection: security bulletin.

Root: no SBOM tracking.

Fix: dependency scanning in CI, SBOM, patch runbook.



43. Backups not restorable

Detection: test restore failed.

Root: only “backup success” monitored, not restore.

Fix: regular restore tests, documented RPO/RTO.



44. Inconsistent password/secret rotation

Detection: outages caused by expired credentials.

Root: manual tracking.

Fix: rotation schedule, automation via secret manager.



45. API rate limits missing

Detection: abusive clients slowed whole system.

Root: no rate limiting or quotas.

Fix: API gateway with limits, per-client quotas.



46. No segregation between lower and prod environments

Detection: data leak risk identified.

Root: prod-like data in lower envs.

Fix: data masking, separate credentials, strict access.



47. Incorrectly configured S3/object storage bucket

Detection: security review.

Root: over-permissive access policies.

Fix: least-privilege IAM policies, encryption and access logs.



48. Lack of formal risk scoring for changes

Detection: all changes treated the same.

Root: no risk-based change model.

Fix: risk scoring (low/med/high), approvals based on risk.



49. No standard onboarding for new services into SRE

Detection: services in prod with no monitoring or runbooks.

Root: no onboarding checklist.

Fix: SRE readiness checklist (monitoring, alerts, runbooks, DR, security).



50. Inadequate documentation for critical runbooks

Detection: delays in incident response.

Root: outdated or missing runbooks.

Fix: documentation standards, periodic runbook drills, versioning in Git.




Pick 5–7 of these and rehearse them deeply with full STAR answers.


---

3️⃣ Downloadable PDF Cheat Sheet

I’ve created a concise PDF you can quickly revise before the interview:

👉 Download the Lloyds SRE Manager Cheat Sheet

It includes:

Your positioning (how to introduce yourself)

A clean 2-minute intro script

5 flagship stories to reuse

STAR template

Cybersecurity-friendly phrases to repeat

Do/Don’t checklist before interview



---

If you want, next we can do a mock Q&A where you type your answer and I refine it to sound sharper and more aligned to Lloyds + cybersecurity mindset.

Monday, August 11, 2025

kube

Here’s a 50-point crisp checklist of Kubernetes production support best practices you can use in interviews — short, sharp, and easy to remember.


---

Kubernetes Production Support – 50 Best Practices

A. Cluster & Node Management

1. Keep Kubernetes version up to date with LTS and security patches.
2. Use multiple master nodes for HA (High Availability).
3. Label and taint nodes for workload segregation.
4. Use autoscaling for nodes (Cluster Autoscaler).


5. Reserve system resources on nodes using --system-reserved.


6. Regularly monitor node health via kubectl get nodes.


7. Spread workloads across zones/regions for resilience.


8. Avoid overcommitting node resources beyond safe limits.


9. Ensure OS and kernel are tuned for container workloads.


10. Apply OS-level security updates on nodes regularly.




---

B. Pod & Workload Management

11. Use resource requests and limits for all pods.


12. Configure PodDisruptionBudgets to avoid downtime during maintenance.


13. Use Readiness and Liveness probes for health checks.


14. Implement pod anti-affinity to avoid co-locating critical workloads.


15. Use init containers for dependency checks before main app starts.


16. Deploy workloads via Deployment, StatefulSet, or DaemonSet as per use case.


17. Keep images lightweight and scan for vulnerabilities.


18. Avoid running pods as root.


19. Use imagePullPolicy=IfNotPresent for stable deployments.


20. Tag images with version, not latest.




---

C. Networking & Service Management

21. Use ClusterIP for internal services, LoadBalancer/Ingress for external.


22. Secure Ingress with TLS (Let’s Encrypt or custom certs).


23. Use NetworkPolicies to control pod-to-pod communication.


24. Avoid exposing the API server publicly.


25. Keep DNS resolution stable via CoreDNS monitoring.


26. Use headless services for Stateful workloads.


27. Implement connection timeouts and retries in services.


28. Configure externalTrafficPolicy=Local for preserving client IP.


29. Limit public access to services via firewalls or security groups.


30. Load-test services before going live.




---

D. Observability & Troubleshooting

31. Integrate Prometheus & Grafana for metrics monitoring.


32. Centralize logs via ELK or Loki.


33. Enable Kubernetes audit logging for API server.


34. Set up alerts for pod restarts, CPU/memory saturation.


35. Use kubectl describe and kubectl logs for quick debugging.


36. Maintain runbooks for common incident scenarios.


37. Use kubectl top to identify resource bottlenecks.


38. Set up traceability with OpenTelemetry/Jaeger.


39. Store historical metrics for capacity planning.


40. Regularly test disaster recovery (DR) playbooks.




---

E. Security & Compliance

41. Enable Role-Based Access Control (RBAC).


42. Use namespaces for workload isolation.


43. Scan container images before deployment.


44. Use secrets in Kubernetes Secret objects, not plain env variables.


45. Rotate secrets and credentials periodically.


46. Enable API authentication and authorization.


47. Restrict kubectl exec access in production.


48. Use CIS Kubernetes Benchmark for compliance checks.


49. Enable admission controllers (PodSecurity, ValidatingWebhook).


50. Perform periodic security audits with tools like Kube-bench/Kubescape.




---

If you want, I can also convert these 50 points into an interview-ready “cheatsheet” in .docx format so you can print and revise quickly.

Do you want me to prepare that formatted docx for you now?


Sunday, August 10, 2025

50 best practices in kubernetes



---

Kubernetes Production Support – 50 Best Practices (Interview Edition)


---

A. Cluster & Node Management

1. Keep Kubernetes version up to date
❌ Old versions miss security patches → vulnerable cluster.
✅ Upgrade using kubeadm upgrade and plan downtime in stages.

kubectl version --short
kubeadm upgrade plan


2. Use multiple master nodes (HA)
❌ Single master = control plane outage if node fails.
✅ Deploy at least 3 masters in HA setup.
(YAML: kubeadm config with stacked etcd)


3. Label & taint nodes
❌ Workloads run on wrong nodes → performance/security risk.
✅ Use labels for scheduling, taints to block unwanted pods.

kubectl label node node1 role=db
kubectl taint nodes node1 dedicated=db:NoSchedule


4. Enable Cluster Autoscaler
❌ Manual scaling → delays & outages under load.
✅ Deploy autoscaler with cloud provider integration.

kubectl apply -f cluster-autoscaler.yaml


5. Reserve system resources
❌ Kubelet starved → node unstable.
✅ Add --system-reserved in kubelet config.
(kubelet config YAML)


6. Monitor node health
❌ Node failures unnoticed → pod downtime.
✅ Use kubectl get nodes + Prometheus alerts.

kubectl get nodes -o wide


7. Spread workloads across zones
❌ Zone outage takes all workloads down.
✅ Use topology spread constraints or node labels.

topologySpreadConstraints:
  - maxSkew: 1


8. Avoid overcommitting resources
❌ Pods evicted due to memory pressure.
✅ Monitor requests/limits ratio in Grafana.

kubectl top nodes


9. Tune OS/kernel for containers
❌ Network & disk latency issues.
✅ Enable cgroupv2, adjust sysctl params.

sysctl -w net.ipv4.ip_forward=1


10. Apply OS security updates
❌ Vulnerable kernel exploited.
✅ Automate patching with maintenance windows.

apt update && apt upgrade -y




---

B. Pod & Workload Management

11. Set resource requests/limits
❌ Pods hog resources → others throttled.
✅ Define CPU/memory in manifests.

resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi


12. Configure PodDisruptionBudgets
❌ All pods evicted during maintenance.
✅ Set minAvailable or maxUnavailable.

minAvailable: 2
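
Full manifest for context (the app label is an assumption):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: 2                 # keep at least 2 pods up during voluntary disruptions (drain, upgrade)
  selector:
    matchLabels:
      app: payments-api           # assumed pod label on the protected Deployment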


13. Readiness/Liveness probes
❌ Unhealthy pods still receive traffic.
✅ HTTP/TCP probes in manifest.

livenessProbe:
  httpGet:
    path: /health
    port: 8080


14. Pod anti-affinity for critical apps
❌ Critical pods on same node → single point failure.
✅ Set requiredDuringSchedulingIgnoredDuringExecution.

podAntiAffinity: ...


15. Init containers for dependencies
❌ Main app starts before DB ready.
✅ Init container checks service availability.

initContainers: ...


16. Use correct controller type
❌ Stateful apps lose data with Deployments.
✅ Use StatefulSet for stateful workloads.


17. Lightweight, scanned images
❌ Large images slow deploy, vulnerabilities possible.
✅ Use trivy/grype for scans.


18. No root containers
❌ Privilege escalation risk.
✅ securityContext.runAsNonRoot: true.


19. Use imagePullPolicy=IfNotPresent
❌ Unnecessary image pulls → deploy delays.
✅ Set in manifests.


20. Version-tag images
❌ Latest tag causes inconsistent rollouts.
✅ Use semantic version tags.




---

C. Networking & Service Management

21. Right service type
❌ Exposing internal services publicly.
✅ ClusterIP internal, LoadBalancer/Ingress for external.


22. Secure Ingress with TLS
❌ Plaintext traffic vulnerable to sniffing.
✅ TLS cert in Ingress manifest.


23. NetworkPolicies
❌ Pods can talk to everything.
✅ Allow only required traffic.
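
Example policy (namespace and labels are assumptions): only frontend pods may reach the API pods on port 8080.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api            # policy applies to these pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend        # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080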


24. No public API server
❌ Cluster takeover risk.
✅ Restrict via firewall/security groups.


25. Stable DNS via CoreDNS monitoring
❌ Service resolution failures.
✅ Alerts on CoreDNS pod health.


26. Headless services for Stateful workloads
❌ Stateful pods fail to discover peers.
✅ clusterIP: None in Service.


27. Connection timeouts/retries
❌ Hanging requests block clients.
✅ App-level configs + Istio retries.


28. externalTrafficPolicy=Local
❌ Client IP lost for logging.
✅ Set in Service manifest.


29. Limit public access
❌ Attackers exploit open services.
✅ Security groups + firewall rules.


30. Load-test before go-live
❌ Crashes under real traffic.
✅ Use k6/locust.




---

D. Observability & Troubleshooting

31. Prometheus + Grafana
❌ No performance visibility.
✅ Deploy kube-prometheus-stack.


32. Centralized logs (ELK/Loki)
❌ No log correlation during incidents.
✅ Fluentd/FluentBit collectors.


33. Enable audit logging
❌ No trace of API actions.
✅ API server --audit-log-path.


34. Alerts for restarts/resource issues
❌ Issues unnoticed until outage.
✅ Prometheus rules.


35. kubectl describe/logs
❌ Slow troubleshooting.
✅ Standard first step.


36. Runbooks
❌ Inconsistent incident handling.
✅ Confluence/Docs with steps.


37. kubectl top for bottlenecks
❌ Capacity issues unidentified.
✅ Resource tuning.


38. Distributed tracing
❌ Slow services hard to debug.
✅ Jaeger/OpenTelemetry.


39. Historical metrics
❌ No capacity planning data.
✅ Long-term storage in Thanos.


40. DR playbook testing
❌ Failover fails during disaster.
✅ Quarterly drills.




---

E. Security & Compliance

41. RBAC
❌ Users have excessive permissions.
✅ Role/RoleBinding per namespace.


42. Namespaces for isolation
❌ Cross-application interference.
✅ One namespace per app/team.


43. Image scanning
❌ Deploying vulnerable images.
✅ trivy CI scan.


44. Secrets management
❌ Credentials exposed in plain text.
✅ kubectl create secret.
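
Declarative equivalent (values are placeholders; real values should be injected via Vault/KMS rather than committed):

apiVersion: v1
kind: Secret
metadata:
  name: app-db-credentials
type: Opaque
stringData:                        # stored base64-encoded as .data by the API server
  DB_USER: app_user
  DB_PASSWORD: changeme            # placeholder only

# referenced from the container spec:
envFrom:
  - secretRef:
      name: app-db-credentials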


45. Rotate secrets
❌ Stolen creds remain valid.
✅ Automate with Vault/KMS.


46. API auth & authorization
❌ Unauthorized cluster actions.
✅ Certs, tokens, OIDC.


47. Restrict kubectl exec
❌ Attackers run commands inside pods.
✅ RBAC roles that simply do not grant pods/exec, or admission policies.


48. CIS Benchmark
❌ Non-compliant cluster.
✅ kube-bench checks.


49. Admission controllers
❌ Bad manifests deployed.
✅ PodSecurity/ValidatingWebhook.


50. Periodic security audits
❌ Vulnerabilities stay unnoticed.
✅ kubescape scans.




---

If you want, I can now convert this into a single-page “Kubernetes Production Support Interview Cheatsheet” PDF so you can print and use before interviews.
Do you want me to prepare that?


kuber



---

1) Keep Kubernetes versions up to date

What will happen if you don’t: Security vulnerabilities, incompatibilities, unsupported behavior and lack of bug fixes.

How: Follow Kubernetes upgrade policy; test upgrades in staging; upgrade control plane first, then worker nodes.

Key commands:

kubectl version --short
kubeadm upgrade plan
kubeadm upgrade apply v1.27.6

Example: (upgrade plan is CLI-driven; no YAML)



---

2) Use multiple master/control-plane nodes for HA

What will happen if you don’t: Single control-plane node becomes a single point of failure; cluster control operations may stop.

How: Deploy at least 3 control-plane nodes with etcd quorum; use external load balancer in front of API servers.

Key commands: (control plane bootstrap via kubeadm / cloud provider)

# Example: check control plane endpoints
kubectl get endpoints -n default

YAML/snippet: (LB config is infra-specific; example kubeadm init with control-plane endpoint)

kubeadm init --control-plane-endpoint "api.mycluster.example:6443" --upload-certs



---

3) Label and taint nodes for workload segregation

What will happen if you don’t: Critical pods may co-locate with noisy or untrusted workloads; scheduling may place wrong apps on wrong hardware.

How: Use kubectl label and kubectl taint to dedicate nodes (e.g., GPU, high-memory).

Key commands:

kubectl label node node01 node-role.kubernetes.io/highmem=true
kubectl taint nodes node01 dedicated=highmem:NoSchedule

YAML (Pod using nodeSelector / toleration):

spec:
  nodeSelector:
    node-role.kubernetes.io/highmem: "true"
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "highmem"
    effect: "NoSchedule"



---

4) Use Cluster Autoscaler (node autoscaling)

What will happen if you don’t: Under-provisioning during spikes causes pending pods; over-provisioning wastes cost.

How: Install Cluster Autoscaler configured per cloud provider; tune scale-up/down policies and node groups.

Key commands:

# check CA deployment
kubectl get deployment cluster-autoscaler -n kube-system
kubectl logs -f deploy/cluster-autoscaler -n kube-system

YAML (typical Deployment args excerpt):

spec:
  containers:
  - name: cluster-autoscaler
    args:
    - --cloud-provider=aws
    - --nodes=1:10:node-group-name



---

5) Reserve system resources on nodes (system-reserved/kube-reserved)

What will happen if you don’t: Kubelet and system daemons can be starved of CPU/memory causing node instability.

How: Configure kubelet flags or kubelet config to reserve CPU/memory for system and kube components.

Key commands: (edit kubelet config or systemd args, then restart kubelet)

# example check
kubectl describe node node01 | grep -i reserved

kubelet config snippet:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
systemReserved:
  cpu: "500m"
  memory: "512Mi"
kubeReserved:
  cpu: "500m"
  memory: "512Mi"



---

6) Monitor node health continuously

What will happen if you don’t: Node failures go unnoticed until apps fail; slow detection prolongs incidents.

How: Integrate Prometheus node exporters, alert on node:node_cpu:, node_memory_ and kube_node_status_condition.

Key commands:

kubectl get nodes
kubectl describe node <node>

PromQL example alert:

kube_node_status_condition{condition="Ready",status="true"} == 0



---

7) Spread workloads across zones/regions

What will happen if you don’t: AZ failure brings down many pods; reduced resilience and higher blast radius.

How: Use topologySpreadConstraints, pod anti-affinity, and multiple node pools across AZs.

Key commands: (inspect topology)

kubectl get nodes -o wide
kubectl get pods -o wide --field-selector=status.phase=Running

YAML (topologySpreadConstraints sample):

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp



---

8) Avoid overcommitting node resources

What will happen if you don’t: Nodes become CPU/memory-saturated causing OOMKill, node pressure, eviction cascades.

How: Enforce resource requests/limits, and use ResourceQuotas & LimitRanges in namespaces.

Key commands:

kubectl get resourcequota -n <ns>
kubectl describe limitrange -n <ns>

YAML (limitrange example):

apiVersion: v1
kind: LimitRange
metadata: { name: limits }
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "200m"
      memory: "256Mi"
    type: Container



---

9) Tune OS & kernel for container workloads

What will happen if you don’t: Suboptimal networking, CPU scheduling, and disk performance; noisy neighbor issues.

How: Set sysctl parameters, enable overlayfs options, tune ephemeral storage and file descriptors, set net.ipv4.ip_forward=1, etc.

Key commands: (example sysctl inspect/apply)

sysctl net.ipv4.ip_forward
sudo sysctl -w net.ipv4.ip_forward=1

Example (sysctl in Pod via securityContext / sysctls):

securityContext:
  sysctls:
  - name: net.ipv4.ip_forward
    value: "1"



---

10) Apply OS-level security updates regularly

What will happen if you don’t: Nodes become vulnerable to exploits; container runtimes and kernel exploits risk cluster compromise.

How: Patch OS in a rolling manner (cordon → drain → update → uncordon), use immutable images for hosts or managed node pools.

Key commands:

kubectl cordon node01
kubectl drain node01 --ignore-daemonsets --delete-local-data
# perform OS update on node
kubectl uncordon node01

YAML: (no YAML — operational workflow)


Alright — let’s go deep into Scenario 1 and Scenario 2, covering:

What will happen (impact & symptoms)

Why it happens (root cause)

How to fix it (step-by-step)

Commands to diagnose & resolve

YAML examples if applicable



---

Scenario 1 – Pods Stuck in CrashLoopBackOff

What will happen

Pod keeps restarting in a loop instead of running normally.

Status shows CrashLoopBackOff in kubectl get pods.

Application inside the container starts, fails quickly, and Kubernetes retries indefinitely (with backoff delay increasing each time).


Example:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-pod-1 0/1 CrashLoopBackOff 5 2m


---

Why it happens

Common causes:

1. Application code crashes (exception, missing file, bad config).


2. Wrong environment variables (DB host, credentials missing).


3. Port conflict (two processes binding same port).


4. Readiness/Liveness probes failing → Kubernetes kills and restarts container.




---

How to fix it

Step 1 – Check pod logs:

kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # previous container run

Step 2 – Describe pod for events:

kubectl describe pod <pod-name> -n <namespace>

Look for:

Probe failures

ImagePullBackOff

OutOfMemoryKilled


Step 3 – If it’s a config/env issue:

Update ConfigMap or Secret.


kubectl edit configmap <configmap-name> -n <namespace>
kubectl rollout restart deployment <deployment-name> -n <namespace>

Step 4 – If probe is too aggressive:

Relax initialDelaySeconds or timeoutSeconds.



---

YAML Example – Fixing a Liveness Probe Failure

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      containers:
      - name: sample-container
        image: myregistry.com/sample:1.0
        ports:
        - containerPort: 8080
        envFrom:
        - configMapRef:
            name: app-config
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          timeoutSeconds: 5
          failureThreshold: 5


---

✅ Quick Tip: If you just want to debug without probe restarts:

kubectl edit deployment sample-app -n demo
# Remove the livenessProbe temporarily


---

Scenario 2 – Pods Stuck in Pending

What will happen

Pods stay in Pending state, never starting containers.

Seen in:


$ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-pod-2 0/1 Pending 0 5m


---

Why it happens

1. No matching nodes (NodeSelector, Affinity, Taints prevent scheduling).


2. Insufficient resources (CPU/memory requests too high).


3. Storage issues (PVC cannot be bound to a PV).


4. Cluster Autoscaler not scaling up nodes.




---

How to fix it

Step 1 – Describe pod:

kubectl describe pod <pod-name> -n <namespace>

Look for:

0/3 nodes are available: insufficient memory

0/3 nodes are available: node(s) didn't match node selector

persistentvolumeclaim is not bound


Step 2 – If resource request is too high:

kubectl edit deployment <deployment-name> -n <namespace>
# Reduce requests under spec.containers.resources.requests

Step 3 – If PVC not bound:

Check:

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

Create matching PV if needed:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv1
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/pv1

Step 4 – If affinity/taint issue:

Remove or adjust affinity/taint rules in YAML.



---

YAML Example – Adjusting NodeSelector

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-1 # Remove if blocking scheduling
      containers:
      - name: sample-container
        image: myregistry.com/sample:1.0
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"


---

Alright, let’s go deep into Scenario 1 and Scenario 2 with:

What will happen (Impact)

How it happens (Root cause)

Troubleshooting & Fix steps

Commands (for diagnosis and fix)

YAML examples (where applicable)



---

Scenario 1 – Pods Stuck in CrashLoopBackOff

What Will Happen

Pod continuously restarts after failing to start successfully.

Application downtime until the issue is fixed.

CPU/memory usage spikes due to repeated container restarts.

In production, this may cause cascading failures if dependent services rely on this pod.



---

How It Happens

Application process exits with a non-zero status code.

Missing or incorrect environment variables.

Dependencies (DB, API) not reachable.

Readiness/liveness probes failing repeatedly.

ConfigMap/Secret values missing or wrong.



---

Troubleshooting Steps

1. Check pod status and events



kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

2. Check logs of the container



kubectl logs <pod-name> -n <namespace> --previous

3. Verify configuration files and environment variables



kubectl exec -it <pod-name> -n <namespace> -- env

4. Check readiness/liveness probes



kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 readinessProbe

5. Check dependent services



kubectl run tmp-shell --rm -it --image=busybox -- sh
# ping DB, API, etc.
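
Concretely, from inside that temporary shell (service names and ports are placeholders):

# Resolve the dependency's service DNS name, then hit its health endpoint
nslookup my-db-service.demo.svc.cluster.local
wget -qO- http://my-api-service.demo.svc.cluster.local:8080/health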


---

Example Fix YAML

If liveness/readiness probes are too strict:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10


---

Production Tip

If the issue persists but you need to stop the restart loop temporarily, scale the workload down to zero:

kubectl scale deployment <deployment-name> --replicas=0 -n <namespace>


---

Scenario 2 – ImagePullBackOff / ErrImagePull

What Will Happen

Pod never starts because the container image cannot be pulled.

Affected application remains unavailable.

If multiple pods use the same image, a large part of the service might be down.



---

How It Happens

Wrong image name, tag, or registry URL.

Image deleted or tag changed in registry.

Private registry requires credentials but none provided.

Network/DNS issues blocking access to registry.



---

Troubleshooting Steps

1. Check events for detailed pull error



kubectl describe pod <pod-name> -n <namespace>

2. Verify image name and tag



kubectl get pod <pod-name> -n <namespace> -o yaml | grep image:

3. Test registry access from inside the cluster



kubectl run tmp-shell --rm -it --image=busybox -- sh
# Inside shell
wget <registry-url>

4. Check imagePullSecrets



kubectl get secrets -n <namespace>
kubectl describe secret <secret-name> -n <namespace>
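
To double-check step 2, you can confirm the tag actually exists in the registry from a machine with registry access (assumes skopeo or docker is installed; the image reference is a placeholder):

skopeo inspect docker://<registry>/<image>:<tag>
# or
docker pull <registry>/<image>:<tag>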


---

Example Fix YAML

If using a private registry:

apiVersion: v1
kind: Pod
metadata:
  name: my-private-pod
spec:
  containers:
    - name: my-app
      image: myprivateregistry.com/app:1.0
  imagePullSecrets:
    - name: myregistry-secret

Create secret:

kubectl create secret docker-registry myregistry-secret \
  --docker-server=myprivateregistry.com \
  --docker-username=myuser \
  --docker-password=mypass \
  --docker-email=myemail@example.com \
  -n <namespace>
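
Two optional follow-ups: attach the pull secret to the namespace's default service account so every pod gets it automatically, and decode the secret to verify the stored credentials:

# Attach the pull secret to the default service account
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "myregistry-secret"}]}'

# Inspect what the secret actually contains
kubectl get secret myregistry-secret -n <namespace> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d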


---

Continuing in the same deep-dive style, here are Scenario 3 (Pending) and Scenario 4 (OOMKilled) with impact, causes, step-by-step troubleshooting, commands, YAML snippets, and quick production tips.


---

Scenario 3 – Pod Stuck in Pending

What will happen

Pod remains in Pending and never transitions to Running.

No containers start; the workload is unavailable.

Often a sign of scheduling/resource or storage problems.


How it happens (common root causes)

Insufficient cluster resources (CPU / memory) to satisfy requests.

NodeSelector / NodeAffinity / taints block scheduling.

PVC is not bound (no matching PV).

Pod topology constraints or quota limits preventing placement.

Cluster Autoscaler not configured or unable to scale.


Troubleshooting & Fix Steps

1. Describe the pod to see scheduler events



kubectl describe pod <pod-name> -n <ns>
# Look for messages like: "0/5 nodes are available: insufficient memory" or "node(s) didn't match node selector"

2. Check node capacity and available resources



kubectl get nodes -o wide
kubectl top nodes
kubectl describe node <node-name>

3. Check resource requests/limits of the pod



kubectl get pod <pod-name> -n <ns> -o yaml | yq '.spec.containers[].resources'
# or
kubectl describe pod <pod-name> -n <ns> | grep -A5 "Requests"

If requests too high → edit Deployment to lower requests.


4. Check node selectors / affinity / taints



kubectl get pod <pod-name> -n <ns> -o yaml | yq '.spec | {nodeSelector: .nodeSelector, affinity: .affinity, tolerations: .tolerations}'
kubectl get nodes --show-labels
kubectl describe node <node> | grep Taints -A2

Remove or relax overly strict selectors/affinities or add matching node labels.


5. If PVC is pending, inspect PVC/PV



kubectl get pvc -n <ns>
kubectl describe pvc <pvc-name> -n <ns>
kubectl get pv

Create a matching PV or adjust StorageClass.


6. If cluster autoscaler should add nodes, check CA logs



kubectl logs deploy/cluster-autoscaler -n kube-system

Adjust CA node-group min/max or node group configuration.


Commands to remediate (examples)

Reduce resource requests:


kubectl set resources deployment/<deploy> -n <ns> --requests=cpu=200m,memory=256Mi

Remove a nodeSelector (edit deployment):


kubectl edit deploy <deploy> -n <ns>
# remove spec.template.spec.nodeSelector section
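
Add a missing node label or remove a blocking taint (label and taint key/values are illustrative):

kubectl label nodes <node> disktype=ssd
kubectl taint nodes <node> dedicated=batch:NoSchedule-   # trailing '-' removes the taint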

Create a simple PV for PVC binding:


apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-small
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/data/pv-small

kubectl apply -f pv-small.yaml

Quick Production Tip

Enforce default requests and limits via a LimitRange, and use ResourceQuotas to prevent runaway requests that leave pods Pending; a minimal sketch of both follows.
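
Illustrative values only; tune per namespace:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: demo
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container declares no requests
      cpu: "100m"
      memory: "128Mi"
    default:                 # applied when a container declares no limits
      cpu: "500m"
      memory: "512Mi"

And a namespace-level quota to cap total consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: demo
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi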



---

Scenario 4 – OOMKilled (Container Killed Due to Out Of Memory)

What will happen

Container process is killed by the kernel (OOM Killer).

Pod restarts; repeated OOMs lead to CrashLoopBackOff or degraded service.

Memory pressure can affect co-located pods and node stability.


How it happens (common root causes)

Container memory limit too low for the workload.

Memory leak in the application.

Bursty workload without proper resource provisioning.

No limits set → node exhaustion leading to multiple pod evictions.


Troubleshooting & Fix Steps

1. Describe the pod to confirm OOMKilled



kubectl describe pod <pod-name> -n <ns> | grep -i -A5 "State" 
# Look for 'Reason: OOMKilled' in container status
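
The same check without grepping, using jsonpath (prints OOMKilled if that was the termination reason):

kubectl get pod <pod-name> -n <ns> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'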

2. Check container logs & previous logs



kubectl logs <pod-name> -n <ns>
kubectl logs <pod-name> -n <ns> --previous

3. Check resource usage



kubectl top pod <pod-name> -n <ns>
kubectl top node <node-name>

4. Inspect kubelet and syslogs on the node (if you have node access)



journalctl -u kubelet -n 200
dmesg | grep -i -E "oom|killed process"

5. If a memory leak is suspected: attach a profiler, capture a heap dump, or increase logging to trace allocations.



Commands & Remediations

Increase memory limit (imperative)


kubectl set resources deployment/<deploy> -n <ns> --limits=memory=1Gi --requests=memory=512Mi

Edit deployment (declarative)


spec:
  containers:
  - name: app
    image: myapp:1.2
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "1"

Temporarily relieve memory pressure by scaling down replicas (useful when the node itself is under pressure), or reduce incoming traffic by rate-limiting at the Ingress:


kubectl scale deploy <deploy> --replicas=1 -n <ns>
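
For the rate-limiting side, if the ingress-nginx controller is in use, a per-client request limit can be set with an annotation. A sketch with illustrative names and values:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: demo
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"   # ~10 requests/second per client IP
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: sample-app
            port:
              number: 8080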

If node OOM is observed: cordon & drain node, investigate other pods:


kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data   # older kubectl versions use --delete-local-data
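
Once the node has been investigated and is healthy again, return it to the scheduler:

kubectl uncordon <node>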

Quick Production Tip

Set both requests and limits: requests let the scheduler place the pod sensibly, while limits stop a single pod from starving its neighbours. Use monitoring (e.g., Prometheus) to alert when a container's working-set memory approaches its limit (e.g., above 80%), as sketched below.
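
A sketch of such an alert rule, assuming cAdvisor and kube-state-metrics metrics are already scraped by Prometheus (threshold and durations are illustrative):

groups:
- name: memory-alerts
  rules:
  - alert: ContainerMemoryNearLimit
    expr: |
      max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
        /
      max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
        > 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} is using more than 80% of its memory limit"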



---

Would you like me to continue with Scenario 5 (Service Not Accessible) and Scenario 6 (DNS/CoreDNS issues) next in the same format?