Friday, August 1, 2025

Interview questions on Lloyds (SRE)

1) What is an inode in Unix?
2) What is the difference between su and su -?
3) What is the difference between sed and awk, and how have you used them?
4) Have you ever set up error budgeting?
5) Have you given a hard message to a team member who is an inconsistent performer? Give an example.
6) What type of blameless postmortem have you done? Explain with a real-time scenario.
7) Difference between ConfigMap vs Secret?
8) What are the 4 golden signals of SRE?
9) What is throughput?
10) Have you ever implemented SLI, SLO, SLA?
11) What is a glass table in Splunk?


Here are detailed answers for each of these interview questions, with real-world scenarios, commands, and SRE/Unix best practices (the glass table question is covered at the end):


---

1) What is an inode in Unix?

An inode (Index Node) is a data structure in a Unix filesystem that stores information about a file except for its name and actual data. It includes:

File type (regular, directory, symlink, etc.)

Permissions (rwx)

Owner (UID), group (GID)

File size

Timestamps (last access, modification, and inode change times)

Pointers to data blocks on disk


🔧 Command to check inode usage:

df -i # Shows inode usage for each mount point

🧩 Real-time scenario:
We faced a production issue where our application couldn’t create new files, even though df -h showed sufficient space. Running df -i revealed inode exhaustion due to millions of small log files in one directory. We archived and deleted older logs, then implemented logrotate.
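A quick way to find which directories are consuming inodes (a minimal sketch; /var/log is an illustrative path):

for d in /var/log/*/; do printf "%s %s\n" "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"; done | sort -rn | head   # list directories by file count, highest first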


---

2) What is the difference between su and su -?

su username: Switches user but retains current environment variables, such as PATH.

su - username: Switches user and loads the target user's full login environment (~/.bash_profile, ~/.profile, etc.)


🧩 Real-world implication:
I used su to switch to oracle user and noticed missing Oracle environment variables. After switching with su - oracle, the environment was correctly loaded, allowing DB scripts to run properly.
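A quick way to see the difference (a sketch; the oracle user and ORACLE_HOME are just the example from the scenario above):

su oracle -c 'echo $ORACLE_HOME'    # usually empty: the caller's environment is kept
su - oracle -c 'echo $ORACLE_HOME'  # populated: oracle's login profile is sourced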


---

3) What is the difference between sed and awk? How have you used them?

sed: A stream editor for text substitution and simple edits line by line.

awk: A full text processing language, suitable for column-based data extraction, summaries, filtering, and reports.


🛠️ Usage examples:

sed: Replace IP in config


sed -i 's/192\.168\.0\.1/10\.0\.0\.1/g' config.yaml   # dots escaped so they match literally

awk: Get CPU usage from top


top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}' # Sum of user+system CPU

🧩 Scenario:
I used awk in a cron job to monitor log file growth:

ls -l /var/log/*.log | awk '{sum += $5} END {print sum/1024/1024 " MB"}'


---

4) Have you ever set up error budgeting?

✅ Yes.

The error budget is the difference between 100% and your defined SLO (e.g., a 99.9% SLO leaves a 0.1% budget). It gives you a quantified allowance for failure and guides feature-release decisions.

🧩 Scenario: At Wells Fargo, we defined an SLO of 99.95% for a client-facing wealth dashboard. We monitored uptime with Prometheus + Blackbox exporter. During a bad deploy that caused 45 minutes of downtime in a month (exceeding our error budget), we paused new releases and conducted a root cause analysis to improve our release pipeline with better canary checks.
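The budget itself is simple arithmetic; a minimal sketch (assumes a 30-day month):

awk 'BEGIN { slo=99.95; mins=30*24*60; printf "error budget: %.1f minutes/month\n", mins*(100-slo)/100 }'   # ~21.6 minutes, so a 45-minute outage blows the budget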


---

5) Have you given a hard message to a team member who is an inconsistent performer? Example?

✅ Yes.

🧩 Scenario:
A team member consistently missed SLAs during P1 incidents and delayed handovers during shift changes. I scheduled a one-on-one, gave constructive feedback with examples, and asked about underlying blockers. We agreed on an improvement plan — paired him with a mentor and tracked performance weekly. Over time, he improved and regained confidence.

🔑 Key: Focused on behavior, not personality. Used data and empathy.


---

6) What type of blameless postmortem have you done? Real-time scenario?

✅ Blameless postmortems focus on fixing the system, not blaming individuals.

🧩 Scenario:
An app failed during a patch window due to a missed configuration in an Ansible playbook. In the postmortem:

We avoided finger-pointing.

Identified that code review didn’t include infra config changes.

Introduced a checklist-based deployment with automated playbook dry-runs (see the sketch below).

Added a “pre-patch simulation” pipeline in Jenkins.


🎯 Result: We improved team trust and reduced similar issues in future patches.
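For the automated dry-run step above, a minimal sketch using Ansible's check mode (the playbook name is illustrative):

ansible-playbook --check --diff patch_app.yml   # reports what would change without applying anything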


---

7) Difference between ConfigMap vs Secret?

Feature | ConfigMap | Secret
Purpose | Store non-sensitive config data | Store sensitive data like passwords
Encoding | Plaintext | Base64-encoded (not encrypted)
Use case | URLs, log levels, feature switches | DB passwords, API tokens
Mounted as | Volume or environment variable | Volume or environment variable


🧩 Scenario:
In OpenShift, I used:

ConfigMap for log levels and feature toggles.

Secret for DB credentials and SSL certs, with RBAC restricted access.
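A minimal sketch of how the two were created (object names and values are illustrative; oc works the same way on OpenShift):

kubectl create configmap app-config --from-literal=LOG_LEVEL=INFO --from-literal=FEATURE_X=on
kubectl create secret generic db-credentials --from-literal=DB_PASSWORD='changeme'   # stored base64-encoded, not encrypted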



---

8) What are the 4 golden signals of SRE?

1. Latency – Time taken to serve a request.


2. Traffic – Number of requests.


3. Errors – Rate of failed requests.


4. Saturation – Resource usage (CPU, memory, etc.)



🧩 Scenario:
Using Prometheus + Grafana, I built dashboards for a Spring Boot app:

Latency: Tracked the http_server_requests_seconds metric from the Spring Boot actuator Prometheus endpoint.

Errors: Alerted if 5xx errors > 5% over 10 mins.

Saturation: Monitored JVM heap and container memory.
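For the error-rate signal, a rough log-based sketch of the same check (we actually alerted from Prometheus; an access-log format with the status code in field 9 is an assumption):

tail -n 10000 access.log | awk '{ total++; if ($9 ~ /^5/) err++ } END { printf "5xx rate: %.2f%%\n", (err/total)*100 }'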



---

9) What is throughput?

Throughput is the rate of successful requests handled by a system over time (e.g., requests/sec).

🧩 Scenario:
In load testing an API with JMeter, we measured:

Max throughput = 1500 requests/sec before hitting 95% CPU.

Post-optimization with caching, increased to 2200 requests/sec.
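Throughput is just successful requests divided by elapsed time; a minimal sketch with numbers matching the scenario above:

awk 'BEGIN { requests=540000; seconds=360; printf "%.0f requests/sec\n", requests/seconds }'   # 1500 req/s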



---

10) Have you ever implemented SLI, SLO, SLA?

✅ Yes.

Term | Stands for | Definition | Example
SLI | Service Level Indicator | The actual measured metric | 99.93% uptime
SLO | Service Level Objective | The internal target | ≥ 99.95% uptime
SLA | Service Level Agreement | External contract with penalties | 99.9% uptime or refund


🧩 Scenario: We defined:

SLI: API availability measured via Prometheus.

SLO: 99.95% monthly uptime.

SLA: 99.9% in client contract.


We used Grafana alerting and tracked SLO burn rate. If SLO dipped, we paused releases and triggered reliability reviews.


---


Below is an enhanced version of the same questions, each with:

✅ Concept Explanation
✅ Real-time Scenario
✅ Mitigation/Prevention Steps


---

✅ 1) What is an inode in Unix?

📘 Explanation:
An inode is a data structure storing file metadata (owner, permissions, timestamps, data pointers), but not the filename.

🧩 Real-time Scenario:
Disk usage (df -h) was fine, but we got “No space left on device”. Running df -i showed inodes were exhausted due to millions of log files.

🛠 Mitigation Steps:

Use find to archive/delete old files:

find /logs -type f -mtime +7 -exec gzip {} \;

Set up log rotation using logrotate (see the sketch below).

Monitor inode usage:

df -i
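For the logrotate step above, a minimal policy sketch (the path, retention, and file name are illustrative):

cat > /etc/logrotate.d/app <<'EOF'
/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}
EOF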



---

✅ 2) What is the difference between su and su -?

📘 Explanation:

su: Switches user but retains current environment.

su -: Loads the target user’s full environment (like a fresh login).


🧩 Real-time Scenario:
A script failed under oracle user due to missing ORACLE_HOME. Worked fine with su - oracle.

🛠 Mitigation Steps:

Always use su - username when user-specific environments are required.

Add user env variables to .bash_profile.



---

✅ 3) What is the difference between sed and awk? How have you used them?

📘 Explanation:

sed: Stream editor for inline edits.

awk: Column-based pattern scanning & processing.


🧩 Real-time Scenario: Used awk to extract CPU info from top, and sed to update config files dynamically during deployments.

🛠 Mitigation Steps:

Validate regex in test environments before applying.

Add version control (Git) around scripts using sed/awk.

Use backups before performing mass updates:

sed -i.bak 's/old/new/g' config.txt



---

✅ 4) Have you ever set up error budgeting?

📘 Explanation:
Error Budget = 100% - SLO. It defines how much unreliability a service can tolerate before halting changes.

🧩 Real-time Scenario:
An SLO of 99.95% allowed ~22 minutes of downtime/month. After a failed release caused 40 mins of downtime, we froze releases.

🛠 Mitigation Steps:

Create burn rate dashboards.

Implement canary deployments and rollback strategies.

Add feature flag control for production toggling.

Use SLI-based gating in CI/CD pipelines.



---

✅ 5) Have you given a hard message to an inconsistent team member?

📘 Explanation:
Effective leadership involves giving difficult feedback constructively.

🧩 Real-time Scenario:
A team member ignored handoff protocols and missed several alerts. I held a 1-on-1, discussed specific examples, and set an improvement plan.

🛠 Mitigation Steps:

Track metrics: alert responses, ticket SLAs.

Document coaching conversations.

Use PIP (Performance Improvement Plan) if necessary.

Recognize improvement publicly to encourage morale.



---

✅ 6) What type of blameless postmortem have you done?

📘 Explanation:
Blameless postmortems analyze failures by fixing process gaps, not blaming people.

🧩 Real-time Scenario:
A missed variable in an Ansible playbook brought down a production app. We found no validation checks before the change.

🛠 Mitigation Steps:

Implement peer review and approval workflow.

Add automated playbook dry-runs before applying changes.

Maintain a postmortem template (summary, impact, root cause, corrective actions).



---

✅ 7) Difference between ConfigMap vs Secret?

Feature | ConfigMap | Secret
Purpose | Non-sensitive data | Sensitive data (passwords, tokens)
Encoding | Plaintext | Base64-encoded
Security | Readable by anyone with access | Can be encrypted at rest (when a KMS/encryption provider is configured)


🧩 Real-time Scenario:
Used Secrets for DB credentials; ConfigMaps for log levels and toggles. Accidentally exposed a Secret in logs during debugging.

🛠 Mitigation Steps:

Restrict access using RBAC.

Enable automatic secret rotation.

Use sealed-secrets or HashiCorp Vault in production.

Avoid dumping secret values (e.g., kubectl get secret -o yaml) on shared screens or in logs.
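Worth remembering that Base64 is encoding, not encryption; anyone with read access to the Secret can recover the value (object and key names are illustrative):

kubectl get secret db-credentials -o jsonpath='{.data.DB_PASSWORD}' | base64 -d   # prints the plaintext password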



---

✅ 8) What are the 4 golden signals of SRE?

1. Latency – Time to serve a request.


2. Traffic – Request volume.


3. Errors – Failed requests.


4. Saturation – Resource usage (CPU, memory).



🧩 Real-time Scenario:
High CPU usage caused elevated latency. Grafana showed a drop in throughput and a spike in errors.

🛠 Mitigation Steps:

Set threshold alerts for each signal.

Autoscale pods based on CPU & memory.

Implement caching and circuit breakers.

Review signal dashboards weekly with Dev/Infra teams.



---

✅ 9) What is throughput?

📘 Explanation:
Throughput is the number of requests a system can handle per second (RPS or TPS).

🧩 Real-time Scenario:
An API slowed down during load testing. Throughput dropped from 1200 to 400 RPS due to lack of DB connection pooling.

🛠 Mitigation Steps:

Add caching layers (e.g., Redis).

Optimize DB queries.

Use load balancers and rate limiting.

Scale horizontally with HPA (Horizontal Pod Autoscaler).
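For the HPA step above, a minimal sketch (the deployment name and limits are illustrative):

kubectl autoscale deployment api --min=3 --max=10 --cpu-percent=70   # scale out when average CPU exceeds 70%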



---

✅ 10) Have you ever implemented SLI, SLO, SLA?

Term | Description | Example
SLI | Metric being measured | 99.93% API uptime
SLO | Target for the metric | ≥99.95% uptime
SLA | Contractual obligation | ≥99.9% uptime or customer penalty


🧩 Real-time Scenario:
SLI: Uptime % from Prometheus
SLO: 99.95% monthly
SLA: 99.9% with customer refunds clause

🛠 Mitigation Steps:

Integrate SLI dashboards with alerts.

Pause releases when error budget is exhausted.

Conduct quarterly SLO reviews with business.

Automate reporting for compliance (SLA breach reports).
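A minimal sketch of the burn tracking (the downtime figure is illustrative; assumes a 30-day month):

awk 'BEGIN { slo=99.95; budget=30*24*60*(100-slo)/100; used=12; printf "budget %.1f min, consumed %.0f%%\n", budget, used/budget*100 }'   # 12 minutes of downtime = ~56% of the monthly budget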



---



✅ 11) What is a Glass Table in Splunk?

A Glass Table in Splunk is a custom, real-time, visual dashboard used to monitor services, infrastructure, applications, or business processes in a way that reflects how the business sees them. It is especially useful for SREs and platform teams to communicate operational health in a single pane of glass view.


---

🧩 Real-World Analogy

Think of a Glass Table as a control room dashboard in an airport:

Each dial shows runway status, fuel supply, weather, etc.

Everything is connected, visual, and easy to act upon.


In Splunk:

The airport systems (runways, fuel, weather) = your microservices and infrastructure.

The dashboard = a glass table.

The health metrics = Splunk queries/data models.



---

🔧 How is it Used in Real-Time Scenarios?

✅ Example Use Case:

Monitoring a Wealth Management Platform in production.

🔹 Top-level nodes: App tiers like UI, middleware, DB, APIs.

🔹 Overlays: KPIs like latency, availability, error rates, queue depth.

🔹 Dynamic color coding: Green (healthy), Yellow (warning), Red (critical).

🔹 Click-throughs: Clicking on a red component links to the corresponding Splunk dashboard, log search, or alert.



---

⚙️ Key Components

Component | Purpose
Data sources | Splunk metrics, logs, alerts, service KPIs
Visual shapes | Circles, squares, lines representing components
Thresholds | Color changes based on SLOs or KPI thresholds
Drilldowns | Links to other dashboards or searches



---

🔄 Step-by-Step to Create a Glass Table

1. Go to ITSI (IT Service Intelligence) in Splunk.


2. Navigate to Glass Tables > Create New Table.


3. Drag & drop shapes to represent services, infra, tiers.


4. Bind metrics using KPIs or SPL queries.


5. Configure thresholds to define color changes.


6. Add click-throughs or links to more detailed views.


7. Save and share with stakeholders.




---

🛠️ Example: Displaying API Latency & Errors

Add a circle for the API Gateway.

Bind with:

avg(latency) by API

Set thresholds:

Green < 100 ms

Yellow = 100–300 ms

Red > 300 ms


Add another visual element showing:

count(status=5xx) by API

Link both to the root-cause dashboard.



---

🧯 Mitigation/Operational Benefit

Fast RCA (Root Cause Analysis)

Business-aligned monitoring: Shows impact to services, not just metrics

Stakeholder communication: Product teams can see issues visually

Prioritization: Helps SREs focus on red zones first



---

🧠 Tips:

Integrate with Service Analyzer KPIs.

Keep it simple but meaningful (avoid overloading).

Use historical drilldowns to analyze SLO burn rate.



---



