engineering · Experienced
DevOps Engineer interview questions covering CI/CD, containerization, cloud infrastructure, and monitoring for Indian tech companies.
What is the difference between a Docker image and a Docker container?
Tip: Image: read-only template with filesystem layers (like a class). Container: running instance of an image with its own writable layer (like an object). Multiple containers share the same image layers, saving disk space.
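The layering idea can be sketched in Python. This is only an analogy, not how Docker stores data: the file paths, layer names, and `start_container` helper are made up for illustration, and `ChainMap` stands in for a union filesystem.

```python
from collections import ChainMap
from types import MappingProxyType

# Hypothetical image: two read-only layers mapping file paths to contents.
base_layer = MappingProxyType({"/bin/sh": "shell binary", "/etc/os-release": "debian"})
app_layer = MappingProxyType({"/app/server.py": "v1"})


def start_container(*image_layers):
    """Stack a fresh writable layer on top of the shared read-only image layers."""
    writable = {}
    # ChainMap reads from the first mapping that has the key and writes only to the first one.
    return ChainMap(writable, *reversed(image_layers))


c1 = start_container(base_layer, app_layer)
c2 = start_container(base_layer, app_layer)
c1["/app/server.py"] = "patched"  # lands in c1's writable layer only
```

After the write, `c1` sees the patched file while `c2` and the image layer still hold `v1`, and both containers reference the same layer objects rather than copies.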
What is CI/CD? Describe the stages of a typical pipeline.
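Tip: CI: automatically build and test every commit. CD covers two practices: continuous delivery keeps every passing build releasable, usually with a manual approval before production; continuous deployment pushes every passing build to production automatically. Typical stages: source, build, unit test, integration test, security scan, staging deploy, smoke test, production deploy.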
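The fail-fast behaviour of those stages can be sketched as below. This is a toy runner, not any CI tool's API: `run_pipeline` and the stage callables are hypothetical, and each stage is assumed to return `True` on success.

```python
from typing import Callable

STAGES = [
    "source", "build", "unit test", "integration test",
    "security scan", "staging deploy", "smoke test", "production deploy",
]


def run_pipeline(steps: dict[str, Callable[[], bool]]) -> list[str]:
    """Run the stages in order and stop at the first failure; return the stages that passed."""
    passed = []
    for name in STAGES:
        if not steps[name]():
            break  # later stages never run, so a failed scan blocks every deploy
        passed.append(name)
    return passed


# Example: every stage succeeds except the security scan.
steps = {name: (lambda: True) for name in STAGES}
steps["security scan"] = lambda: False
result = run_pipeline(steps)
```

Here `result` contains only the four stages before the security scan, which is the point of ordering cheap checks ahead of deploys.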
What is Kubernetes? Explain pods, deployments, and services.
Tip: Pod: smallest deployable unit, one or more co-located containers that share a network namespace. Deployment: manages ReplicaSets, handles rolling updates and rollbacks. Service: stable virtual IP and DNS name with load balancing across the pods matched by its label selector, needed because pod IPs are ephemeral.
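Why a Service survives pod churn can be shown with a small sketch of label-selector matching. The pod records, IPs, and `select_endpoints` function are invented for illustration; the real endpoint tracking is done by the Kubernetes control plane and also considers readiness.

```python
def select_endpoints(selector, pods):
    """Return the IPs of pods whose labels include every key/value pair in the selector."""
    return [p["ip"] for p in pods if selector.items() <= p["labels"].items()]


selector = {"app": "web"}  # the Service's selector never changes

pods = [
    {"ip": "10.0.0.4", "labels": {"app": "web", "tier": "frontend"}},
    {"ip": "10.0.0.7", "labels": {"app": "db"}},
]
before = select_endpoints(selector, pods)

# The web pod is rescheduled and comes back with a new IP but the same labels.
pods[0] = {"ip": "10.0.0.9", "labels": {"app": "web", "tier": "frontend"}}
after = select_endpoints(selector, pods)
```

Clients keep using the Service name; only the endpoint list behind it changes from `10.0.0.4` to `10.0.0.9`.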
What is infrastructure-as-code (IaC)? Compare Terraform and AWS CloudFormation.
Tip: IaC: manage infra via version-controlled config files. Terraform: works across many providers, HCL language, tracks resources in a state file. CloudFormation: AWS-native, YAML/JSON, deeper AWS integration. Terraform is widely used for multi-cloud setups, though the resources for each provider still have to be written separately.
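The declarative model behind both tools is to compare the desired config with recorded state and work out the changes. A minimal sketch of that diff follows, assuming resources are plain dicts keyed by name; the `plan` function and resource names are hypothetical and far simpler than a real planner, which also resolves dependencies between resources.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Diff the desired config against recorded state and list the changes to make."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }


desired = {"vm-web": {"size": "large"}, "bucket-logs": {}}
actual = {"vm-web": {"size": "small"}, "vm-old": {}}
changes = plan(desired, actual)
```

With these inputs the plan creates `bucket-logs`, updates `vm-web`, and deletes `vm-old`. Running it again after applying would yield three empty lists, which is what makes the approach repeatable.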
What is a blue-green deployment? How does it reduce downtime?
Tip: Two identical production environments (blue = live, green = new version). Deploy to green, run smoke tests, then switch the load balancer. Rollback is near-instant: switch back to blue, provided database schema changes stay backward compatible. Drawback: 2x infrastructure cost. Canary deployments, which send a small share of traffic to the new version first, are a cheaper alternative.
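The switch-and-rollback logic can be sketched as a tiny state machine. `BlueGreenRouter` and the `smoke_test` callable are hypothetical stand-ins for a load balancer and a test suite, not a real API.

```python
class BlueGreenRouter:
    """Tracks which of two environments receives live traffic."""

    def __init__(self):
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def cut_over(self, smoke_test) -> bool:
        """Send traffic to the idle environment only if its smoke test passes."""
        if not smoke_test(self.idle):
            return False  # live traffic is untouched by a bad release
        self.live = self.idle
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment, which is still running."""
        self.live = self.idle


router = BlueGreenRouter()
failed = router.cut_over(lambda env: False)  # bad build: stays on blue
after_failed = router.live
router.cut_over(lambda env: True)            # good build: traffic moves to green
after_switch = router.live
router.rollback()
after_rollback = router.live
```

Rollback is cheap because the old environment is never torn down during the switch, which is also why the cost doubles.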
What is Prometheus? What are metrics, labels, and a PromQL example?
Tip: Prometheus is a pull-based time-series monitoring system. Metric types: counter, gauge, histogram, summary. Labels: key-value dimensions. Example: `rate(http_requests_total{status="500"}[5m])` returns the per-second rate of HTTP 500 responses, averaged over the last 5 minutes. Divide it by the rate of all requests to get an error ratio. Pair with Grafana for dashboards.
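To make the `rate()` idea concrete, here is a simplified Python approximation over raw counter samples. It is not Prometheus's implementation: it handles counter resets but skips the extrapolation to window boundaries that the real function performs, and `simple_rate` and the sample data are invented for illustration.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter from (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a process restart), so count up from zero.
        increase += (cur - prev) if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


# One scrape per minute over a 5-minute window; the counter grows by 30 each minute.
window = [(0, 100), (60, 130), (120, 160), (180, 190), (240, 220), (300, 250)]
errors_per_second = simple_rate(window)
```

The counter rose by 150 over 300 seconds, so `errors_per_second` is 0.5. The raw counter value (250) says little on its own, which is why counters are almost always queried through `rate()`.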
Tell me about a manual operational process you automated. What was the impact?
Tip: Automation impact is most convincing when quantified: "reduced deploy time from 45 minutes to 8 minutes." Name the tool (Bash script, Ansible, Terraform, GitHub Actions) and what the automation eliminated.
How do you respond to a production incident? Walk me through your process.
Tip: Alert fires, acknowledge, assess severity and blast radius, communicate (status page update), mitigate (rollback or hotfix), resolve, post-mortem. Emphasise: communicate early even when uncertain, prioritise mitigation over root cause, write blameless post-mortems.
A container is crashing every 10 minutes. How do you diagnose it?
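Tip: Check `kubectl describe pod` for events and the last container state, and `kubectl logs --previous` for logs from the crashed container. Common causes: OOMKilled (raise the memory limit or fix a leak), failed liveness probe (timeout too short, or the app is slow to respond), missing env var, application panic. Read the exit code: 137 = 128 + 9, meaning the process received SIGKILL, which is most often the OOM killer. Confirm with `Reason: OOMKilled` in the describe output.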
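Exit codes above 128 follow the shell convention of 128 plus the signal number, which a few lines of Python can decode. The `describe_exit_code` helper is hypothetical and assumes a Unix-like system where Python's `signal` module knows the signal numbers.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable cause."""
    if code == 0:
        return "exited normally"
    if code > 128:
        number = code - 128  # convention: 128 + signal number
        try:
            return f"killed by {signal.Signals(number).name}"
        except ValueError:
            return f"killed by unknown signal {number}"
    return f"application error (exit status {code})"


oom_like = describe_exit_code(137)
graceful_stop = describe_exit_code(143)
app_crash = describe_exit_code(1)
```

137 decodes to SIGKILL and 143 to SIGTERM. SIGKILL alone does not prove an OOM kill, so the pod's reported termination reason is still the deciding evidence.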
What is a Helm chart? How does it simplify Kubernetes deployment?
Tip: Helm is the Kubernetes package manager. A chart is a templated collection of Kubernetes manifests with configurable values. `helm install` renders templates with values and applies them. Useful for versioned, repeatable deployments with environment-specific overrides.
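The "defaults plus environment-specific overrides" idea can be sketched in Python. This is an analogy only: Helm uses Go templates and YAML values files, while the sketch below uses `string.Template`, plain dicts, and made-up value names (`replicaCount`, `image`).

```python
from string import Template


def merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay overrides on defaults, the way an environment file overlays chart defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Simplified stand-in for a manifest template, not valid Kubernetes YAML.
MANIFEST = Template("replicas: $replicas, image: $repo:$tag")


def render(values: dict) -> str:
    """Fill the template from a nested values dict."""
    return MANIFEST.substitute(
        replicas=values["replicaCount"],
        repo=values["image"]["repository"],
        tag=values["image"]["tag"],
    )


defaults = {"replicaCount": 1, "image": {"repository": "nginx", "tag": "1.25"}}
production = merge(defaults, {"replicaCount": 3, "image": {"tag": "1.27"}})
rendered = render(production)
```

The production override changes only the replica count and tag; the repository comes from the defaults, so one template serves every environment.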
What is the difference between horizontal and vertical scaling?
Tip: Vertical: bigger machine (more CPU/RAM). Simple, but it hits hardware limits and leaves a single point of failure. Horizontal: more machines behind a load balancer. Stateless services scale horizontally easily; stateful services (databases) need careful sharding or managed services.
Which cloud platform are you most experienced with? What services do you know best?
Tip: Be specific: name 5-8 services you have actually used (EC2, RDS, Lambda, S3, CloudWatch, IAM, EKS). Mention a real project context. "I know AWS" without specifics is a red flag.