L1 Site Reliability Engineer responsible for monitoring, triaging, and executing standard operational tasks across enterprise applications
Supports Kubernetes, APIs, WAF, databases, API gateways (Gloo, Apigee), Kafka, and multi-cloud environments (AWS/Azure/GCP)
First line of defense for incident detection, troubleshooting, and escalation using runbooks and automation
Key Responsibilities
Monitoring & Infrastructure
- Monitor systems using Grafana, Datadog, Splunk, Prometheus, and AIOps tools
- Detect anomalies and follow alert workflows for resolution or escalation
- Validate Kubernetes issues using monitoring dashboards and logs
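As an illustration only, the "resolution or escalation" decision in an alert workflow could be sketched as below. The alert names, runbook IDs, and the one-attempt policy are hypothetical placeholders, not part of any actual tooling or process for this role.

```python
from dataclasses import dataclass

# Hypothetical mapping of alert names to runbook IDs (illustrative only).
RUNBOOKS = {
    "pod_crash_loop": "RB-001-restart-deployment",
    "high_api_latency": "RB-002-check-gateway-health",
}

# Assumed policy: try the runbook once, then escalate.
MAX_ATTEMPTS = 1


@dataclass
class Alert:
    name: str
    runbook_attempts: int = 0  # times the runbook has already been tried


def next_action(alert: Alert) -> str:
    """Return the next step for an alert: run a runbook or escalate."""
    runbook = RUNBOOKS.get(alert.name)
    if runbook is None:
        # No predefined procedure exists, so L1 hands the alert off.
        return "escalate: no runbook for this alert"
    if alert.runbook_attempts >= MAX_ATTEMPTS:
        # The procedure was already tried and the alert is still firing.
        return "escalate: runbook did not resolve the alert"
    return f"run {runbook}"
```

For example, `next_action(Alert("pod_crash_loop"))` selects the restart runbook, while an unknown alert name or a repeated failure results in escalation.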
Runbook Execution
- Follow predefined runbooks for incident resolution
- Restart services, validate system health, and escalate when procedures fail
- Ensure adherence to operational standards
- Perform initial incident triage and severity classificatio...