Position Details
Responsibilities
- Collaborate with technology leaders and stakeholders to define the SRE strategy and best practices for ensuring the reliability, scalability, and performance of critical systems and services.
- Oversee the incident management and response process within the chapter.
- Establish and enforce monitoring and alerting best practices by configuring and tuning events, logs, metrics, and traces.
- Define appropriate SLOs and SLIs with product teams.
- Encourage development and automation to streamline SRE processes, including incident response, system provisioning, monitoring, configuration management, and knowledge management.
- Analyze new services (in production or design stages) to align them with industry best practices and the CTC monitoring framework.
- Track and monitor the performance and progress of SRE-related initiatives.
- Maintain dashboards and lead operational reviews covering performance tren...