Principal Site Reliability Engineer | Microsoft Careers
At a glance
AI-generated TL;DR
As a Principal Site Reliability Engineer on Microsoft’s Incident Response SRE team, you will maintain the resilience and reliability of Substrate and MSAI services. Your responsibilities include leading high-severity incident response, enhancing observability through telemetry and alerting systems, defining service level indicators (SLIs) and service level objectives (SLOs), conducting live site health reviews, and translating learnings into proactive engineering practices that prevent future incidents. You will also design and execute reliability drills to validate resilience strategies and develop process documentation for incident management. The role requires experience in software engineering, network engineering, or systems administration, with a preference for experience in large-scale cloud or distributed systems. The team works on global service health management for Microsoft 365, keeping it resilient and continuously improving through data-driven practices and automation.
What you'll do
- Lead high-severity incident response and drive incidents to resolution with clear communication.
- Enhance telemetry, alerting, and dashboards using One Microsoft tooling to improve observability.
- Establish and track SLIs/SLOs for critical scenarios in partnership with engineering teams.
- Translate business requirements into metrics and actions during live site health reviews.
- Design and execute drills simulating product failures to validate resilience and recovery strategies.
What we're looking for
- Doctorate, Master's, or Bachelor's degree in Computer Science, Information Technology, or related field.
- 3+ years of technical experience in software engineering, network engineering, or systems administration.
- Experience leading high-severity incident response and driving systemic improvements.
- Strong skills in enhancing observability through telemetry, alerting, and dashboards.
- Ability to define and measure reliability using SLIs/SLOs for critical scenarios.
- 7+ years of experience working with large-scale cloud or distributed systems preferred.
Employer
About Microsoft
Microsoft Corporation is a global technology leader producing software, hardware, and cloud services, including Windows, Office 365, the Azure cloud platform, Xbox gaming, and Surface devices.
Industry: Software & Cloud Computing
Microsoft currently has 534 open roles on FindRole.
Listed pay typically runs $119,800–$234,700 across 488 roles with salary data.
Most-posted roles
- | Microsoft Careers (121)
- Principal Software Engineer | Microsoft Careers (19)
- Senior Software Engineer | Microsoft Careers (18)
- Software Engineer II | Microsoft Careers (10)
- Principal Applied Scientist | Microsoft Careers (5)