Principal Site Reliability Engineer | Microsoft Careers

Microsoft

Actively hiring Posted this week
US Posted 3 days ago $142,800–$274,800 / year

At a glance

AI generated

TL;DR

As a Principal Site Reliability Engineer in Microsoft’s Incident Response SRE team, you will play a pivotal role in maintaining the resilience and reliability of Substrate and MSAI services. Your responsibilities include leading high-severity incident response, enhancing observability through telemetry and alerting systems, defining service level indicators and objectives, conducting live site health reviews, and translating learnings into proactive engineering practices that prevent future incidents. You will also design and execute reliability drills to validate resilience strategies and develop process documentation for incident management. This role requires expertise in software engineering, network engineering, or systems administration, with a preference for experience in large-scale cloud or distributed systems. The team manages global service health, using data-driven practices and automation to keep Microsoft 365 resilient and continuously improving.

Skills

Python, Go, Kubernetes, Docker, Terraform, AWS, CI/CD, Prometheus, Grafana, PostgreSQL, SLI/SLO, One Microsoft tooling, Blameless retrospectives, Reliability drills

What you'll do

  • Lead high-severity incident response and drive incidents to resolution with clear communication.
  • Enhance telemetry, alerting, and dashboards using One Microsoft tooling to improve observability.
  • Establish and track SLIs/SLOs for critical scenarios in partnership with engineering teams.
  • Translate business requirements into metrics and action during live site health reviews.
  • Design and execute drills simulating product failures to validate resilience and recovery strategies.

What we're looking for

  • Doctorate, Master's, or Bachelor's degree in Computer Science, Information Technology, or related field.
  • 3+ years of technical experience in software engineering, network engineering, or systems administration.
  • Experience leading high-severity incident response and driving systemic improvements.
  • Strong skills in enhancing observability through telemetry, alerting, and dashboards.
  • Ability to define and measure reliability using SLIs/SLOs for critical scenarios.
  • 7+ years of experience working with large-scale cloud or distributed systems preferred.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices. Industry: Software & Cloud Computing

Microsoft currently has 534 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 488 roles with salary data.
