Principal Site Reliability Engineering Manager

Microsoft

Quick summary

Work type
On-site
Salary
$142,800–$274,800 / yr
Posted
71 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $127k to $291k with a shaded band where most similar roles pay; markers at $188k (similar roles) and $209k (this role).]

This role pays more than 67% of similar roles. Most pay $159,375–$216,423 — the shaded band above. At the midpoint, this role pays about $209k versus about $188k for comparable roles.

Based on 239 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services, including Windows, Office 365, the Azure cloud platform, Xbox gaming, and Surface devices.

Industry: Software & Cloud Computing

Microsoft currently has 622 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 571 roles with salary data.

At a glance

TL;DR · Principal Site Reliability Engineering Manager

As a Principal Site Reliability Engineering Manager in Microsoft’s ES365 organization, you will lead a diverse team of SREs to improve the reliability of large-scale engineering systems used by multiple divisions. Day to day, you will partner with engineers and product managers to design and maintain reliable services, drive cross-organizational alignment through shared standards, and define and implement service level objectives (SLOs) and service level indicators (SLIs). You will also foster a culture of continuous improvement by leading incident response and conducting Engineering Service Reviews. The role requires expertise in cloud services, particularly Azure, in containerization and orchestration, and in observability practices such as metrics, logs, and tracing. You also need experience reducing toil through automation and improving operational efficiency across build, validation, and deployment systems. This position suits someone passionate about coaching and people leadership within a high-functioning team focused on customer impact and reliability at scale.

What you'll do

  • Partner with engineers to design and maintain reliable and resilient services.
  • Drive cross-organizational alignment through partnerships and shared reliability standards.
  • Build and retain a team of Site Reliability Engineers, providing mentorship and coaching.
  • Define and implement SLOs/SLIs for critical engineering systems to guide continuous improvement.
  • Lead incident management, including blameless post-incident reviews and corrective actions.
  • Drive automation to reduce operational toil and improve efficiency in build and deployment systems.

What we're looking for

  • 5+ years of experience leading large-scale initiatives involving multiple engineers.
  • Proven track record in reliability engineering for developer or platform services.
  • Experience in cross-disciplinary collaboration to align reliability priorities.
  • Expertise in architecting and operating enterprise-scale distributed cloud services.
  • Strong background in managing engineering systems processes with reliability practices.
  • Leadership in incident response, automation, and observability (metrics/logs/traces).
  • Deep understanding of containerization and orchestration technologies.

More like this

Similar roles

Principal Site Reliability Engineer

Upstart

Remote (Canada) 135 days ago $195,300–$270,400
Python Go JavaScript TypeScript Terraform Datadog Prometheus RUM LLM GenAI CI/CD Kubernetes Docker AWS GCP Service Mesh Infrastructure as Code Self-healing systems On-call management Program management

Principal Site Reliability Engineer

The Walt Disney Company

Remote 79 days ago
AWS Azure GCP Terraform CloudFormation Ansible Chef CI/CD Docker Kubernetes Prometheus Grafana Python Linux Windows AI LLM PCI DevOps SRE SLI SLO SLA

Principal Site Reliability Engineer

The Walt Disney Company

Remote (Bay Lake, FL) 72 days ago
Akamai Kona Site Defender WAF Bot Manager DevOps CI/CD Python Go Docker Terraform AWS Azure Google Cloud PostgreSQL MongoDB Redis Prometheus Grafana Kubernetes Ansible Jenkins GitLab GitHub