Principal Site Reliability Engineering Manager - CTJ - Secret (Cleared Environments) | Microsoft Careers

Microsoft

Actively hiring
US Posted 57 days ago $139,900–$274,800 / year

At a glance

AI generated

TL;DR

As a Principal Site Reliability Engineering Manager at Microsoft Substrate, you will lead a team responsible for building and operating critical cloud services in highly regulated environments. Your day-to-day involves developing senior engineers, ensuring operational excellence through robust software engineering practices, and embedding reliability and compliance early in the service lifecycle. You will manage incident response, drive continuous improvement using SLOs and SLIs, and coordinate disaster recovery exercises to maintain high availability and security. The role requires expertise in cloud infrastructure, automation tools, and a deep understanding of regulatory requirements for environments like GCC Moderate, GCC High, and DoD. Strong skills in software engineering, network engineering, or systems administration are essential, along with the ability to obtain necessary security clearances and background investigations.

Skills

Python, C#, Kubernetes, Terraform, AWS, Azure, CI/CD, PostgreSQL, Docker, Prometheus, Grafana, GitOps, SLOs, SLIs, DR, GCC Moderate, GCC High, Department of Defense, Tier 3 background investigation, Tier 5 background investigation, CJIS eligibility

What you'll do

  • Lead and develop a team of Site Reliability Engineers to ensure operational excellence and reliability.
  • Own the operational health and reliability posture of Substrate services in regulated environments, ensuring high availability.
  • Drive change by establishing and managing SLOs, SLIs, and operational metrics for continuous improvement.
  • Serve as an on-call engineer, leading incident response and conducting post-incident reviews to enhance resilience.
  • Embed reliability, security, and compliance considerations early in service design and deployment decisions.

What we're looking for

  • Doctorate, Master's, or Bachelor's degree in Computer Science or a related field with 2+, 3+, or 5+ years of technical experience, respectively.
  • Experience leading and developing a team of Site Reliability Engineers, fostering accountability and learning.
  • Strong software engineering fundamentals including clean code, robust telemetry, and disciplined lifecycle practices.
  • Ability to obtain and maintain Tier 3/Tier 5 background investigations for access to regulated cloud environments.
  • Proven track record in operating or supporting services in highly regulated, sovereign, or compliance-sensitive environments.

Market check

Salary context

This $139,900–$274,800 range sits above 78% of similar postings on FindRole.

Peer median band

$135,000–$230,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$149,800–$207,350

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices.

Industry: Software & Cloud Computing

Microsoft currently has 451 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 417 roles with salary data.

More like this

Similar roles

Senior Site Reliability Engineer | Microsoft Careers

Microsoft

US 106 days ago $119,800–$234,700
Azure, Kubernetes, Terraform, Python, Go, Docker, CI/CD, Prometheus, Grafana, GitOps, Infrastructure-as-Code, DNS, CDN, TLS, Certificate Lifecycle Management, Network Security, Cloud Security Controls, Identity-Driven Security Policies, Microservices Patterns, API Gateways, Global Routing Architectures, Automation Frameworks, Scripting, Distributed Tracing, Metric Analysis, Log Analysis

Site Reliability Engineer - CTJ - POLY | Microsoft Careers

Microsoft

US 97 days ago $119,800–$234,700
Azure, Kubernetes, Ansible, CI/CD, GitHub Actions, Linux, Rocky 9, Red Hat, Mariner, Python, Go, Terraform, AWS, Prometheus, Grafana, Docker, SLIs/SLOs, Chaos Engineering, Infrastructure as Code, Telemetry, Observability, Metrics, Logs, Traces, Blameless Postmortems