Site Reliability Engineer (HPC) | Microsoft Careers

Microsoft

Hybrid

Quick summary

Work type
Hybrid
Location
Mountain View, CA
Salary
$142,800–$274,800 / yr
Posted
108 days ago
Closes
Aug 16, 2026

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $126k to $291k, with a shaded band marking what most similar roles pay; similar-role midpoint $178k, this role's midpoint $209k]

This role pays more than 74% of similar roles. Most pay $146,125–$210,112 — the shaded band above. At the midpoint, this role pays about $209k versus about $178k for comparable roles.

Based on 240 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices. Industry: Software & Cloud Computing

Microsoft currently has 728 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 664 roles with salary data.

At a glance

TL;DR · Site Reliability Engineer (HPC) | Microsoft Careers

Join our High Performance Computing (HPC) infrastructure team as an experienced Site Reliability Engineer (SRE), where you will blend software engineering with systems engineering to maintain the reliability and efficiency of large-scale distributed AI infrastructure. Your daily tasks include ensuring high uptime for HPC clusters, designing monitoring and alerting systems, building automation tools for deployments and incident response, managing security compliance, and collaborating with ML engineers to enhance developer experience. Ideal candidates have strong proficiency in Kubernetes, Docker, CI/CD pipelines, public cloud platforms like Azure or AWS, and monitoring tools such as Grafana and Datadog, along with expertise in Python, Go, Bash, distributed systems, and large-scale GPU clusters for AI workloads.

What you'll do

  • Ensure uptime, resiliency, and fault tolerance of HPC clusters for AI model training and inference.
  • Design and maintain monitoring systems to provide real-time visibility into GPU, cluster, storage, and networking aspects.
  • Build automation tools for deployments, incident response, scaling, and failover in CPU+GPU environments.
  • Lead on-call rotations, troubleshoot production issues, and conduct blameless postmortems to drive continuous improvements.
  • Ensure data privacy, compliance, and secure operations across model training and serving environments.

What we're looking for

  • Master's Degree in CS/IT or 2+ years of SRE/DevOps experience.
  • Bachelor's Degree in CS/IT or 4+ years of SRE/DevOps experience.
  • Proficiency in Kubernetes, Docker, and container orchestration.
  • Experience with public cloud platforms (Azure/AWS/GCP) and IaC.
  • Expertise in monitoring & observability tools like Grafana, Datadog.
  • Strong programming skills in Python, Go, or Bash.
  • Knowledge of distributed systems, networking, and storage.

More like this

Similar roles

Senior Site Reliability Engineer - HPC

Nvidia

Santa Clara, CA 101 days ago $152,000–$241,500
AWS GCP OCI Kubernetes Slurm LSF CI/CD Terraform Python Go Perl Ruby Prometheus Grafana Docker Ansible GitOps AIOps PostgreSQL MySQL

| Microsoft Careers

Microsoft

WA 114 days ago $119,800–$234,700
Azure Kubernetes Terraform Python Go Docker CI/CD Prometheus Grafana GitOps Infrastructure-as-Code DNS CDN TLS Certificate Lifecycle Management Network Security Cloud Security Controls Identity-Driven Security Policies Microservices Patterns API Gateways Global Routing Architectures Automation Frameworks Scripting Distributed Tracing Metric Analysis Log Analysis

Site Reliability Engineer - CTJ - POLY | Microsoft Careers

Microsoft

US 106 days ago $119,800–$234,700
Azure Kubernetes Ansible CI/CD GitHub Actions Linux Rocky 9 Redhat Mariner Python Go Terraform AWS Prometheus Grafana Docker SLIs/SLOs Chaos Engineering Infrastructure as Code Telemetry Observability Metrics Logs Traces Blameless Postmortems