Site Reliability Engineer (HPC) | Microsoft Careers

Microsoft

Quick summary

Work type: Hybrid
Location: Mountain View, CA
Salary: $142,800–$274,800 / yr
Posted: 108 days ago
Closes: Aug 16, 2026
Nearby: 99+ roles within 25 mi

Market check

Salary context

Above market

How this pay compares to similar roles

Chart: pay scale from $126k to $291k with a shaded band where most similar roles pay; similar roles are marked at $178k and this role at $209k.

This role pays more than 74% of similar roles, most of which pay $146,125–$210,112. At the midpoint, this role pays about $209k versus about $178k for comparable roles.

Based on 240 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices.

Industry: Software & Cloud Computing

Microsoft currently has 728 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 664 roles with salary data.

At a glance

Join our High Performance Computing (HPC) infrastructure team as an experienced Site Reliability Engineer (SRE), where you will blend software engineering with systems engineering to maintain the reliability and efficiency of large-scale distributed AI infrastructure. Your daily tasks include ensuring high uptime for HPC clusters, designing monitoring and alerting systems, building automation tools for deployments and incident response, managing security compliance, and collaborating with ML engineers to enhance developer experience. Ideal candidates have strong proficiency in Kubernetes, Docker, CI/CD pipelines, public cloud platforms like Azure or AWS, and monitoring tools such as Grafana and Datadog, along with expertise in Python, Go, Bash, distributed systems, and large-scale GPU clusters for AI workloads.

Skills

Kubernetes, Docker, CI/CD, AWS, Azure, GCP, Terraform, Python, Go, Bash, Grafana, Datadog, OpenTelemetry, Networking, Storage, GPU, High-Performance Computing (HPC), Capacity Planning, Cost Optimization

What you'll do

Ensure uptime, resiliency, and fault tolerance of HPC clusters for AI model training and inference.
Design and maintain monitoring systems that provide real-time visibility into GPUs, clusters, storage, and networking.
Build automation tools for deployments, incident response, scaling, and failover in CPU+GPU environments.
Lead on-call rotations, troubleshoot production issues, and conduct blameless postmortems to drive continuous improvements.
Ensure data privacy, compliance, and secure operations across model training and serving environments.

What we're looking for

Master's Degree in CS/IT or 2+ years of SRE/DevOps experience.
Bachelor's Degree in CS/IT or 4+ years of SRE/DevOps experience.
Proficiency in Kubernetes, Docker, and container orchestration.
Experience with public cloud platforms (Azure/AWS/GCP) and IaC.
Expertise in monitoring and observability tools such as Grafana and Datadog.
Strong programming skills in Python, Go, or Bash.
Knowledge of distributed systems, networking, and storage.

Similar roles

Senior Site Reliability Engineer - HPC

Nvidia

Santa Clara, CA · 101 days ago · $152,000–$241,500

AWS, GCP, OCI, Kubernetes, Slurm, LSF, CI/CD, Terraform, Python, Go, Perl, Ruby, Prometheus, Grafana, Docker, Ansible, GitOps, AIOps, PostgreSQL, MySQL

Site Reliability Engineer - CTJ - Poly | Microsoft Careers

Microsoft

US · 7 days ago · $119,800–$234,700

Azure, Terraform, Kubernetes, Docker, Python, PowerShell, Bicep, ARM templates, Spark, Hadoop, Synapse, CI/CD, PostgreSQL, SQL Server, Azure Key Vault, Event Hubs, Microsoft 365, C#, Java

Site Reliability Engineer - Data, Cloud & Developer Experience

Blackstone Inc

New York 601 Lex · 113 days ago · $140,000–$225,000

AWS, Terraform, Python, Docker, Grafana, Prometheus, CI/CD, Kubernetes, ECS, EKS, Puppet, GitLab, Splunk

| Microsoft Careers

Microsoft

WA · 114 days ago · $119,800–$234,700

Azure, Kubernetes, Terraform, Python, Go, Docker, CI/CD, Prometheus, Grafana, GitOps, Infrastructure-as-Code, DNS, CDN, TLS, Certificate Lifecycle Management, Network Security, Cloud Security Controls, Identity-Driven Security Policies, Microservices Patterns, API Gateways, Global Routing Architectures, Automation Frameworks, Scripting, Distributed Tracing, Metric Analysis, Log Analysis

HPC Operations Engineering Manager | Microsoft Careers

Microsoft

CA · 115 days ago · $165,600–$296,400

Kubernetes, Docker, Python, Go, Bash, Azure, AWS, GCP, Terraform, CI/CD, Grafana, Datadog, OpenTelemetry, Prometheus, PostgreSQL, Redis, HPC, Slurm, Ansible, GitOps

Site Reliability Engineer - CTJ - POLY | Microsoft Careers

Microsoft

US · 106 days ago · $119,800–$234,700

Azure, Kubernetes, Ansible, CI/CD, GitHub Actions, Linux, Rocky 9, Red Hat, Mariner, Python, Go, Terraform, AWS, Prometheus, Grafana, Docker, SLIs/SLOs, Chaos Engineering, Infrastructure as Code, Telemetry, Observability, Metrics, Logs, Traces, Blameless Postmortems
