Site Reliability Engineer, HPC

Microsoft

Quick summary

Work type: On-site
Location: US
Salary: $142,800–$274,800 / yr
Posted: 134 days ago
Closes: Aug 16, 2026
Nearby: 53 roles within 25 mi

Market check

Salary context

Above market

How this pay compares to similar roles

Similar $178k

This role $209k

$125k most similar roles pay here $291k

This role pays more than 73% of similar roles. Most pay $142,450–$214,000 — the shaded band above. At the midpoint, this role pays about $209k versus about $178k for comparable roles.

Based on 240 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices. Industry: Software & Cloud Computing

Microsoft currently has 377 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 355 roles with salary data.

Most-posted roles

View all roles at Microsoft

At a glance

TL;DR · Site Reliability Engineer, HPC

Role Posting Log in to save

As a Senior Site Reliability Engineer (SRE) in the HPC team, you will ensure the reliability and availability of high-performance computing clusters used for training and inference of MAI models. Your daily tasks include designing monitoring systems, automating deployments, leading incident management, and collaborating with ML engineers to enhance developer experience. You must be proficient in Kubernetes, Docker, CI/CD pipelines, public cloud platforms like Azure or AWS, and observability tools such as Grafana and Datadog. Strong programming skills in Python, Go, or Bash are essential, along with expertise in distributed systems and networking. This role offers the opportunity to work on cutting-edge infrastructure that impacts millions of users through reliable AI deployments.

Skills

Kubernetes Docker CI/CD Azure AWS GCP Terraform Python Go Bash Grafana Datadog OpenTelemetry Networking Storage GPU HPC Capacity_Planning Cost_Optimization

What you'll do

Ensure uptime and fault tolerance of HPC clusters for MAI model training.
Design and maintain monitoring systems for real-time visibility into HPC aspects.
Build automation for deployments and incident response in CPU+GPU environments.
Lead on-call rotations and conduct blameless postmortems to drive improvements.
Ensure data privacy and secure operations across all model training environments.

What we're looking for

2+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
Strong proficiency in Kubernetes, Docker, and container orchestration tools.
Expertise in monitoring and observability tools like Grafana, Datadog, OpenTelemetry.
Hands-on experience with public cloud platforms (Azure/AWS/GCP) and infrastructure-as-code.
Solid knowledge of distributed systems, networking, storage, and high-performance computing.
Experience running large-scale GPU clusters for ML/AI workloads preferred.
Strong programming/scripting skills in Python, Go, or Bash.

Similar roles

Senior Site Reliability Engineer, HPC

Nvidia

Santa Clara, CA +2 17 days ago $152,000–$241,500

AWS GCP OCI Kubernetes Slurm LSF CI/CD Terraform Python Go Perl Ruby Prometheus Grafana Docker Ansible GitOps AIOps PostgreSQL MySQL

Save