Site Reliability Engineer, HPC

Microsoft

Quick summary

Work type
On-site
Location
US
Salary
$142,800–$274,800 / yr
Posted
134 days ago
Closes
Aug 16, 2026

Market check

Salary context

Above market

How this pay compares to similar roles

Similar $178k
This role $209k
$125k most similar roles pay here $291k

This role pays more than 73% of similar roles. Most pay $142,450–$214,000 — the shaded band above. At the midpoint, this role pays about $209k versus about $178k for comparable roles.

Based on 240 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices. Industry: Software & Cloud Computing

Microsoft currently has 377 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 355 roles with salary data.

Most-posted roles

View all roles at Microsoft

At a glance

TL;DR · Site Reliability Engineer, HPC

As a Senior Site Reliability Engineer (SRE) in the HPC team, you will ensure the reliability and availability of high-performance computing clusters used for training and inference of MAI models. Your daily tasks include designing monitoring systems, automating deployments, leading incident management, and collaborating with ML engineers to enhance developer experience. You must be proficient in Kubernetes, Docker, CI/CD pipelines, public cloud platforms like Azure or AWS, and observability tools such as Grafana and Datadog. Strong programming skills in Python, Go, or Bash are essential, along with expertise in distributed systems and networking. This role offers the opportunity to work on cutting-edge infrastructure that impacts millions of users through reliable AI deployments.

What you'll do

  • Ensure uptime and fault tolerance of HPC clusters for MAI model training.
  • Design and maintain monitoring systems for real-time visibility into HPC aspects.
  • Build automation for deployments and incident response in CPU+GPU environments.
  • Lead on-call rotations and conduct blameless postmortems to drive improvements.
  • Ensure data privacy and secure operations across all model training environments.

What we're looking for

  • 2+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
  • Strong proficiency in Kubernetes, Docker, and container orchestration tools.
  • Expertise in monitoring and observability tools like Grafana, Datadog, OpenTelemetry.
  • Hands-on experience with public cloud platforms (Azure/AWS/GCP) and infrastructure-as-code.
  • Solid knowledge of distributed systems, networking, storage, and high-performance computing.
  • Experience running large-scale GPU clusters for ML/AI workloads preferred.
  • Strong programming/scripting skills in Python, Go, or Bash.

More like this

Similar roles

Senior Site Reliability Engineer, HPC

Nvidia

Santa Clara, CA +2 17 days ago $152,000$241,500
AWS GCP OCI Kubernetes Slurm LSF CI/CD Terraform Python Go Perl Ruby Prometheus Grafana Docker Ansible GitOps AIOps PostgreSQL MySQL

Site Reliability Engineer, HPC & Automation

SpaceX

Redmond, WA today $125,000$150,000
Python Bash Linux Docker Kubernetes PostgreSQL MySQL Terraform Ansible Puppet Grafana Prometheus Jenkins Slurm NFS REST NetAppONTAP Cadence Synopsys Ansys Siemens CI/CD

Site Reliability Engineer, Hardware Infrastructure

Nvidia

Santa Clara, CA 17 days ago $184,000$287,500
SRE DevOps Python Go Prometheus Grafana CI/CD Kubernetes Terraform AWS LLM Generative AI Agentic solutions Docker Git Jenkins Ansible PostgreSQL Redis Nginx

Site Reliability Engineer

Equifax

St. Louis, Missouri +1 78 days ago
AWS GCP Terraform Jenkins Python Bash Docker Kubernetes CI/CD Prometheus PostgreSQL Linux Windows Ansible Chef
Hybrid

Site Reliability Engineer

Shopify

Europe 62 days ago
Kubernetes Docker CI/CD Python Go PostgreSQL AWS GCP Prometheus Grafana Terraform GitOps