Member of Technical Staff, Site Reliability Engineer (HPC) - MAI SuperIntelligence Team | Microsoft Careers

Microsoft

Hybrid Actively hiring
Mountain View, CA Posted 103 days ago $139,900$274,800 / year

At a glance

AI generated

TL;DR

Join our High Performance Computing (HPC) infrastructure team as an experienced Site Reliability Engineer (SRE), where you will blend software engineering with systems engineering to maintain the reliability and efficiency of large-scale distributed AI infrastructure. Your daily tasks include ensuring high uptime for HPC clusters, designing monitoring and alerting systems, building automation tools for deployments and incident response, managing security compliance, and collaborating with ML engineers to enhance developer experience. Ideal candidates have strong proficiency in Kubernetes, Docker, CI/CD pipelines, public cloud platforms like Azure or AWS, and monitoring tools such as Grafana and Datadog, along with expertise in Python, Go, Bash, distributed systems, and large-scale GPU clusters for AI workloads.

Skills

Kubernetes Docker CI/CD AWS Azure GCP Terraform Python Go Bash Grafana Datadog OpenTelemetry Networking Storage GPU High-Performance Computing(HPC) Capacity Planning Cost Optimization

What you'll do

  • Ensure uptime, resiliency, and fault tolerance of HPC clusters for AI model training and inference.
  • Design and maintain monitoring systems to provide real-time visibility into GPU, cluster, storage, and networking aspects.
  • Build automation tools for deployments, incident response, scaling, and failover in CPU+GPU environments.
  • Lead on-call rotations, troubleshoot production issues, and conduct blameless postmortems to drive continuous improvements.
  • Ensure data privacy, compliance, and secure operations across model training and serving environments.

What we're looking for

  • Master's Degree in CS/IT or 2+ years of SRE/DevOps experience.
  • Bachelor's Degree in CS/IT or 4+ years of SRE/DevOps experience.
  • Proficiency in Kubernetes, Docker, and container orchestration.
  • Experience with public cloud platforms (Azure/AWS/GCP) and IaC.
  • Expertise in monitoring & observability tools like Grafana, Datadog.
  • Strong programming skills in Python, Go, or Bash.
  • Knowledge of distributed systems, networking, and storage.

Market check

Salary context

This $139,900–$274,800 range sits above 71% of similar postings on FindRole.

Peer median band

$139,900$234,700

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$167,900$213,656

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices. Industry: Software & Cloud Computing

Microsoft currently has 534 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 488 roles with salary data.

Most-posted roles

View all roles at Microsoft

More like this

Similar roles