HPC Operations Engineering Manager | Microsoft Careers

Microsoft

Quick summary

Work type: On-site
Location: CA
Salary: $165,600–$296,400 / yr
Posted: 115 days ago
Closes: Aug 10, 2026

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay range for similar roles, $129k to $314k, with a shaded band where most similar roles pay. Similar roles: $188k. This role: $231k.]

This role pays more than 77% of similar roles. Most pay $148,500–$227,487 (the shaded band above). At the midpoint, this role pays about $231k versus about $188k for comparable roles.

Based on 240 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices. Industry: Software & Cloud Computing

Microsoft currently has 728 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 664 roles with salary data.

At a glance

TL;DR · HPC Operations Engineering Manager | Microsoft Careers

As an experienced High Performance Computing Operations Engineering Manager on Microsoft AI’s SuperIntelligence Team, you will lead a team of Site Reliability Engineers to ensure the reliability and efficiency of large-scale distributed AI infrastructure. Your daily responsibilities include designing and maintaining monitoring systems for real-time visibility into model serving pipelines, building automation tools for hybrid cloud environments, managing incident response and conducting blameless postmortems, ensuring data privacy and compliance, and collaborating with ML engineers to improve developer experience. The role requires expertise in Kubernetes, Docker, Python, Go, Bash, and public cloud platforms like Azure, AWS, or GCP, along with a solid understanding of distributed systems, networking, and storage. This startup-like team focuses on advancing humanist superintelligence, aiming for breakthroughs that benefit society through ultra-capable AI systems anchored to human values.

What you'll do

  • Lead a team of SREs to ensure uptime and fault tolerance of AI model training systems.
  • Design and maintain monitoring systems for real-time visibility into model serving pipelines.
  • Build automation tools for deployments, incident response, and scaling in hybrid cloud environments.
  • Manage on-call rotations and conduct blameless postmortems to drive continuous improvements.
  • Ensure data privacy and compliance across all stages of model training and serving.

What we're looking for

  • Bachelor's Degree in Computer Science or related field and 8+ years of technical engineering experience.
  • Extensive experience with Kubernetes, Docker, and container orchestration (6+ years).
  • Proficiency in programming/scripting languages such as Python, Go, or Bash.
  • Leadership experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
  • Expertise in monitoring & observability tools like Grafana, Datadog, OpenTelemetry.
  • Experience with high-performance computing (HPC) and workload schedulers using Kubernetes operators.

More like this

Similar roles

HPC Operations Engineer

Nvidia

Santa Clara, CA · 59 days ago · $124,000–$195,500
CentOS, RHEL, Docker, Python, Bash, Ansible, NFS, LDAP, DNS, TCP/IP, Slurm, FlexLM, Perl, InfiniBand, RDMA, RoCE, Lustre, GPFS
Hybrid

Site Reliability Engineer (HPC) | Microsoft Careers

Microsoft

Mountain View, CA · 108 days ago · $142,800–$274,800
Kubernetes, Docker, CI/CD, AWS, Azure, GCP, Terraform, Python, Go, Bash, Grafana, Datadog, OpenTelemetry, Networking, Storage, GPU, High-Performance Computing (HPC), Capacity Planning, Cost Optimization
Hybrid

Senior HPC and LSF Operations Engineer

Nvidia

Santa Clara, CA · 87 days ago · $152,000–$241,500
LSF, Slurm, Linux, CentOS, RHEL, Docker, Singularity, Podman, HPC, Reliability Engineering, Metrics Collection, Monitoring Pipelines, Alerting Strategies, Performance Dashboards, Container Technologies, Job Scheduling Systems
Hybrid

HPC User Support Engineer

Argonne National Laboratory

Remote (Lemont, IL) · 10 days ago · $69,750–$108,810
Python, C/C++, FORTRAN, UNIX, PBSPro, Git, Jenkins, Docker, MPI, OpenMP, PostgreSQL, HPC, CI/CD
Remote

HPC Systems Administration Specialist

Argonne National Laboratory

Lemont, IL · 129 days ago · $69,750–$108,810
Linux, Spack, Lmod, Singularity, Version control systems, Compilers, GCC, Intel, LLVM, Make, CMake, Autotools, Python, CI pipelines, YAML, Podman, MPI, CUDA, BLAS, FFTW

HPC Systems Administration Specialist

Argonne National Laboratory

Lemont, IL · 166 days ago · $69,750–$108,810
Linux, Spack, Lmod, Singularity, Python, CI pipelines, Make, CMake, Autotools, GCC, Intel Compilers, LLVM, YAML, Podman, Git