Capacity & Efficiency Infrastructure | Microsoft Careers

Microsoft

Hybrid

Quick summary

Work type: Hybrid
Location: Mountain View, CA
Salary: $119,800–$234,700 / yr
Posted: 77 days ago
Closes: Sep 16, 2026

Market check

Salary context

Competitive pay

How this pay compares to similar roles

[Chart: pay scale from $106k to $248k, with a shaded band marking where most similar roles pay. Markers: similar roles $180k, this role $177k.]

This role pays more than 65% of similar roles. Most pay $152,150–$208,800 — the shaded band above. At the midpoint, this role pays about $177k versus about $180k for comparable roles.

Based on 239 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology company that produces software, hardware, and cloud services, including Windows, Office 365, the Azure cloud platform, Xbox gaming, and Surface devices.

Industry: Software & Cloud Computing

Microsoft currently has 728 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 664 roles with salary data.

At a glance

TL;DR · Capacity & Efficiency Infrastructure | Microsoft Careers

Microsoft AI is hiring a Member of Technical Staff – Capacity & Efficiency Infrastructure to improve and optimize the compute fleet that supports its AI models. The role involves designing and implementing distributed training infrastructure in Python and C++ for large GPU clusters, building telemetry systems that give visibility into performance, profiling bottlenecks across compute, memory, networking, and storage subsystems, and working with ML researchers and hardware teams to scale research efficiently. The ideal candidate has extensive experience with GPU architectures, low-level programming (CUDA, NCCL), and frameworks such as PyTorch or JAX, along with a deep understanding of networking and storage systems. The position calls for expertise in profiling large-scale distributed systems and optimizing collective communication libraries for emerging hardware, in support of frontier-scale models and humanist superintelligence research.

What you'll do

  • Design and optimize distributed training infrastructure for GPU clusters using Python and C++.
  • Build telemetry systems to monitor performance, utilization, and cost metrics in ML models.
  • Profile and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
  • Drive architectural improvements in ML services to enhance efficiency and reliability.
  • Develop tools to provide insights and recommendations for improving fleet-wide efficiency.
  • Optimize collective communication libraries for emerging hardware interconnects and topologies, such as NVLink and InfiniBand.

What we're looking for

  • 6+ years of technical engineering experience with coding in C++, Python, or related languages
  • Bachelor’s Degree in Computer Science or a related field
  • Deep understanding of GPU architectures and distributed training parallelism
  • Experience in profiling, benchmarking, and debugging large-scale distributed systems
  • Proficiency in low-level GPU programming (CUDA, Triton) and ML frameworks like PyTorch
  • Track record of contributing to high-performance computing or AI infrastructure projects
  • Knowledge of networking technologies such as InfiniBand and NVLink

More like this

Similar roles

| Microsoft Careers

Microsoft

CA · 172 days ago · $119,800–$234,700
Python, C#, C++, Rust, Java, AWS, Azure, GCP, Docker, Kubernetes, nginx, RDBMS, key-value stores, APIs, CI/CD
Hybrid

Azure Infrastructure Engineer

Northern Trust

Chicago, IL · 24 days ago · $99,600–$169,200
Azure, Terraform, Bicep, GitHub Actions, GitOps, Ruby, Java, Spring Boot, ACA (Azure Container Apps), App Services, Functions, Key Vault, App Config, SQL Server, RBAC, Entra ID, MFA, Prometheus, Grafana, CI/CD

Lead Azure Infrastructure Engineer

Lam Research

Fremont, CA · 26 days ago · $141,000–$307,000
Azure, Python, PowerShell, CI/CD, Application Insights, Azure Monitor, Log Analytics, KQL, Azure API Management, AKS, Kubernetes
Hybrid

Application Infrastructure Engineer

Booz Allen Hamilton

Fort Meade, MD · 9 days ago · $86,900–$198,000
VMware vSphere, NSX, Linux, Windows, Shell Scripting, PowerShell, Python, Cisco, Juniper, Infrastructure as Code, CI/CD

Application, Infrastructure & Service Management, VP

State Street

Burlington, MA · 4 days ago · $120,000–$202,500
Python, Perl, PowerShell, SQL, MS SQL Server, Charles River IMS, FIX, Disaster Recovery, Business Continuity, System Monitoring Software, Job Scheduling Software, High Availability Systems, Networks, Servers, Database Software