Capacity & Efficiency Infrastructure

Microsoft

Hybrid

Quick summary

Work type: Hybrid
Location: Mountain View, CA
Salary: $119,800–$234,700 / yr
Posted: 99 days ago
Closes: Sep 16, 2026

Market check

Salary context

Competitive pay

How this pay compares to similar roles

[Chart: pay scale from $106k to $248k, with a shaded band marking where most similar roles pay. Markers: similar roles at $176k, this role at $177k.]

This role pays more than 56% of similar roles. Most pay $140,987–$210,312 — the shaded band above. At the midpoint, this role pays about $177k versus about $176k for comparable roles.

Based on 240 similar postings.
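
The midpoint figure quoted above can be checked with a short calculation. This is a minimal sketch that assumes the midpoint is the simple average of the listed minimum and maximum; the listing does not say how the figure is derived, and the function and variable names are illustrative only.

```python
def midpoint(low: float, high: float) -> float:
    """Return the simple average of a pay range (assumed definition of 'midpoint')."""
    return (low + high) / 2


# Figures quoted in the listing.
role_low, role_high = 119_800, 234_700   # this role's listed salary range
band_low, band_high = 140_987, 210_312   # range where most similar roles pay

role_mid = midpoint(role_low, role_high)

# Check whether the role's midpoint falls inside the typical band.
in_band = band_low <= role_mid <= band_high

print(f"Role midpoint: ${role_mid:,.0f}")
print(f"Within typical band: {in_band}")
```

Under that assumption the midpoint rounds to the $177k shown in the comparison and sits inside the band where most similar roles pay.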

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices. Industry: Software & Cloud Computing

Microsoft currently has 622 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 571 roles with salary data.

At a glance

TL;DR · Capacity & Efficiency Infrastructure

Microsoft AI is hiring a Member of Technical Staff – Capacity & Efficiency Infrastructure to improve and optimize the compute fleet that supports its cutting-edge AI models. The role involves designing and implementing distributed training infrastructure in Python and C++ for large GPU clusters, building telemetry systems for performance visibility, profiling bottlenecks across compute, memory, networking, and storage subsystems, and collaborating with ML researchers and hardware teams to scale research efficiently. The ideal candidate has extensive experience with GPU architectures, low-level GPU programming (CUDA), collective communication libraries (NCCL), and frameworks like PyTorch or JAX, along with a deep understanding of networking and storage systems. The position calls for expertise in profiling large-scale distributed computing systems and optimizing collective communication libraries for emerging hardware technologies, in support of frontier-scale models and humanist superintelligence research.

What you'll do

  • Design and optimize distributed training infrastructure for GPU clusters using Python and C++.
  • Build telemetry systems to monitor performance, utilization, and cost metrics in ML models.
  • Profile and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
  • Drive architectural improvements in ML services to enhance efficiency and reliability.
  • Develop tools to provide insights and recommendations for improving fleet-wide efficiency.
  • Optimize collective communication libraries for emerging hardware topologies and interconnects such as NVLink and InfiniBand.

What we're looking for

  • 6+ years of technical engineering experience with coding in C++, Python, or related languages
  • Bachelor’s Degree in Computer Science or a related field
  • Deep understanding of GPU architectures and distributed training parallelism
  • Experience in profiling, benchmarking, and debugging large-scale distributed systems
  • Proficiency in low-level GPU programming (CUDA, Triton) and ML frameworks like PyTorch
  • Track record of contributing to high-performance computing or AI infrastructure projects
  • Knowledge of networking technologies such as InfiniBand and NVLink

More like this

Similar roles

Senior Infrastructure Capacity Engineer

F5 Inc

Remote (Seattle, WA) +4 · 3 days ago · $161,600–$242,400
Python, Prometheus, Grafana, SQL, Excel, CI/CD, AWS, Kubernetes, Terraform, Docker, AI-driven analytics, Capacity planning platforms, PostgreSQL, DevOps methodologies, Infrastructure as Code, Scalable infrastructure design, Network bandwidth analysis, Security appliance capacity management, Physical data center constraints, Advanced Excel modeling, Operational resilience planning
Remote

Infrastructure Engineering

State Street

Boston, MA · 34 days ago · $130,000–$180,000
Firewalls, Routers and Switches, SIEM, Packet decoding and analysis, Web Proxy Servers, Load Balancing (F5 LTM, GTM, ASM), SSL VPN Solutions, Web application firewalls, Data loss prevention technologies, Network security solutions, CI/CD
Hybrid

Lead Infrastructure Engineer, Capacity Management

JPMorgan Chase

Jersey City, NJ · 12 days ago · $142,500–$185,000
Kubernetes, Python, R, Java, Tableau, Grafana, BMC Truesight, Dynatrace, AppDynamics, ARIMA, Prophet, OpenAI, Gemini, Claude, LangChain, LangGraph, Pivotal Cloud Foundry, Excel

Infrastructure SRE

IBM

Austin, TX · 50 days ago
Kubernetes, Terraform, Ansible, Python, Linux, Unix, AIX, Windows, Prometheus, Grafana, AWS, CI/CD, IaC, OpenShift, Jira, Networking, Security+, CISSP, Agile, Release Management, Change Management

Infrastructure Engineering

State Street

Quincy, MA · 40 days ago · $70,000–$118,750
ServiceNow, ITIL, CI/CD, AWS, Kubernetes, PostgreSQL, Docker, Prometheus, Grafana, Python, SQL, Git, Jira, Confluence, Linux, Azure, Google Cloud Platform, Terraform