Capacity & Efficiency Infrastructure

Microsoft

Hybrid

Quick summary

Work type: Hybrid
Location: Mountain View, CA
Salary: $119,800–$234,700 / yr
Posted: 99 days ago
Closes: Sep 16, 2026
Nearby: 99+ roles within 25 mi

Market check

Salary context

Competitive pay

How this pay compares to similar roles

Chart: pay for similar roles ranges from $106k to $248k. Similar-role midpoint: $176k. This role: $177k.

This role pays more than 56% of similar roles. Most similar roles pay $140,987–$210,312 (the shaded band in the pay chart). At the midpoint, this role pays about $177k, versus about $176k for comparable roles.

Based on 240 similar postings.

Employer

About Microsoft

Microsoft Corporation is a global technology company that produces software, hardware, and cloud services, including Windows, Office 365, the Azure cloud platform, Xbox gaming, and Surface devices.

Industry: Software & Cloud Computing

Microsoft currently has 622 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 571 roles with salary data.

At a glance

TL;DR · Capacity & Efficiency Infrastructure

Microsoft AI is hiring a Member of Technical Staff – Capacity & Efficiency Infrastructure to improve and optimize the compute fleet that supports cutting-edge AI models. The role involves designing and implementing distributed training infrastructure in Python and C++ for large GPU clusters, building telemetry systems for performance visibility, profiling bottlenecks across compute, memory, networking, and storage subsystems, and working with ML researchers and hardware teams to scale research efficiently. The ideal candidate has extensive experience with GPU architectures, low-level GPU programming (CUDA), collective communication libraries such as NCCL, and frameworks like PyTorch or JAX, along with a deep understanding of networking and storage systems. The position calls for expertise in profiling large-scale distributed systems and optimizing collective communication libraries for emerging hardware topologies, in support of frontier-scale models and humanist superintelligence research.

Skills

Python, C++, CUDA, PyTorch, JAX, NCCL, InfiniBand, NVLink, distributed training parallelism, GPU architectures, profiling and benchmarking, telemetry systems, high-performance computing, large-scale AI infrastructure

What you'll do

Design and optimize distributed training infrastructure for GPU clusters using Python and C++.
Build telemetry systems to monitor performance, utilization, and cost metrics in ML models.
Profile and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
Drive architectural improvements in ML services to enhance efficiency and reliability.
Develop tools to provide insights and recommendations for improving fleet-wide efficiency.
Optimize collective communication libraries for emerging hardware topologies built on interconnects like NVLink and InfiniBand.

What we're looking for

6+ years of technical engineering experience with coding in C++, Python, or related languages
Bachelor’s Degree in Computer Science or a related field
Deep understanding of GPU architectures and distributed training parallelism
Experience in profiling, benchmarking, and debugging large-scale distributed systems
Proficiency in low-level GPU programming (CUDA, Triton) and ML frameworks like PyTorch
Track record of contributing to high-performance computing or AI infrastructure projects
Knowledge of networking technologies such as InfiniBand and NVLink

Similar roles

Senior Infrastructure Capacity Engineer

F5 Inc

Remote (Seattle, WA) +4 · 3 days ago · $161,600–$242,400

Python, Prometheus, Grafana, SQL, Excel, CI/CD, AWS, Kubernetes, Terraform, Docker, AI-driven analytics, Capacity planning platforms, PostgreSQL, DevOps methodologies, Infrastructure as Code, Scalable infrastructure design, Network bandwidth analysis, Security appliance capacity management, Physical data center constraints, Advanced Excel modeling, Operational resilience planning

Remote

Infrastructure Engineering

State Street

Boston, MA · 34 days ago · $130,000–$180,000

Firewalls, Routers and Switches, SIEM, Packet decoding and analysis, Web Proxy Servers, Load Balancing (F5 LTM, GTM, ASM), SSL VPN Solutions, Web application firewalls, Data loss prevention technologies, Network security solutions, CI/CD

Hybrid

Lead Infrastructure Engineer, Capacity Management

JPMorgan Chase

Jersey City, NJ · 12 days ago · $142,500–$185,000

Kubernetes, Python, R, Java, Tableau, Grafana, BMC Truesight, Dynatrace, AppDynamics, ARIMA, Prophet, OpenAI, Gemini, Claude, LangChain, LangGraph, Pivotal Cloud Foundry, Excel

Infrastructure SRE

IBM

Austin, TX · 50 days ago

Kubernetes, Terraform, Ansible, Python, Linux, Unix, AIX, Windows, Prometheus, Grafana, AWS, CI/CD, IaC, OpenShift, Jira, Networking, Security+, CISSP, Agile, Release Management, Change Management

Enterprise Infrastructure Manager, Enterprise Capacity & Performance Engineering

PNC

Pittsburgh, PA +4 · 3 days ago · $100,100–$204,490

AWS, Azure, GCP, Linux, Windows, Oracle, SQL Server, MongoDB, Java, Prometheus, Turbonomic, Dynatrace, vROPs, DevOps, SRE, ITIL, CI/CD

Infrastructure Engineering

State Street

Quincy, MA · 40 days ago · $70,000–$118,750

ServiceNow, ITIL, CI/CD, AWS, Kubernetes, PostgreSQL, Docker, Prometheus, Grafana, Python, SQL, Git, Jira, Confluence, Linux, Azure, Google Cloud Platform, Terraform
