Member of Technical Staff, Capacity & Efficiency Infrastructure - MAI Superintelligence Team | Microsoft Careers

Microsoft

Hybrid | Actively hiring

Mountain View, CA | Posted 72 days ago | $119,800–$234,700 / year


At a glance

AI generated

TL;DR

Microsoft AI is hiring a Member of Technical Staff, Capacity & Efficiency Infrastructure, to optimize the compute fleet that supports cutting-edge AI models. The role involves designing and implementing distributed training infrastructure in Python and C++ for large GPU clusters, building telemetry systems for performance visibility, profiling bottlenecks across compute, memory, networking, and storage subsystems, and working with ML researchers and hardware teams to scale research workloads efficiently. The ideal candidate has extensive experience with GPU architectures, low-level GPU programming (CUDA), collective communication libraries (NCCL), and frameworks such as PyTorch or JAX, along with a deep understanding of networking and storage systems. The position calls for expertise in profiling large-scale distributed systems and optimizing collective communication libraries for emerging hardware, in support of frontier-scale models and humanist superintelligence research.

Skills

Python, C++, CUDA, PyTorch, JAX, NCCL, InfiniBand, NVLink, distributed training parallelism, GPU architectures, profiling and benchmarking, telemetry systems, high-performance computing, large-scale AI infrastructure

What you'll do

Design and optimize distributed training infrastructure for GPU clusters using Python and C++.
Build telemetry systems to monitor performance, utilization, and cost metrics for ML workloads.
Profile and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
Drive architectural improvements in ML services to enhance efficiency and reliability.
Develop tools to provide insights and recommendations for improving fleet-wide efficiency.
Optimize collective communication libraries for emerging hardware topologies built on interconnects such as NVLink and InfiniBand.

What we're looking for

6+ years of technical engineering experience, including coding in C++, Python, or related languages
Bachelor’s Degree in Computer Science or a related field
Deep understanding of GPU architectures and distributed training parallelism
Experience in profiling, benchmarking, and debugging large-scale distributed systems
Proficiency in low-level GPU programming (CUDA, Triton) and ML frameworks like PyTorch
Track record of contributing to high-performance computing or AI infrastructure projects
Knowledge of networking technologies such as InfiniBand and NVLink

Market check

Salary context

This $119,800–$234,700 range sits above 45% of similar postings on FindRole.

Peer median band

$139,900–$234,700

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$168,262–$213,400

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services including Windows, Office 365, Azure cloud platform, Xbox gaming, and Surface devices.

Industry: Software & Cloud Computing

Microsoft currently has 534 open roles on FindRole.
