Sr./Staff ML Infrastructure Engineer, Compute (TPU Scheduling) - Foundation Model

Apple Inc

Quick summary

Work type: On-site
Location: Santa Clara, CA
Salary: $181,100–$318,400 / yr
Posted: 27 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $144k to $337k, with a shaded band marking where most similar roles pay. Similar roles midpoint: $217k. This role midpoint: $250k.]

This role pays more than 77% of similar roles. Most pay $184,562–$249,750 — the shaded band above. At the midpoint, this role pays about $250k versus about $217k for comparable roles.

Based on 240 similar postings.

Employer

About Apple Inc

Apple Inc. is a multinational technology company known for designing and manufacturing consumer electronics, software, and online services, including the iPhone, Mac, iPad, and App Store.

Industry: Consumer Electronics & Software

Apple Inc currently has 638 open roles on FindRole.

Listed pay typically runs $171,600–$272,100 across 505 roles with salary data.

At a glance

TL;DR

As a Senior/Staff ML Infrastructure Engineer on the Foundation Model Compute Infrastructure team, you will lead the design and development of scheduling and orchestration systems for TPU-based workloads across multi-region clusters. You will build topology-aware, quota-aware, and fault-tolerant schedulers that improve utilization and reliability, and collaborate with foundation model teams to support distributed training frameworks such as Pathways and JAX. You will also develop automation for provisioning and resource management in Kubernetes environments so that clusters operate efficiently at scale. The role calls for strong programming skills in Python, Go, C++, or similar languages, experience with Kubernetes or other large-scale cluster management systems, and a strong background in distributed systems, scalability, and performance engineering. Experience with TPU infrastructure and distributed ML training frameworks is highly valued.

Skills

Python, Kubernetes, TPU, Go, C++, Docker, JAX, PyTorch, TensorFlow, Ray, Pathways, Prometheus, Grafana, CI/CD, AWS, Azure, Google Cloud Platform

What you'll do

Design and evolve scheduling systems for TPU-based workloads across multi-region clusters.
Build topology-aware schedulers to enhance utilization and reliability of TPU infrastructure.
Develop orchestration systems for distributed ML workloads on Kubernetes and accelerator infrastructure.
Automate provisioning, resource management, and recovery handling to improve cluster efficiency.
Mentor engineers and influence architectural direction in Apple’s AI compute platform.

What we're looking for

7+ years of experience building large-scale distributed systems or cloud infrastructure
Strong programming skills in Python, Go, C++, or similar languages
Extensive experience with compute infrastructure and workload scheduling
Expertise in distributed systems, scalability, reliability, and performance engineering
Experience with Kubernetes, container orchestration, or cluster management systems
Bachelor’s degree in Computer Science, Engineering, or related field