Sr./Staff ML Infrastructure Engineer, Compute (TPU Scheduling) - Foundation Model

Apple Inc

Quick summary

Work type: On-site
Location: Santa Clara, CA
Salary: $181,100–$318,400 / yr
Posted: 27 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: midpoint pay for similar roles is $217k; for this role, $250k. The axis runs from $144k to $337k, with the range most similar roles pay shown as a shaded band.]

This role pays more than 77% of similar roles. Most pay $184,562–$249,750 — the shaded band above. At the midpoint, this role pays about $250k versus about $217k for comparable roles.

Based on 240 similar postings.

Employer

About Apple Inc

Apple Inc. is a multinational technology company known for designing and manufacturing consumer electronics, software, and online services, including the iPhone, Mac, iPad, and App Store.

Industry: Consumer Electronics & Software

Apple Inc currently has 638 open roles on FindRole.

Listed pay typically runs $171,600–$272,100 across 505 roles with salary data.

At a glance

TL;DR · Sr./Staff ML Infrastructure Engineer, Compute (TPU Scheduling) - Foundation Model

As a Senior/Staff ML Infrastructure Engineer on the Foundation Model Compute Infrastructure team, you will lead the design and development of scheduling and orchestration systems for TPU-based workloads across multi-region clusters. You will build topology-aware, quota-aware, and fault-tolerant schedulers that improve utilization and reliability, and collaborate with foundation model teams to support distributed training frameworks such as Pathways and JAX. You will also develop automation for provisioning and resource management in Kubernetes environments so that clusters operate efficiently at scale. The role calls for strong programming skills in Python, Go, C++, or a similar language, experience with Kubernetes and large-scale cluster management systems, and a solid background in distributed systems, scalability, and performance engineering. Experience with TPU infrastructure and distributed ML training frameworks is highly valued.

What you'll do

  • Design and evolve scheduling systems for TPU-based workloads across multi-region clusters.
  • Build topology-aware schedulers to enhance utilization and reliability of TPU infrastructure.
  • Develop orchestration systems for distributed ML workloads on Kubernetes and accelerator infrastructure.
  • Automate provisioning, resource management, and recovery handling to improve cluster efficiency.
  • Mentor engineers and influence architectural direction in Apple’s AI compute platform.

What we're looking for

  • 7+ years of experience building large-scale distributed systems or cloud infrastructure
  • Strong programming skills in Python, Go, C++, or similar languages
  • Extensive experience with compute infrastructure and workload scheduling
  • Expertise in distributed systems, scalability, reliability, and performance engineering
  • Experience with Kubernetes, container orchestration, or cluster management systems
  • Bachelor’s degree in Computer Science, Engineering, or related field

More like this

Similar roles

Staff ML Infrastructure Engineer (Compute)

General Motors (GM)

Remote (GM Automation - Sunnyvale, US) 9 days ago $197,000–$326,000
Kubernetes, Docker, Go, AWS, GCP, Azure, CI/CD, Prometheus, Grafana, Python, PostgreSQL, Terraform, GitLab, HPC, GPU, Telemetry