AIML - Staff ML Infrastructure Engineer, ML Platform & Technology - Pre-training Infrastructure

Apple Inc

Quick summary

Work type: On-site
Location: San Francisco, CA
Salary: $181,100–$318,400 / yr
Posted: 23 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $162k to $335k; similar roles at about $221k, this role at about $250k; a shaded band marks where most similar roles pay.]

This role pays more than 80% of similar roles. Most similar roles pay $196,375–$246,150 (the shaded band in the chart). At the midpoint, this role pays about $250k, versus about $221k for comparable roles.

Based on 239 similar postings.

Employer

About Apple Inc

Apple Inc. is a multinational technology company known for designing and manufacturing consumer electronics, software, and online services, including the iPhone, Mac, iPad, and App Store.

Industry: Consumer Electronics & Software

Apple Inc currently has 638 open roles on FindRole.

Listed pay typically runs $171,600–$272,100 across 505 roles with salary data.

At a glance

TL;DR · AIML - Staff ML Infrastructure Engineer, ML Platform & Technology - Pre-training Infrastructure

Join the ML Compute team as a Staff ML Infrastructure Engineer and drive large-scale pre-training initiatives for cutting-edge foundation models, with a focus on resiliency, efficiency, scalability, and resource optimization. You will enhance distributed training techniques, optimize workloads built with JAX, PyTorch, XLA, and CUDA on Kubernetes, and use high-performance networking technologies such as NCCL and TPU interconnects. You will also architect a robust MLOps platform to streamline pre-training operations, operationalize large-scale ML workloads with fault tolerance, and lead complex technical projects while mentoring engineers in your areas of expertise. The role requires proficiency in Python or Go, distributed systems, Kubernetes, Ray, and PySpark, experience with GPUs, TPUs, and AWS Trainium, and strong communication skills for collaborating across teams.

What you'll do

  • Drive large-scale pre-training initiatives for foundation models.
  • Enhance distributed training techniques to improve system efficiency.
  • Research and implement new technologies to optimize ML performance.
  • Optimize the execution of workloads built with JAX, PyTorch, XLA, and CUDA.
  • Architect a robust MLOps platform for streamlined pre-training operations.
  • Operationalize large-scale ML workloads on Kubernetes, ensuring reliability and fault tolerance.

What we're looking for

  • 6+ years of experience building scalable backend systems for machine learning models.
  • Strong expertise in distributed systems, reliability, scalability, containerization, and cloud platforms.
  • Proficiency in Kubernetes, Ray, PySpark, and other relevant cloud computing tools.
  • Expertise in programming languages such as Python or Go.
  • Ability to optimize execution and performance of ML workloads on large distributed systems.
  • Experience with high-performance networking technologies such as NCCL and TPU interconnects.
  • Knowledge of ML frameworks and tooling, including JAX, TensorFlow, PyTorch, and TensorRT.