AIML - Staff ML Infrastructure Engineer, ML Platform & Technology - Pre-training Infrastructure

Apple Inc

Quick summary

Work type: On-site
Location: San Francisco, CA
Salary: $181,100–$318,400 / yr
Posted: 23 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $162k to $335k, with a shaded band marking where most similar roles pay. Similar roles: $221k. This role: $250k.]

This role pays more than 80% of similar roles. Most pay $196,375–$246,150 (the shaded band above). At the midpoint, this role pays about $250k, versus about $221k for comparable roles.

Based on 239 similar postings.

Employer

About Apple Inc

Apple Inc. is a multinational technology company known for designing and manufacturing consumer electronics, software, and online services, including the iPhone, Mac, iPad, and App Store.

Industry: Consumer Electronics & Software

Apple Inc currently has 638 open roles on FindRole.

Listed pay typically runs $171,600–$272,100 across 505 roles with salary data.


At a glance

TL;DR · AIML - Staff ML Infrastructure Engineer, ML Platform & Technology - Pre-training Infrastructure


Join the ML Compute team as a Staff ML Infrastructure Engineer and drive large-scale pre-training initiatives for cutting-edge foundation models, focusing on resiliency, efficiency, scalability, and resource optimization. You will enhance distributed training techniques, optimize workloads built with JAX, PyTorch, XLA, and CUDA on Kubernetes, and leverage high-performance networking technologies such as NCCL and TPU interconnects. You will also architect a robust MLOps platform to streamline pre-training operations, operationalize large-scale ML workloads with fault tolerance, and lead complex technical projects while mentoring engineers in your areas of expertise. Proficiency in Python or Go, distributed systems, Kubernetes, Ray, and PySpark is essential, as is experience with GPUs, TPUs, and AWS Trainium, along with strong communication skills for collaborating across teams.

Skills

Python Kubernetes Ray PySpark JAX PyTorch TensorFlow CUDA NCCL TPU XLA Docker CI/CD MLOps Prometheus Grafana AWS GPU High-performance networking

What you'll do

Drive large-scale pre-training initiatives for foundation models.
Enhance distributed training techniques to improve system efficiency.
Research and implement new technologies to optimize ML performance.
Optimize the execution of workloads built with JAX, PyTorch, XLA, and CUDA.
Architect a robust MLOps platform for streamlined pre-training operations.
Operationalize large-scale ML workloads on Kubernetes for reliability.

What we're looking for

6+ years of experience building scalable backend systems for machine learning models.
Strong expertise in distributed systems, reliability, scalability, containerization, and cloud platforms.
Proficiency in Kubernetes, Ray, PySpark, and other relevant cloud computing tools.
Expertise in programming languages such as Python or Go.
Ability to optimize execution and performance of ML workloads on large distributed systems.
Experience with high-performance networking technologies like NCCL and TPU interconnect.
Knowledge of ML frameworks including JAX, TensorFlow, PyTorch, and TensorRT.

Similar roles

Sr. / Staff ML Engineer, FM Training Integration - ML Compute

Apple Inc

Santa Clara, CA 23 days ago $181,100–$318,400

Python PyTorch JAX Docker Kubernetes GPU TPU CI/CD NVIDIA Nsight PyTorch Profiler AWS Azure GCP PostgreSQL MongoDB


Software Engineer, ML Infrastructure, Level 4

Snap Inc.

Santa Monica, CA 2 days ago $157,000–$235,000

Python Java Scala C++ Spark Flink Ray TensorFlow PyTorch Distributed Systems Big Data Processing ML Frameworks Scikit-learn


AIML - Sr Machine Learning Engineer - Data and ML Innovation

Apple Inc

Seattle, WA 3 days ago $139,500–$258,100

Python PyTorch TensorFlow JAX AWS Kubernetes Docker CI/CD PostgreSQL MongoDB Git GitHub Slack Zoom Google Cloud Platform Azure Machine Learning Prometheus Grafana


AIML Researcher/Engineer - Foundation Model Post-Training

Apple Inc

New York City, NY 9 days ago

Python PyTorch JAX Reinforcement Learning LLMs Distributed Training Transformers Curriculum Learning Evaluation Methodologies Data Generation Automated Data Filtering


AIML Researcher/Engineer - Foundation Model Post-Training

Apple Inc

Seattle, WA 2 days ago

Python PyTorch JAX Reinforcement Learning LLMs Distributed Training Transformers Architecture Curriculum Learning Evaluation Methodologies Deep Learning CI/CD


Director, ML Engineering

Adobe

San Jose 16 days ago $265,700–$384,675

Python Kubernetes Docker CI/CD Prometheus Grafana AWS GPU CUDA NCCL PostgreSQL MySQL MongoDB Git Jenkins Terraform Ansible Linux Windows Server DevOps Scrum Agile
