Staff ML Platform Engineer - Recommendations

Shopify

Quick summary

Work type: On-site
Location: —
Posted: 45 days ago

Market check

Salary context

How this pay compares to similar roles

Similar roles: $220k (chart scale: $156k to $276k)

This listing doesn't post a salary. Most similar roles pay $190,225–$249,750.

Based on 240 similar postings.

Employer

About Shopify

Shopify is a leading global commerce platform that enables businesses of all sizes to start, grow, and manage their retail operations online and in-person. It provides tools for storefronts, payments, shipping, and marketing to millions of merchants worldwide.

Shopify currently has 28 open roles on FindRole.

At a glance

TL;DR · Staff ML Platform Engineer - Recommendations

As a Staff ML Platform Engineer at Shopify, you will lead the development and maintenance of core ML infrastructure, including GPU training clusters and real-time serving systems, and ensure they meet strict latency requirements during high-traffic events like Black Friday. Day to day, you will design Kubernetes-based training pipelines, optimize performance with techniques such as mixed precision and kernel tuning, and build abstractions that speed up model iteration. You will also mentor engineers, drive technical strategy, and contribute to hiring. The role requires deep expertise in distributed systems, GPU training, and Kubernetes, proficiency in Python and PyTorch, hands-on experience with cloud-native ML orchestration tools, and a track record of technical leadership and mentoring.

Skills

Kubernetes Python GPU Docker CI/CD Prometheus Grafana Terraform AWS LLM PyTorch SkyPilot Ray vLLM TensorRT-LLM Triton PostgreSQL GitOps MLOps

What you'll do

Design and operate GPU training pipelines on Kubernetes, including multi-node distributed training.
Own training reliability: checkpointing, fault tolerance, preemption recovery, and resource scheduling.
Build model serving infrastructure for real-time recommendation with strict latency requirements.
Optimize serving cost and performance through batching strategies, GPU right-sizing, and autoscaling.
Define infrastructure patterns and best practices adopted across the team to improve developer experience.

What we're looking for

7+ years of software engineering experience, with a focus on ML infrastructure or distributed systems
Deep hands-on experience with GPU training at scale, including distributed training and performance tuning
Strong Kubernetes skills for managing stateful GPU workloads and optimizing resource scheduling
Production model serving experience with real-time traffic and strict latency requirements
Demonstrated technical leadership in driving architecture decisions and influencing engineering direction
Experience designing infrastructure abstractions used by other engineers to improve developer efficiency

Similar roles

Staff ML Ops Engineer - Recommendations

Shopify

US 45 days ago

Python Kubernetes Airflow CI/CD PyTorch SLOs BigQuery Dagster Prefect SkyPilot Ray Prometheus Grafana MLOps

Senior ML Platform Engineer

Nvidia

Remote (Santa Clara, CA) +3 10 days ago $152,000–$241,500

Terraform Ansible Python Go Kubernetes Docker CI/CD Prometheus Grafana PyTorch TensorFlow Horovod NCCL GitOps Linux Networking Performance_Tuning SRE ML_Workflows GPU_Technologies

ML Platform Engineer

Apple Inc

Sunnyvale, CA 59 days ago $147,400–$272,100

Python PyTorch TensorFlow JAX Docker Kubernetes CI/CD AWS GCP Azure Spark CoreML Metal CUDA OpenCL Swift C++ Terraform Prometheus

Software Engineer, ML platform and Infrastructure

Apple Inc

San Francisco, CA 66 days ago $212,000–$318,400

Python Java Go Kubernetes AWS GCP LangGraph LangChain DevOps Docker CI/CD Prometheus Grafana Spark Flink Iceberg Snowflake

Software Engineer, ML platform and Infrastructure

Apple Inc

Austin, TX 66 days ago

Python Java Go Kubernetes AWS GCP LangGraph LangChain DevOps CI/CD Docker Prometheus Grafana Spark Flink Iceberg Snowflake

Senior Staff ML Engineer, Search & Recommendation

Remote (US) 11 days ago $266,000–$372,400

Python PyTorch TensorFlow LLM Search Engine Recommendation Systems ML Models CI/CD Git Kubernetes Docker PostgreSQL Redis Elasticsearch GraphQL
