Staff ML Platform Engineer - Recommendations

Shopify

Quick summary

Work type: On-site
Location: (not specified in listing)
Posted: 45 days ago

Market check

Salary context

How this pay compares to similar roles

Chart: similar roles marked at $220k on a scale from $156k to $276k.

This listing doesn't post a salary. Most similar roles pay $190,225–$249,750.

Based on 240 similar postings.

Employer

About Shopify

Shopify is a leading global commerce platform that enables businesses of all sizes to start, grow, and manage their retail operations online and in-person. It provides tools for storefronts, payments, shipping, and marketing to millions of merchants worldwide.

Shopify currently has 28 open roles on FindRole.

At a glance

TL;DR · Staff ML Platform Engineer - Recommendations

As a Staff ML Platform Engineer for Recommendations at Shopify, you will lead the development and maintenance of core ML infrastructure, including GPU training clusters and real-time serving systems, and ensure they meet strict latency requirements during high-traffic events like Black Friday. Day to day, you will design Kubernetes-based training pipelines, optimize performance with techniques such as mixed precision and kernel tuning, and build abstractions that make model iteration seamless. You will also mentor engineers, drive technical strategy, and contribute to hiring. The role requires deep expertise in distributed systems, GPU training, and Kubernetes, proficiency in Python and PyTorch, hands-on experience with cloud-native ML orchestration tools, and a track record of technical leadership and mentoring.

What you'll do

  • Design and operate GPU training pipelines on Kubernetes, including multi-node distributed training.
  • Own training reliability: checkpointing, fault tolerance, preemption recovery, and resource scheduling.
  • Build model serving infrastructure for real-time recommendation with strict latency requirements.
  • Optimize serving cost and performance through batching strategies, GPU right-sizing, and autoscaling.
  • Define infrastructure patterns and best practices adopted across the team to improve developer experience.

What we're looking for

  • 7+ years of software engineering experience, with a focus on ML infrastructure or distributed systems
  • Deep hands-on experience with GPU training at scale, including distributed training and performance tuning
  • Strong Kubernetes skills for managing stateful GPU workloads and optimizing resource scheduling
  • Production model serving experience with real-time traffic and strict latency requirements
  • Demonstrated technical leadership in driving architecture decisions and influencing engineering direction
  • Experience designing infrastructure abstractions used by other engineers to improve developer efficiency

More like this

Similar roles

Senior ML Platform Engineer

Nvidia

Remote (Santa Clara, CA) +3 · 10 days ago · $152,000–$241,500
Terraform, Ansible, Python, Go, Kubernetes, Docker, CI/CD, Prometheus, Grafana, PyTorch, TensorFlow, Horovod, NCCL, GitOps, Linux, Networking, Performance Tuning, SRE, ML Workflows, GPU Technologies

ML Platform Engineer

Apple Inc

Sunnyvale, CA · 59 days ago · $147,400–$272,100
Python, PyTorch, TensorFlow, JAX, Docker, Kubernetes, CI/CD, AWS, GCP, Azure, Spark, CoreML, Metal, CUDA, OpenCL, Swift, C++, Terraform, Prometheus