Machine Learning Infrastructure Engineers

Shopify

Quick summary

Work type
On-site
Posted
45 days ago

Market check

Salary context

How this pay compares to similar roles

Similar roles: about $210k, on a scale from $154k to $269k.

This listing doesn't post a salary. Most similar roles pay $173,150–$246,150.

Based on 240 similar postings.

Employer

About Shopify

Shopify is a leading global commerce platform that enables businesses of all sizes to start, grow, and manage their retail operations online and in-person. It provides tools for storefronts, payments, shipping, and marketing to millions of merchants worldwide.

Shopify currently has 28 open roles on FindRole.

At a glance

TL;DR · Machine Learning Infrastructure Engineers

As a Machine Learning Infrastructure Engineer at Shopify, you will join an agile team responsible for building and operating the end-to-end platform that powers AI applications. Your day-to-day responsibilities include designing high-performance GPU-accelerated systems on Kubernetes, crafting self-serve developer experiences, and optimizing multi-tenant clusters for autoscaling and scheduling. You will also own the model lifecycle from training to inference, build real-time serving stacks, and ensure observability across pipelines with SLOs. The role requires deep expertise in Kubernetes, GPU infrastructure, distributed systems, and automation tools like Terraform and Helm. Proficiency in Python, Go, or Java is essential, along with experience in building developer tooling and self-service platforms. This position involves working closely with ML, data, and product teams to accelerate idea-to-impact delivery at global scale.

What you'll do

  • Build and operate ML control planes, APIs, CLIs, SDKs, and self-serve developer paths.
  • Design and optimize multi-tenant GPU Kubernetes clusters for autoscaling and scheduling.
  • Manage model lifecycle including training orchestration, registries, CI/CD, and safe rollbacks.
  • Construct real-time serving stacks and end-to-end pipelines for batch and streaming data.
  • Engineer feature platforms and storage solutions optimized for cost and performance.
  • Implement observability and SLOs across ML systems to automate remediation and planning.

What we're looking for

  • Proven experience in platform/infrastructure engineering with a track record of shipping production systems.
  • Deep expertise in Kubernetes for ML workloads including operators, Helm, and service mesh/gRPC.
  • Hands-on experience managing GPU infrastructure at scale within the NVIDIA ecosystem.
  • Strong background in distributed systems and API/service design for high-scale inference.
  • Proficiency in Terraform, Helm, GitOps, and major cloud platforms (GCP/AWS/Azure).
  • Expertise in observability tools like Prometheus/Grafana and SLO-driven operations for ML systems.
  • Proficiency in Python, Go, or Java, with experience building developer tooling and self-service platforms.

More like this

Similar roles

AI Infrastructure Engineer

Electronic Arts

Vancouver, British Columbia, Canada · 11 days ago · $122,300–$170,700
AWS, AWS CDK, Python, CI/CD, DevSecOps, Datadog, Prometheus, Grafana, OpenTelemetry, Kafka, SNS/SQS, Kubernetes, Databricks
Hybrid

Applied Machine Learning Engineers

Shopify

US · 45 days ago
Python, TensorFlow, PyTorch, LLMs, Reinforcement Learning, Model Quantization, Multimodal Models, Docker, Kubernetes, CI/CD, PostgreSQL, AWS, Petabyte-Scale Data, Embeddings

Machine Learning Engineer

Adobe

San Jose · 81 days ago · $183,300–$265,350
Python, PyTorch, LangChain, LangGraph, MCP, ADK, LLMs, VLLMs, CI/CD, Docker, AWS, PostgreSQL, Kubernetes

Machine Learning Engineer

Adobe

San Jose · 91 days ago · $161,700–$234,150
Python, TensorFlow, PyTorch, scikit-learn, SparkML, Kubernetes, AWS, CI/CD, SQL, Docker, PostgreSQL, MLOps

Machine Learning Engineer

Motorola Solutions

Los Angeles, CA · 63 days ago · $120,000–$160,000
Python, TensorFlow, PyTorch, scikit-learn, MATLAB, C++, signal processing, wireless communication, MIMO, OFDM, SDRs, GPU acceleration, embedded machine learning, real-time systems, adaptive modulation, beamforming, cognitive radio techniques, 3GPP, IEEE 802.11/15, military waveforms
Hybrid

Machine Learning Engineer

Q2

Austin, TX · 53 days ago
Python, TensorFlow, PyTorch, scikit-learn, R, Java, cloud platforms, scalable computing resources, machine learning pipelines, data analysis, statistics, optimization, probability theory, experimental methodologies, CI/CD
Hybrid