Job Description
Shopify is the commerce platform that powers millions of merchants worldwide. Behind the product experience are ML systems that drive recommendations, search, and personalization at massive scale.
We build and maintain the operational backbone behind these systems: deployment pipelines, evaluation frameworks, data preprocessing, and the monitoring that keeps models fresh and reliable in production. Our models serve hundreds of millions of buyers, and the pipelines we build directly impact how quickly and safely we can improve merchant outcomes.
The Role
You will own the operational lifecycle of our ML systems: deployment pipelines, evaluation frameworks, data pipelines, and the monitoring and reliability layer that keeps everything running in production. You'll ensure models go from training to production safely, that we can evaluate changes rigorously, and that the data feeding our models is fresh and correct.
This role is the connective tissue between research and production. You'll build the systems that let engineers ship model improvements with confidence and speed, while maintaining the reliability standards required to serve hundreds of millions of buyers - including during peak events like Black Friday/Cyber Monday.
This role carries real technical authority. You'll set the standards for how models get deployed and evaluated, mentor engineers on operational best practices, and drive alignment on reliability and pipeline strategy across the team. You'll influence technical direction beyond your immediate team and raise the engineering bar through hiring and technical reviews.
What You'll Do
Deployment & Rollout
- Own the model deployment pipeline end to end: export, validation, canary rollout, rollback, and A/B integration
- Build and maintain CI/CD for ML: automated testing, model validation gates, and progressive delivery
- Ensure safe, repeatable deployments with clear rollback paths and minimal manual intervention
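To make the "validation gates" idea above concrete, here is a minimal sketch of a pre-deployment gate that compares a candidate model's offline metrics against the production baseline. The metric names, thresholds, and structure are illustrative assumptions, not Shopify's actual pipeline.

```python
# Hypothetical sketch of a model validation gate: the rollout proceeds only
# if no guardrail metric regresses beyond its allowed threshold relative to
# the production baseline. All metric names and thresholds are illustrative.

from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)

# Guardrail metrics: (metric name, max allowed relative regression).
GUARDRAILS = [
    ("recall_at_10", 0.01),   # allow at most a 1% relative drop
    ("ndcg_at_10", 0.01),
    ("coverage", 0.05),
]

def validate_candidate(candidate: dict, baseline: dict) -> GateResult:
    """Return pass/fail plus the list of guardrails that regressed."""
    failures = []
    for metric, max_drop in GUARDRAILS:
        rel_change = (candidate[metric] - baseline[metric]) / baseline[metric]
        if rel_change < -max_drop:
            failures.append((metric, rel_change))
    return GateResult(passed=not failures, failures=failures)
```

In a setup like this, a canary rollout would begin only when `validate_candidate(...).passed` is true; otherwise the pipeline halts and the baseline model stays in place.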
Evaluation & Experimentation
- Build automated offline evaluation pipelines against production baselines
- Extend our experimentation framework so ML Engineers can launch and evaluate model changes with minimal friction
- Maintain evaluation datasets and ensure data freshness and correctness
- Integrate offline metrics with online A/B testing to close the feedback loop
Data Pipelines
- Own data preprocessing for training: interaction histories, feature stores, and embedding tables
- Manage workflow orchestration (Airflow or equivalent) for scheduled retraining and data refresh. You trigger and monitor training runs; the underlying GPU compute layer is owned by the infrastructure side of the team.
- Ensure data quality, lineage tracking, and pipeline idempotency
- Own data correctness and freshness; partner with infrastructure engineers on data loading throughput and efficiency
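As a rough illustration of the data quality and freshness checks described above, the sketch below validates a training-data partition before it is published: row-count sanity, null-rate limits, and a staleness bound. Column names and thresholds are hypothetical.

```python
# Illustrative data quality gate for a preprocessing pipeline: a partition
# is published for training only if it passes row-count, null-rate, and
# freshness checks. Field names and limits are assumptions, not real config.

from datetime import datetime, timedelta, timezone

def check_partition(rows: list, expected_min_rows: int,
                    max_null_rate: float, max_age: timedelta) -> list:
    """Return human-readable violations; an empty list means the partition passes."""
    violations = []
    if len(rows) < expected_min_rows:
        violations.append(f"row count {len(rows)} below minimum {expected_min_rows}")
    nulls = sum(1 for r in rows if r.get("user_id") is None)
    if rows and nulls / len(rows) > max_null_rate:
        violations.append(f"user_id null rate {nulls / len(rows):.2%} exceeds limit")
    newest = max((r["event_time"] for r in rows), default=None)
    if newest is not None and datetime.now(timezone.utc) - newest > max_age:
        violations.append("partition is stale: newest event exceeds freshness bound")
    return violations
```

A check like this pairs naturally with idempotent writes: a partition that fails is never published, so re-running the pipeline after a fix produces the same end state.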
Monitoring & Reliability
- Build monitoring and alerting across training jobs, serving endpoints, and data pipelines
- Define and maintain SLOs for model freshness, serving latency, and training throughput
- Participate in incident response and drive post-mortems for ML system failures
- Identify and eliminate toil through automation
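The model-freshness SLO mentioned above can be sketched as a simple staleness check that distinguishes "within objective", "SLO breached", and "page on-call". The 24-hour and 36-hour thresholds are illustrative assumptions.

```python
# Hypothetical model-freshness SLO check: given the timestamp of the last
# successful model deployment, classify staleness as ok, an SLO breach, or
# a paging condition. The thresholds are illustrative, not real targets.

from datetime import datetime, timedelta, timezone

SLO_TARGET = timedelta(hours=24)   # model should be refreshed within 24h
PAGE_AFTER = timedelta(hours=36)   # page on-call if staleness exceeds 36h

def freshness_status(last_deploy: datetime, now: datetime) -> str:
    """Return 'ok', 'slo_breach', or 'page' based on model staleness."""
    age = now - last_deploy
    if age > PAGE_AFTER:
        return "page"
    if age > SLO_TARGET:
        return "slo_breach"
    return "ok"
```

Separating the SLO threshold from the paging threshold is a common design choice: it lets a breach surface on a dashboard and in post-mortems without waking someone up the moment the objective slips.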
Technical Leadership
- Drive cross-team technical strategy for ML operations - identify systemic reliability risks and pipeline bottlenecks before they become incidents
- Mentor and up-level engineers on the team through pairing, design reviews, and setting operational standards
- Contribute to hiring: screen candidates, conduct technical interviews, and calibrate the engineering bar
- Write technical proposals and RFCs that shape operational direction across the organization
What We're Looking For
Required
- 7+ years in software engineering, with 5+ years focused on MLOps, data engineering, or production ML systems
- Strong experience with ML deployment pipelines: model export, validation, canary releases, and rollback strategies
- Experience with workflow orchestration for ML (Airflow, Dagster, Prefect, or similar)
- Solid Python fundamentals; comfortable working with PyTorch model artifacts and training configurations
- Production monitoring experience: you've built or operated alerting, dashboards, and SLO frameworks for ML systems
- Experience with data pipelines at scale: batch processing, feature engineering, and data quality validation
- Working proficiency with Kubernetes: able to debug pod failures, understand resource scheduling, and navigate GPU workloads
- Demonstrated technical leadership: you've driven operational strategy, written technical proposals, and influenced engineering direction beyond your immediate team
- Track record of mentoring engineers and raising the reliability bar on a team
Preferred
- Experience with large-scale data warehouses (BigQuery or equivalent) for offline evaluation and metrics
- Hands-on with experiment tracking and A/B testing frameworks
- Experience operating recommendation or retrieval systems at scale
- Familiarity with model compression workflows in production (quantization, pruning, distillation)
- Experience with cloud-native ML orchestration (SkyPilot, Ray, or similar)
How We Work
- You'll pair directly with ML Engineers. Understanding their models well enough to build the right operational workflows is part of the job.
- We prefer automation over runbooks. If a process can be scripted, it should be.
- On-call is shared. When you're on rotation, your scope is pipeline failures, data freshness alerts, deployment rollbacks, and evaluation integrity - you own it end to end.
- You'll dig into Airflow DAG failures, data drift alerts, and deployment validation issues. This is a deeply operational role with high production stakes.
- Research and production are the same codebase. You'll see your operational decisions reflected in real model quality and real merchant outcomes.
- Shopify operates on high trust and low process. You'll have real ownership and the autonomy to make decisions, not just execute tickets.
What Success Looks Like
- In 3 months: You've onboarded to deployment and evaluation pipelines, shipped at least one meaningful improvement to deployment safety or developer experience, and can independently debug issues across the operational stack.
- In 6 months: You own a major subsystem (deployment pipeline, evaluation framework, or data pipelines). ML Engineers are shipping model changes faster or more safely because of improvements you've made.
- In 12 months: You've shaped the operational roadmap for ML systems and influenced engineering direction beyond the team. Deployments are faster and safer, evaluation is more rigorous, and the team trusts the pipelines you've built. Other engineers across the organization come to you for guidance on ML operational best practices. You've made the team stronger through hiring and mentorship.
About Shopify
Shopify is a global commerce company providing a leading e-commerce platform and an ecosystem of tools that allow businesses of all sizes to build, manage, and grow their online and physical retail operations.
Industry: E-Commerce Technology & Payments