Principal ML Engineer, Large Scale Training Performance Optimization

AMD

Hybrid

Quick summary

Work type
Hybrid
Location
San Jose, CA; Bellevue, WA
Posted
96 days ago
Closes
Mar 23, 2027

Market check

Salary context

How this pay compares to similar roles

Salary chart: similar roles at $219k, on a scale from $153k to $273k.

This listing doesn't post a salary. Most similar roles pay $188,357–$249,750.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 56 open roles on FindRole.

At a glance

TL;DR · Principal ML Engineer, Large Scale Training Performance Optimization

As a Principal Machine Learning Engineer joining our Models and Applications team, you will lead the development and optimization of distributed training pipelines for large-scale generative AI models on AMD GPUs. Your daily tasks include improving end-to-end training efficiency, optimizing algorithms for scalability, and contributing to open-source projects. Ideal candidates possess expertise in distributed training frameworks like PyTorch, JAX, TensorFlow, Megatron-LM, MaxText, and TorchTitan, along with a strong background in GPU kernel optimization and large model training. You will work closely with various teams to enhance the AMD AI platform's capabilities, ensuring it remains at the forefront of machine learning innovation.

What you'll do

  • Train large models to convergence on AMD GPUs at scale.
  • Continuously improve end-to-end training pipeline performance.
  • Optimize distributed training algorithms for better scalability.
  • Contribute enhancements and optimizations to open-source projects.
  • Stay informed about advancements in training algorithms and techniques.
  • Guide the development direction of AMD’s AI platform initiatives.

What we're looking for

  • Extensive experience with distributed training pipelines and algorithms.
  • Proficiency in ML/DL frameworks like PyTorch, JAX, or TensorFlow.
  • Expertise in optimizing large model training on GPUs at scale.
  • Strong background in GPU kernel optimization and performance analysis.
  • Master’s degree or PhD in Computer Science or a related field required.
  • Excellent Python or C++ programming skills for debugging and profiling.

More like this

Similar roles

Principal Machine Learning Engineer

General Motors (GM)

Remote (Sunnyvale, CA) · 101 days ago · $296,300–$453,200
Python, PyTorch, Distributed Training, AWS, GCP, Azure, GPU Computing, C++, Profiling, Analysis, Debugging, Optimization, Distributed Systems, Cloud Environments
Remote · Hybrid

Fellow GPU Performance Optimization Engineer

AMD

San Jose, CA · 92 days ago
AMD GPU, ROCm, Nsight, Python, PyTorch, JAX, TensorFlow, Megatron-LM, TorchTitan, MaxText, NCCL, RCCL, C++, CUDA, HIP, Distributed Training, Performance Profiling, Bottleneck Analysis, Compiler Optimization, Graph-Level Optimization
Hybrid

Senior Software Engineer, AI Networking

Nvidia

Santa Clara, CA +1 · 45 days ago · $152,000–$241,500
Python, PyTorch, TensorFlow, JAX, CUDA, NCCL, Reinforcement Learning, Bayesian Optimization, GNNs, Docker, Kubernetes, CI/CD, Prometheus, Grafana, Bash, C++, RoCE, RDMA

Principal Machine Learning Engineer, Content ML, Level 7

Snap Inc.

Bellevue, WA +5 · 2 days ago · $276,000–$414,000
Python, TensorFlow, PyTorch, Kubernetes, Docker, CI/CD, PostgreSQL, AWS, Grafana, Prometheus, Scalability, Availability, Multimodal Modeling, Deep Learning, Recommendation Systems, Ranking Systems, Production Pipelines, Clean Design, Machine Learning Pods