Principal ML Engineer, Large Scale Training Performance Optimization

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: San Jose, CA; Bellevue, WA
Posted: 96 days ago
Closes: Mar 23, 2027
Nearby: 99+ roles within 25 mi

Market check

Salary context

How this pay compares to similar roles

Similar roles: $219k, on a scale from $153k to $273k.

This listing doesn't post a salary. Most similar roles pay $188,357–$249,750.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets. Industry: Semiconductors

AMD currently has 56 open roles on FindRole.

At a glance

TL;DR · Principal ML Engineer, Large Scale Training Performance Optimization

Role Posting

As a Principal Machine Learning Engineer joining our Models and Applications team, you will lead the development and optimization of distributed training pipelines for large-scale generative AI models on AMD GPUs. Your daily tasks include improving end-to-end training efficiency, optimizing algorithms for scalability, and contributing to open-source projects. Ideal candidates possess expertise in distributed training frameworks like PyTorch, JAX, TensorFlow, Megatron-LM, MaxText, and TorchTitan, along with a strong background in GPU kernel optimization and large model training. You will work closely with various teams to enhance the AMD AI platform's capabilities, ensuring it remains at the forefront of machine learning innovation.

Skills

PyTorch, TensorFlow, JAX, Megatron-LM, MaxText, TorchTitan, Python, C++, Distributed Training, GPU Kernel Optimization, CI/CD, Prometheus, Grafana

What you'll do

Train large models to convergence on AMD GPUs at scale.
Improve the end-to-end training pipeline performance continuously.
Optimize distributed training algorithms for better scalability.
Contribute enhancements and optimizations to open-source projects.
Stay informed about advancements in training algorithms and techniques.
Guide the development direction of AMD’s AI platform initiatives.

What we're looking for

Extensive experience with distributed training pipelines and algorithms.
Proficiency in ML/DL frameworks like PyTorch, JAX, or TensorFlow.
Expertise in optimizing large model training on GPUs at scale.
Strong background in GPU kernel optimization and performance analysis.
Master’s degree or PhD in Computer Science or a related field required.
Excellent Python or C++ programming skills for debugging and profiling.

Similar roles

Principal Machine Learning Engineer

General Motors (GM)

Remote (Sunnyvale, CA) · 101 days ago · $296,300–$453,200

Python, PyTorch, Distributed Training, AWS, GCP, Azure, GPU Computing, C++, Profiling, Analysis, Debugging, Optimization, Distributed Systems, Cloud Environments

Remote Hybrid

Fellow GPU Performance Optimization Engineer

AMD

San Jose, CA · 92 days ago

AMD GPU, ROCm, Nsight, Python, PyTorch, JAX, TensorFlow, Megatron-LM, TorchTitan, MaxText, NCCL, RCCL, C++, CUDA, HIP, Distributed Training, Performance Profiling, Bottleneck Analysis, Compiler Optimization, Graph-Level Optimization

Hybrid

Senior Software Engineer, AI Networking

NVIDIA

Santa Clara, CA +1 · 45 days ago · $152,000–$241,500

Python, PyTorch, TensorFlow, JAX, CUDA, NCCL, Reinforcement Learning, Bayesian Optimization, GNNs, Docker, Kubernetes, CI/CD, Prometheus, Grafana, Bash, C++, RoCE, RDMA

Senior System Software Engineer, Dynamo-Triton Inference Server

NVIDIA

Remote (Santa Clara, CA) +1 · 62 days ago · $152,000–$241,500

Rust, C++, Python, TensorRT, PyTorch, ONNX, OpenVINO, vLLM, TRT-LLM, GPU, Distributed Systems, GitHub, CI/CD, Kubernetes, Prometheus, Grafana, AWS, Azure, Google Cloud

Remote

Principal Machine Learning Engineer, Content ML, Level 7

Snap Inc.

Bellevue, WA +5 · 2 days ago · $276,000–$414,000

Python, TensorFlow, PyTorch, Kubernetes, Docker, CI/CD, PostgreSQL, AWS, Grafana, Prometheus, Scalability, Availability, Multimodal Modeling, Deep Learning, Recommendation Systems, Ranking Systems, Production Pipelines, Clean Design, Machine Learning Pods

ML Engineer, Evaluation Analysis, Metric and Data Strategy

Apple Inc

San Diego, CA · 66 days ago · $139,500–$258,100

Python, pandas, scipy, scikit-learn, R, statistical analysis, experimental design, data collection, evaluation metrics, AI evaluation, agentic experiences, LangChain, LangGraph, CrewAI, A2A, MCP
