Principal ML Engineer - Large Scale Training Performance Optimization in San Jose, California | Advanced Micro Devices, Inc

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: San Jose, CA; Bellevue, WA
Salary: $240,000 / yr
Posted: 80 days ago
Closes: Mar 23, 2027

Market check

Salary context

Competitive pay

How this pay compares to similar roles

Pay scale for similar roles: $169k to $274k
Similar roles: $225k
This role: $240k

This role pays more than 63% of similar roles. Most similar roles pay $199,850–$249,750. At the midpoint, this role pays about $240k versus about $225k for comparable roles.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 71 open roles on FindRole.

Listed pay is typically about $178,400 across 71 roles with salary data.

At a glance

TL;DR

As a Principal Machine Learning Engineer on our Models and Applications team, you will lead the development and optimization of distributed training pipelines for large-scale generative AI models on AMD GPUs. Day to day, you will improve end-to-end training efficiency, optimize algorithms for scalability, and contribute to open-source projects. Ideal candidates have expertise in ML frameworks such as PyTorch, JAX, and TensorFlow and in large-scale training frameworks such as Megatron-LM, MaxText, and TorchTitan, along with a strong background in GPU kernel optimization and large model training. You will work closely with other teams to extend the capabilities of AMD's AI platform and keep it at the forefront of machine learning.

What you'll do

  • Train large models to convergence on AMD GPUs at scale.
  • Continuously improve end-to-end training pipeline performance.
  • Optimize distributed training algorithms for better scalability.
  • Contribute enhancements and optimizations to open-source projects.
  • Stay informed about advancements in training algorithms and techniques.
  • Guide the development direction of AMD’s AI platform initiatives.

What we're looking for

  • Extensive experience with distributed training pipelines and algorithms.
  • Proficiency in ML/DL frameworks like PyTorch, JAX, or TensorFlow.
  • Expertise in optimizing large model training on GPUs at scale.
  • Strong background in GPU kernel optimization and performance analysis.
  • Master’s degree or PhD in Computer Science or related field required.
  • Excellent Python or C++ programming skills for debugging and profiling.

More like this

Similar roles

Sr. ML Engineer – ML & Applied AI

Gap Inc

Remote (San Francisco, CA) · 39 days ago
Skills: Python, scikit-learn, XGBoost, PyTorch, TensorFlow, FastAPI, Kubernetes, Docker, AWS, CI/CD, Git, SQL, Spark, Prometheus, Grafana, MLOps, LLMs, Vector databases, RAG, Agentic workflows