Principal ML Engineer - Large Scale Training Performance Optimization in San Jose, California | Advanced Micro Devices, Inc

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: San Jose, CA; Bellevue, WA
Salary: $240,000 / yr
Posted: 80 days ago
Closes: Mar 23, 2027
Nearby: 99+ roles within 25 mi

Market check

Salary context

Competitive pay

How this pay compares to similar roles

Chart: this role ($240k) against similar roles ($225k), on a scale from $169k to $274k, with a shaded band marking where most similar roles pay.

This role pays more than 63% of similar roles. Most similar roles pay $199,850–$249,750 (the shaded band in the chart). At the midpoint, this role pays about $240k versus about $225k for comparable roles.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 71 open roles on FindRole.

Listed pay is typically $178,400 across 71 roles with salary data.


At a glance


As a Principal Machine Learning Engineer on the Models and Applications team, you will lead the development and optimization of distributed training pipelines for large-scale generative AI models on AMD GPUs. Day to day, you will improve end-to-end training efficiency, optimize algorithms for scalability, and contribute to open-source projects. Ideal candidates have expertise in ML frameworks such as PyTorch, JAX, and TensorFlow and in distributed training libraries such as Megatron-LM, MaxText, and TorchTitan, along with a strong background in GPU kernel optimization and large model training. You will work closely with other teams to extend the capabilities of AMD's AI platform and keep it at the forefront of machine learning innovation.

Skills

PyTorch, TensorFlow, JAX, Megatron-LM, MaxText, TorchTitan, Python, C++, Distributed Training, GPU Kernel Optimization, CI/CD, Prometheus, Grafana

What you'll do

Train large models to convergence on AMD GPUs at scale.
Continuously improve end-to-end training pipeline performance.
Optimize distributed training algorithms for better scalability.
Contribute enhancements and optimizations to open-source projects.
Stay informed about advancements in training algorithms and techniques.
Guide the development direction of AMD’s AI platform initiatives.

What we're looking for

Extensive experience with distributed training pipelines and algorithms.
Proficiency in ML/DL frameworks like PyTorch, JAX, or TensorFlow.
Expertise in optimizing large model training on GPUs at scale.
Strong background in GPU kernel optimization and performance analysis.
Master’s degree or PhD in Computer Science or a related field required.
Excellent Python or C++ programming skills for debugging and profiling.

Similar roles

Lead Gen AI / ML Engineer in Austin, Texas | Advanced Micro Devices, Inc

AMD

Austin, TX · 64 days ago · $168,000

Python, scikit-learn, PyTorch, TensorFlow, SQL, MLOps, ETL, AWS, Azure, GCP, CI/CD, LangGraph, Prometheus, Kubernetes

Hybrid


Principal AI Performance Engineer in San Jose, California | Advanced Micro Devices, Inc

AMD

San Jose, CA · 92 days ago · $240,000

Python, C++, vLLM, SGLang, TensorRT-LLM, HIP, CUDA, Triton, CK, Linux, GPU, AI agents, CI/CD, PyTorch, Kubernetes

Hybrid


Sr. Fellow, ML Workload Performance in San Jose, California | Advanced Micro Devices, Inc

AMD

San Jose, CA (+1 more location) · 150 days ago · $292,000

Python, C++, CUDA, TensorFlow, PyTorch, AMD GPUs, MLOps, Distributed Systems, CI/CD, Performance Modeling, Benchmarking, LLMs, Diffusion Models, Multimodal Systems, RecSys, Generative AI, Kernel Optimization, Hardware-Software Co-design


Principal ML Engineer, Machine Learning Platform and Systems Architecture

Autodesk

Remote (Canada) · 36 days ago · $152,000–$272,250

Python, Kubernetes, Ray, Airflow, Spark, CI/CD, Terraform, Docker, Prometheus, Grafana, PostgreSQL, AWS, Azure, Google Cloud Platform, Git, Jenkins, Ansible, Chef, JSON, YAML, REST APIs, Swagger, GraphQL

Remote


Principal Software Development Eng. - AI Performance in San Jose, California | Advanced Micro Devices, Inc

AMD

San Jose, CA · 105 days ago · $240,000

CUDA, HIP, Python, C++, LLVM, MLIR, Triton, Gluon, PyTorch, vLLM, SGLang, xDiT, Megatron-LM, Linux, GPU, HPC, AI systems, roofline analysis, performance engineering, multi-GPU communication

Hybrid


Sr. ML Engineer – ML & Applied AI

Gap Inc

Remote (San Francisco, CA) · 39 days ago

Python, scikit-learn, XGBoost, PyTorch, TensorFlow, FastAPI, Kubernetes, Docker, AWS, CI/CD, Git, SQL, Spark, Prometheus, Grafana, MLOps, LLMs, Vector databases, RAG, Agentic workflows

Remote
