Fellow GPU Performance Optimization Engineer in San Jose, California | Advanced Micro Devices, Inc

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: San Jose, CA
Salary: $268,000 / yr
Posted: 78 days ago
Closes: Mar 27, 2027

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: salary scale from $155k to $280k. Similar roles: about $206k. This role: $268k. A shaded band marks where most similar roles pay.]

This role pays more than 92% of similar roles. Most pay $177,187–$235,750 (the shaded band above). At the midpoint, this role pays about $268k versus about $206k for comparable roles.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets. Industry: Semiconductors

AMD currently has 65 open roles on FindRole.

Listed pay typically runs about $188,000 across 65 roles with salary data.

At a glance

TL;DR

As a Fellow GPU Performance Optimization Engineer on our Models and Applications team, you will lead the optimization of large-scale AI training workloads on AMD GPUs in both single-node and multi-node environments. Day to day, you will identify and resolve system bottlenecks across compute, memory, and communication, using advanced profiling and benchmarking to improve scalability and efficiency. You will collaborate with hardware, compiler, and framework teams to influence the design of next-generation GPU architectures and software stacks, and contribute to open-source projects that improve performance on AMD platforms. Ideal candidates have deep expertise in GPU architecture, distributed systems, and ML workloads; proficiency in Python and a systems language such as C++, CUDA, or HIP; and experience with frameworks such as PyTorch and TensorFlow. The role also calls for a strong understanding of communication libraries such as NCCL/RCCL and the ability to drive optimizations across the layers of the software stack.

What you'll do

  • Lead optimization of large-scale AI training on AMD GPUs for single-node and multi-node environments.
  • Identify and resolve system bottlenecks in compute, memory, and communication across GPU platforms.
  • Optimize distributed training strategies for scalability and efficiency on AMD hardware.
  • Drive cross-stack optimizations from kernels to ML frameworks for performance improvements.
  • Develop advanced profiling methodologies to measure and enhance GPU performance.
  • Influence next-generation GPU architecture and software stack design with hardware teams.

What we're looking for

  • Deep expertise in GPU architecture and performance optimization.
  • Proven experience optimizing large-scale distributed training workloads.
  • Strong understanding of communication libraries (such as NCCL/RCCL) and communication patterns.
  • Expertise in ML frameworks (such as PyTorch and TensorFlow) with a focus on performance tuning.
  • Proficiency in Python and systems languages like C++/CUDA/HIP.
  • Experience with compiler stacks and graph-level optimization preferred.
  • Demonstrated technical leadership and ability to influence cross-functional teams.

More like this

Similar roles

Principal Software Quality Engineer – GPU & Machine Learning in San Jose, California | Advanced Micro Devices, Inc

AMD

CA · 29 days ago · $210,400
Python, C++, GitHub, CI/CD, ROCm, PyTorch, JAX, TensorFlow, vLLM, CUDA, MPI, RCCL, NCCL, UCX, Libfabric, Linux, GPU, Distributed Systems, HPC, InfiniBand, Ethernet, BMC/IPMI, Thermal Management, Firmware, GitHub Actions, Jenkins, LLM-based Coding Agents, Trunk-Based Development, Shift-Left Testing, Observability, Feature Flags, Release Qualification Programs, PR Gating, Self-Hosted Runners, Required Status Checks, Open-Source Contribution, *Repositories, Fault Injection, RAS Telemetry, Long-Haul Stability
Hybrid

GPU Implementation Engineer (Austin & San Diego)

Qualcomm

Austin, TX +1 · 66 days ago · $161,800–$242,600
Design Compiler, Fusion Compiler, Genus, Innovus, Conformal LEC, Formality, PrimeTime, Tcl, Python, UPF, GPU microarchitecture, EDA tools, Power vector generation, Power analysis, Synthesis and place-and-route tools, Advanced process nodes