Fellow GPU Performance Optimization Engineer in San Jose, California | Advanced Micro Devices, Inc

AMD

Quick summary

Work type: Hybrid
Location: San Jose, CA
Salary: $268,000 / yr
Posted: 78 days ago
Closes: Mar 27, 2027
Nearby: 99+ roles within 25 mi

Market check

Salary context

Above market

How this pay compares to similar roles

[Pay chart: similar roles $206k; this role $268k; scale from $155k to $280k, with a shaded band labeled "most similar roles pay here"]

This role pays more than 92% of similar roles. Most pay $177,187–$235,750, the shaded band in the pay chart. This role pays about $268k versus about $206k at the midpoint for comparable roles.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 65 open roles on FindRole.

Listed pay is typically $188,000 across 65 roles with salary data.

At a glance

As a Fellow GPU Performance Optimization Engineer on the Models and Applications team, you will lead the optimization of large-scale AI training workloads on AMD GPUs in both single-node and multi-node environments. Day to day, you will identify and resolve system bottlenecks across compute, memory, and communication, using advanced profiling and benchmarking to improve scalability and efficiency. You will collaborate with hardware, compiler, and framework teams to influence the design of next-generation GPU architectures and software stacks, and contribute to open-source projects that improve performance on AMD platforms. Ideal candidates have deep expertise in GPU architecture, distributed systems, and ML workloads, proficiency in Python and a systems language such as C++, CUDA, or HIP, and experience with frameworks like PyTorch and TensorFlow. The role also requires a strong understanding of communication libraries such as NCCL and RCCL and the ability to drive optimizations across every layer of the software stack.

Skills

AMD GPU, ROCm, Nsight, Python, PyTorch, JAX, TensorFlow, Megatron-LM, Torchtitan, MaxText, NCCL, RCCL, C++, CUDA, HIP, Distributed Training, Performance Profiling, Bottleneck Analysis, Compiler Optimization, Graph-Level Optimization

What you'll do

Lead optimization of large-scale AI training on AMD GPUs for single-node and multi-node environments.
Identify and resolve system bottlenecks in compute, memory, and communication across GPU platforms.
Optimize distributed training strategies for scalability and efficiency on AMD hardware.
Drive cross-stack optimizations from kernels to ML frameworks for performance improvements.
Develop advanced profiling methodologies to measure and enhance GPU performance.
Influence next-generation GPU architecture and software stack design with hardware teams.

What we're looking for

Deep expertise in GPU architecture and performance optimization.
Proven experience optimizing large-scale distributed training workloads.
Strong understanding of communication libraries and patterns.
Expertise in ML frameworks with a focus on performance tuning.
Proficiency in Python and systems languages like C++/CUDA/HIP.
Experience with compiler stacks and graph-level optimization preferred.
Demonstrated technical leadership and ability to influence cross-functional teams.

Similar roles

Principal / Senior GPU SW Performance Engineer — Post‑Training in San Jose, California | Advanced Micro Devices, Inc

AMD

CA · 43 days ago · $204,000

Python, C++, PyTorch, ROCm, HIP, Triton, torch.distributed, FSDP, ZeRO, CUDA, Docker, Kubernetes, Git, GitHub, Jenkins, Slack, Zoom, Markdown, Confluence, Bash, SQL, PostgreSQL, Prometheus, Grafana, CI/CD

Hybrid

Principal / Senior GPU SW Performance Engineer — Post‑Training in San Jose, California | Advanced Micro Devices, Inc

AMD

CA · 80 days ago · $240,000

Python, PyTorch, C++, ROCm, HIP, AMD Instinct GPUs, Distributed training, Multi-GPU/multi-node, SFT, LoRA, RL-based training, torch.distributed, FSDP, ZeRO, Distributed systems, Collective communication libraries, CI/CD

Hybrid

Senior GPU Product Application Engineer in Austin, Texas | Advanced Micro Devices, Inc

AMD

Santa Clara, CA +1 · 93 days ago · $161,200

PCIe, GPU, HPC, OEM, AMD Instinct™ Accelerators, Linux, Ubuntu, CentOS, RHEL, SLES, Shell, BASH, C, C++, Python

Principal Engineer - GPU Software Architect in Santa Clara, California | Advanced Micro Devices, Inc

AMD

Remote (US) · 106 days ago · $240,000

C, C++, Python, GPU, AI-assisted software development tools, ASIC, Firmware, Drivers, Performance modeling, Simulators, Low-level debugging tools

Remote

Principal Software Quality Engineer – GPU & Machine Learning in San Jose, California | Advanced Micro Devices, Inc

AMD

CA · 29 days ago · $210,400

Python, C++, GitHub, CI/CD, ROCm, PyTorch, JAX, TensorFlow, vLLM, CUDA, MPI, RCCL, NCCL, UCX, Libfabric, Linux, GPU, Distributed Systems, HPC, InfiniBand, Ethernet, BMC/IPMI, Thermal Management, Firmware, GitHub Actions, Jenkins, LLM-based Coding Agents, Trunk-Based Development, Shift-Left Testing, Observability, Feature Flags, Release Qualification Programs, PR Gating, Self-Hosted Runners, Required Status Checks, Open-Source Contribution *Repositories, Fault Injection, RAS Telemetry, Long-Haul Stability

Hybrid

GPU Implementation Engineer (Austin & San Diego)

Qualcomm

Austin, TX +1 · 66 days ago · $161,800–$242,600

Design Compiler, Fusion Compiler, Genus, Innovus, Conformal LEC, Formality, PrimeTime, Tcl, Python, UPF, GPU microarchitecture, EDA tools, Power vector generation, Power analysis, Synthesis and place-and-route tools, Advanced process nodes
