Principal / Senior GPU Software Performance Engineer, Post-Training

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: CA
Posted: 94 days ago
Closes: Mar 25, 2027
Nearby: 99+ roles within 25 mi

Market check

Salary context

How this pay compares to similar roles

Similar roles: about $211k (chart scale runs from $155k to $270k)

This listing doesn't post a salary. Most similar roles pay $185,943–$235,750.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 56 open roles on FindRole.

At a glance

TL;DR · Principal / Senior GPU Software Performance Engineer, Post-Training

As a senior software engineer on the AMD Instinct GPU team, you will drive performance for post-training workloads by optimizing finetuning and reinforcement learning (RL) training solutions across data loaders, kernels, distributed training, and compilers. Your daily tasks include enhancing throughput, memory efficiency, and stability in multi-GPU/multi-node environments while contributing efficient kernels and targeted graph-level optimizations. You will profile, diagnose, and resolve bottlenecks using standard tooling to prevent regressions in continuous integration (CI) systems, ensuring reproducible pipelines and documentation are adopted by internal teams and external developers. Ideal candidates have experience with GPU performance engineering for deep learning on ROCm/HIP or similar platforms, hands-on expertise with SFT, LoRA, and RL-based training at scale, strong PyTorch skills, proficiency in Python and C++, and a track record of turning profiles into fixes and documenting results.

Skills

Python, PyTorch, C++, ROCm, HIP, AMD Instinct GPUs, Distributed training, Multi-GPU/multi-node, SFT, LoRA, RL-based training, torch.distributed, FSDP, ZeRO, Distributed systems, Collective communication libraries, CI/CD

What you'll do

Lead performance optimization for finetuning and RL training on AMD GPUs.
Enhance throughput, memory efficiency, and stability in multi-GPU/multi-node setups.
Develop efficient kernels and graph-level optimizations for deep learning frameworks.
Profile and resolve bottlenecks using standard tooling to prevent CI regressions.
Ship reproducible pipelines and documentation for internal and external use.

What we're looking for

Proven GPU performance engineering experience for deep learning (ROCm/HIP, Triton, etc.)
Hands-on experience with SFT, LoRA, and RL-based training at scale
Strong PyTorch expertise including torch.distributed, FSDP/ZeRO or equivalent
Proficient in Python and C++; capable of reading/writing kernels
Experience optimizing multi-GPU/multi-node training and communication patterns
Track record of profiling, diagnosing, and resolving performance bottlenecks

Similar roles

Principal / Senior GPU Software Performance Engineer, Post-Training

AMD

CA 57 days ago

Python, C++, PyTorch, ROCm, HIP, Triton, torch.distributed, FSDP, ZeRO, CUDA, Docker, Kubernetes, Git, GitHub, Jenkins, Slack, Zoom, Markdown, Confluence, Bash, SQL, PostgreSQL, Prometheus, Grafana, CI/CD

Hybrid

Fellow GPU Performance Optimization Engineer

AMD

San Jose, CA 92 days ago

AMD GPU, ROCm, Nsight, Python, PyTorch, JAX, TensorFlow, Megatron-LM, Torchtitan, MaxText, NCCL, RCCL, C++, CUDA, HIP, Distributed Training, Performance Profiling, Bottleneck Analysis, Compiler Optimization, Graph-Level Optimization

Hybrid

Senior Systems Software Engineer, GPU Performance at Scale

Nvidia

Remote (Santa Clara, CA) 6 days ago $184,000–$287,500

CUDA, Slurm, Python, C, C++, Bash, Docker, Linux, Container Technology, Virtualization, HPC Environments, Cloud Platform Solutions, CI/CD

Remote

Principal Software Engineer, PyTorch Training Frameworks

AMD

San Jose, CA +2 148 days ago

PyTorch, Python, C++, DDP, FSDP, NCCL, RCCL, CUDA, HIP, Linux, CI/CD, Docker, Git, Triton, TorchInductor, torch.compile, Profiling, Tracing, Memory Optimization, Performance Engineering

Hybrid

Principal ML Engineer, Large Scale Training Performance Optimization

AMD

San Jose, CA +1 96 days ago

PyTorch, TensorFlow, JAX, Megatron-LM, MaxText, TorchTitan, Python, C++, Distributed Training, GPU Kernel Optimization, CI/CD, Prometheus, Grafana

Hybrid

Senior Silicon Design Engineer, Hardware-Software Co-Design, GPU Compute and AI Compilers & Runtimes

AMD

Washington 82 days ago

C, C++, Python, CUDA, HIP, Linux, DRM/GEM, ROCm, IREE, PyTorch, Triton, LLVM, MLIR, GPU, HPC, AI Compiler, Performance Modeling, System-Level Programming, Kernel-Mode Drivers, Memory Management, Framework Internals, Quantization Techniques, Attention Mechanisms, Mixture-of-Experts

Hybrid
