Principal Software Engineer, PyTorch Training Frameworks

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: San Jose, CA; Seattle, WA; Austin, TX
Posted: 148 days ago
Closes: Feb 26, 2027

Market check

Salary context

How this pay compares to similar roles

Pay chart: similar roles marked at $219k on a scale from $161k to $273k.

This listing doesn't post a salary. Most similar roles pay $191,000–$246,150.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 56 open roles on FindRole.

At a glance

TL;DR · Principal Software Engineer, PyTorch Training Frameworks

AMD seeks a Principal-level Software Development Engineer with expertise in PyTorch training frameworks to improve the performance and scalability of AI training on AMD Instinct accelerators. The role involves optimizing distributed training, resolving hardware-related issues, and contributing to upstream PyTorch projects. The ideal candidate will lead technical initiatives, mentor engineers, and work with strategic partners to ensure a robust developer experience. Key skills include deep knowledge of PyTorch internals, proficiency in Python and C/C++, experience with distributed training approaches such as DDP and FSDP, and strong performance engineering capabilities. Familiarity with AMD’s ROCm ecosystem and Linux-based environments is essential.

What you'll do

  • Act as technical authority for PyTorch training at AMD.
  • Improve and debug performance in areas such as DDP/FSDP and gradient checkpointing.
  • Partner with ROCm teams to resolve full-stack performance bottlenecks and issues.
  • Contribute to upstream PyTorch through design discussions and code contributions.
  • Develop and maintain benchmarks and profiling workflows for key models.
  • Lead investigations of performance regressions and correctness issues across teams.

What we're looking for

  • Deep experience with PyTorch internals and distributed training systems
  • Strong performance engineering skills including profiling, tracing, and memory optimization
  • Expertise in Python and C/C++ programming for large codebases
  • Familiarity with PyTorch ecosystem components such as TorchInductor, and with the CUDA/HIP programming models
  • Ability to lead technical discussions and influence architectural decisions across teams
  • Experience working in Linux-based environments with OS/hardware integration
  • Clear communication skills for design documentation, code reviews, and stakeholder updates

More like this

Similar roles

Fellow GPU Performance Optimization Engineer

AMD

San Jose, CA · 92 days ago
AMD GPU, ROCm, Nsight, Python, PyTorch, JAX, TensorFlow, Megatron-LM, Torchtitan, MaxText, NCCL, RCCL, C++, CUDA, HIP, Distributed Training, Performance Profiling, Bottleneck Analysis, Compiler Optimization, Graph-Level Optimization
Hybrid

Machine Learning Software Engineer

Apple Inc

Sunnyvale, CA · 100 days ago · $181,100–$318,400
Python, C++, Swift, iOS, macOS, Machine Learning, Computer Vision, Cloud Services, CI/CD, Docker, Kubernetes, Terraform, Git, Jupyter Notebook, TensorFlow, PyTorch, Scikit-learn, Pandas, NumPy