Principal Software Engineer, PyTorch Training Frameworks

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: San Jose, CA; Seattle, WA; Austin, TX
Posted: 148 days ago
Closes: Feb 26, 2027
Nearby: 99+ roles within 25 mi

Market check

Salary context

How this pay compares to similar roles

Similar roles: about $219k (chart range $161k to $273k)

This listing doesn't post a salary. Most similar roles pay $191,000–$246,150.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets. Industry: Semiconductors

AMD currently has 56 open roles on FindRole.


At a glance

TL;DR · Principal Software Engineer, PyTorch Training Frameworks


AMD seeks a Principal-level Software Development Engineer with expertise in PyTorch training frameworks to enhance performance and scalability of AI training on AMD Instinct accelerators. This role involves optimizing distributed training, resolving hardware-related issues, and contributing to upstream PyTorch projects. The ideal candidate will lead technical initiatives, mentor engineers, and engage with strategic partners to ensure robust developer experiences. Key skills include deep knowledge of PyTorch internals, proficiency in Python and C/C++, experience with distributed training concepts like DDP and FSDP, and strong performance engineering capabilities. Familiarity with AMD’s ROCm ecosystem and Linux-based environments is essential for driving impactful solutions at scale.

Skills

PyTorch Python C++ DDP FSDP NCCL RCCL CUDA HIP Linux CI/CD Docker Git Triton TorchInductor torch.compile Profiling Tracing Memory Optimization Performance Engineering

What you'll do

Act as technical authority for PyTorch training at AMD.
Improve and debug performance in areas like DDP/FSDP and gradient checkpointing.
Partner with ROCm teams to resolve full-stack performance bottlenecks and issues.
Contribute to upstream PyTorch through design discussions and code contributions.
Develop and maintain benchmarks and profiling workflows for key models.
Lead investigations of performance regressions and correctness issues across teams.

What we're looking for

Deep experience with PyTorch internals and distributed training systems
Strong performance engineering skills including profiling, tracing, and memory optimization
Expertise in Python and C/C++ programming for large codebases
Familiarity with PyTorch ecosystem components like TorchInductor and with the CUDA/HIP programming models
Ability to lead technical discussions and influence architectural decisions across teams
Experience working in Linux-based environments with OS/hardware integration
Clear communication skills for design documentation, code reviews, and stakeholder updates

Similar roles

Senior Software Engineer, PyTorch, Deep Learning

Nvidia

Remote · 6 days ago · $152,000–$241,500

PyTorch C++ CUDA Python Distributed Parallel Programming Thread Deep Learning Compilers Deep Learning Modeling Trends Multi-disciplinary Teams CI/CD

Remote


Principal / Senior GPU Software Performance Engineer, Post-Training

AMD

CA · 57 days ago

Python C++ PyTorch ROCm HIP Triton torch.distributed FSDP ZeRO CUDA Docker Kubernetes Git GitHub Jenkins Slack Zoom Markdown Confluence Bash SQL PostgreSQL Prometheus Grafana CI/CD

Hybrid


Fellow GPU Performance Optimization Engineer

AMD

San Jose, CA · 92 days ago

AMD GPU ROCm Nsight Python PyTorch JAX TensorFlow Megatron-LM Torchtitan MaxText NCCL RCCL C++ CUDA HIP Distributed Training Performance Profiling Bottleneck Analysis Compiler Optimization Graph-Level Optimization

Hybrid


Machine Learning Software Engineer

Apple Inc

Sunnyvale, CA · 100 days ago · $181,100–$318,400

Python C++ Swift iOS macOS Machine Learning Computer Vision Cloud Services CI/CD Docker Kubernetes Terraform Git Jupyter Notebook TensorFlow PyTorch Scikit-learn Pandas NumPy


Principal Software Engineer, Machine Learning Simulations

Upstart

Remote (Canada) · 25 days ago · $195,300–$270,400

Python AWS Kubernetes Docker Flask FastAPI MLflow Metaflow gRPC Kafka Spark PySpark Redshift Terraform CI/CD MLOps Ray Prometheus Grafana

Remote


Senior Software Engineer, Developer Tools for Deep Learning

Nvidia

Remote (MA) · 23 days ago · $152,000–$241,500

PyTorch TensorFlow JAX Python C++ ONNX TensorRT CI/CD Docker Kubernetes GitHub NVIDIA_Deep_Learning_Stacks PostgreSQL MongoDB Git Linux CUDA REST_APIs Swagger

Remote
