Senior Software Engineer, AI Resiliency
Nvidia
At a glance
Join NVIDIA’s AI Resiliency team as a Senior Software Engineer, where you will lead the development of critical resiliency features for AI supercomputers with 100,000+ GPUs. Your daily tasks include implementing and optimizing software features such as fast checkpoint-recovery and error detection to enhance system reliability at scale. You will write high-quality C++ and Python code, collaborate with senior engineers and researchers to integrate resiliency into frameworks like PyTorch and JAX/XLA, and develop automated tests for robustness and efficiency. Proficiency in C++, Python, and distributed systems concepts is essential, along with experience in AI frameworks, debugging tools, and large-scale AI clusters. This role offers the chance to work on cutting-edge challenges in high-performance computing and make significant contributions to advancing AI infrastructure reliability.
Market check
This $184,000–$287,500 range sits above 78% of similar postings on FindRole.
Peer median band
$152,000–$237,600
Median floor and ceiling across peers.
Typical midpoint (25–75%)
$159,450–$235,750
Middle half of comparable postings.
Based on 240 comparable postings.
* 240 is the maximum number of comparable postings sampled.
Employer
Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.
Industry: Semiconductors & AI Computing
Nvidia currently has 801 open roles on FindRole.
Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.