Senior Software Engineer, AI Resiliency
Nvidia
At a glance
Join NVIDIA’s AI Resiliency team as a Senior Software Engineer and lead the development of critical resiliency features for AI supercomputers with 100,000+ GPUs. Day to day, you will implement and optimize software features such as fast checkpoint recovery and error detection to improve system reliability at scale, contribute high-quality C++ and Python code to large-scale distributed systems, and develop monitoring tools that mitigate failures proactively. You will work closely with senior engineers and researchers to integrate resiliency into AI frameworks such as PyTorch and JAX/XLA, and support production deployments by debugging and tuning performance in cloud and HPC environments. Ideal candidates have a strong background in computer science or electrical engineering, 6+ years of relevant experience, proficiency in C++ and Python, and expertise in distributed systems, parallel programming, and fault tolerance. Experience with CUDA, NCCL, MPI, and AI frameworks is highly desirable.
Market check
This $184,000–$287,500 range sits above 78% of similar postings on FindRole.
Peer median band
$152,000–$237,600
Median floor and ceiling across peers.
Typical midpoint (25–75%)
$156,875–$235,750
Middle half of comparable postings.
Based on 240 comparable postings (240 is the maximum number of comparable postings sampled).
Employer
Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.
Industry: Semiconductors & AI Computing
Nvidia currently has 801 open roles on FindRole.
Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.