Senior Software Engineer, AI Resiliency

Nvidia

Actively hiring
Redmond, WA · Santa Clara, CA · Posted 17 days ago · $184,000–$287,500 / year

At a glance

AI generated

TL;DR

Join NVIDIA’s AI Resiliency team as a Senior Software Engineer, where you will lead the development of critical resiliency features for AI supercomputers with 100,000+ GPUs. Day to day, you will implement and optimize software features such as fast checkpoint-recovery and error detection to improve system reliability at scale. You will write high-quality C++ and Python code, collaborate with senior engineers and researchers to integrate resiliency into frameworks such as PyTorch and JAX/XLA, and develop automated tests for robustness and efficiency. The role requires proficiency in C++, Python, and distributed systems concepts, along with experience in AI frameworks, debugging tools, and large-scale AI clusters. It offers the chance to work on hard problems in high-performance computing and to improve the reliability of AI infrastructure.

Skills

C++ · Python · PyTorch · JAX/XLA · CUDA · NCCL · MPI · CI/CD · gdb · perf · valgrind · NVIDIA Nsight · Distributed Systems · Fault Tolerance · High-Performance Computing · Checkpointing Strategies

What you'll do

  • Develop and optimize software features for fast checkpoint-recovery in AI supercomputers.
  • Implement error detection and isolation techniques to enhance system reliability at scale.
  • Write high-quality C++ and Python code for large-scale distributed AI systems.
  • Assist in developing monitoring tools to proactively mitigate potential failures.
  • Contribute to CI/CD pipelines to automate validation of resiliency mechanisms.

What we're looking for

  • 6+ years of experience in software engineering with a focus on distributed systems.
  • Proficiency in C++ and Python for developing high-performance code.
  • Strong understanding of fault tolerance and parallel programming concepts.
  • Experience with AI frameworks such as PyTorch, JAX/XLA, or TensorFlow.
  • Familiarity with debugging tools such as gdb, perf, valgrind, and NVIDIA Nsight.
  • Hands-on experience in training models or working closely with model training teams.
  • Knowledge of checkpointing strategies and fault-tolerant computing in AI training.

Market check

Salary context

This $184,000–$287,500 range sits above 78% of similar postings on FindRole.

Peer median band

$152,000–$237,600

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$159,450–$235,750

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Software Engineer, AI Resiliency

Nvidia

Redmond, WA, US · 114 days ago · $184,000–$287,500
C++ · Python · PyTorch · JAX/XLA · CUDA · NCCL · MPI · CI/CD · gdb · perf · valgrind · NVIDIA Nsight · Distributed Systems · Fault Tolerance · Checkpointing Strategies · HPC · Cloud Computing

Senior Software Engineer, Agentic AI

Nvidia

Redmond, WA, US · 31 days ago · $152,000–$241,500
Python · C++ · Rust · CUDA · Nsight Compute · Nsight Systems · TRT-LLM · SGLang · vLLM · Transformer Engine · Docker · Kubernetes · CI/CD · PostgreSQL · Git · GitHub · NVIDIA GPUs · Agentic AI · Distributed Systems

Senior Software Engineer, AI Networking

Nvidia

Santa Clara, CA, US · 15 days ago · $152,000–$241,500
Python · PyTorch · TensorFlow · JAX · CUDA · NCCL · Reinforcement Learning · Bayesian Optimization · GNNs · Docker · Kubernetes · CI/CD · Prometheus · Grafana · Bash · C++ · PostgreSQL · Redis

AI Software Engineer, Senior

Booz Allen Hamilton

Laurel, Maryland, US · 42 days ago · $86,800–$198,000
Python · Java · C++ · JavaScript · TypeScript · LLM-powered developer tools · CI/CD · DevOps · VS Code · Kubernetes · Docker · GitHub · GitLab · Jenkins · Agentic AI frameworks · Orchestration systems · Cloud services · PostgreSQL · MongoDB

AI Software Engineer, Senior

Booz Allen Hamilton

US · 42 days ago · $86,800–$198,000
Python · Rust · Go · Scala · Java · GitLab CI · Jenkins · Git · Linux · Docker · Podman · AWS · LocalStack · ESXi · Ansible · Kubernetes · SIEM · Security+ · Linux+

Senior Software Engineer (AI Platform)

Smartly

US · 42 days ago
Python · TypeScript · PostgreSQL · Node.js · Docker · Kubernetes · React · AWS · GCP · CI/CD · MLOps · PyTorch · TensorFlow · MLflow · Kubeflow