Senior Software Engineer, AI Resiliency

Nvidia

Actively hiring
Redmond, WA · Santa Clara, CA · Posted 114 days ago · $184,000–$287,500 / year

At a glance

AI generated

TL;DR

Join NVIDIA’s AI Resiliency team as a Senior Software Engineer, where you will lead the development of critical resiliency features for AI supercomputers with 100,000+ GPUs. Daily work includes implementing and optimizing features such as fast checkpoint-recovery and error detection to improve reliability at scale, contributing high-quality C++ and Python code to large-scale distributed systems, and developing monitoring tools that mitigate failures proactively. You will work closely with senior engineers and researchers to integrate resiliency into AI frameworks such as PyTorch and JAX/XLA, and support production deployments by debugging and tuning performance in cloud and HPC environments. Ideal candidates have a strong background in computer science or electrical engineering, 6+ years of relevant experience, proficiency in C++ and Python, and expertise in distributed systems, parallel programming, and fault tolerance. Experience with CUDA, NCCL, MPI, and AI frameworks is highly desirable.

Skills

C++ · Python · PyTorch · JAX/XLA · CUDA · NCCL · MPI · CI/CD · gdb · perf · valgrind · NVIDIA Nsight · Distributed Systems · Fault Tolerance · Checkpointing Strategies · HPC · Cloud Computing

What you'll do

  • Develop and optimize software features for fast checkpoint-recovery in AI supercomputers.
  • Implement error detection and isolation techniques to enhance AI system reliability.
  • Write high-quality C++ and Python code for large-scale distributed AI systems.
  • Assist in developing monitoring tools to proactively mitigate system failures.
  • Contribute to CI/CD pipelines for automated validation of AI workloads.

What we're looking for

  • 6+ years of experience in software engineering with a focus on distributed systems.
  • Proficiency in C++ and Python for developing high-performance code.
  • Strong understanding of fault tolerance and parallel programming concepts.
  • Experience with AI frameworks such as PyTorch, JAX/XLA, or TensorFlow.
  • Familiarity with debugging tools such as gdb, perf, valgrind, and NVIDIA Nsight.
  • Knowledge of checkpointing strategies and error mitigation in large-scale systems.
  • Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing.

Market check

Salary context

This $184,000–$287,500 range sits above 78% of similar postings on FindRole.

Peer median band

$152,000–$237,600

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$156,875–$235,750

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Software Engineer, AI Resiliency

Nvidia

Redmond, WA, US · 17 days ago · $184,000–$287,500
C++ · Python · PyTorch · JAX/XLA · CUDA · NCCL · MPI · CI/CD · gdb · perf · valgrind · NVIDIA Nsight · Distributed Systems · Fault Tolerance · High-Performance Computing · Checkpointing Strategies

Senior Software Engineer, Agentic AI

Nvidia

Redmond, WA, US · 31 days ago · $152,000–$241,500
Python · C++ · Rust · CUDA · Nsight Compute · Nsight Systems · TRT-LLM · SGLang · vLLM · Transformer Engine · Docker · Kubernetes · CI/CD · PostgreSQL · Git · GitHub · NVIDIA GPUs · Agentic AI · Distributed Systems

Senior Software Engineer, AI Networking

Nvidia

Santa Clara, CA, US · 15 days ago · $152,000–$241,500
Python · PyTorch · TensorFlow · JAX · CUDA · NCCL · Reinforcement Learning · Bayesian Optimization · GNNs · Docker · Kubernetes · CI/CD · Prometheus · Grafana · Bash · C++ · PostgreSQL · Redis

AI Software Engineer, Senior

Booz Allen Hamilton

Laurel, Maryland, US · 42 days ago · $86,800–$198,000
Python · Java · C++ · JavaScript · TypeScript · LLM-powered developer tools · CI/CD · DevOps · VS Code · Kubernetes · Docker · GitHub · GitLab · Jenkins · Agentic AI frameworks · Orchestration systems · Cloud services · PostgreSQL · MongoDB

AI Software Engineer, Senior

Booz Allen Hamilton

US · 42 days ago · $86,800–$198,000
Python · Rust · Go · Scala · Java · GitLab CI · Jenkins · Git · Linux · Docker · Podman · AWS · LocalStack · ESXi · Ansible · Kubernetes · SIEM · Security+ · Linux+

Senior Software Engineer (AI Platform)

Smartly

US · 42 days ago
Python · TypeScript · PostgreSQL · Node.js · Docker · Kubernetes · React · AWS · GCP · CI/CD · MLOps · PyTorch · TensorFlow · MLflow · Kubeflow