Senior Software and System Architect
Nvidia
At a glance
AI generated
NVIDIA is hiring a Senior System Architect to tackle Failure Attribution at Scale within its accelerated computing division. The focus is an automated framework that captures high-fidelity state data from CPU and GPU clusters to identify job failures in real time. The role involves architecting flight recorders for EDA jobs, building diagnostics that correlate hardware faults with system-level events, implementing distributed logging and tracing, and creating machine-learning-based heuristics to classify failure types. The ideal candidate has a deep understanding of CPU architecture, proficiency in C++ and Python, experience with cluster resource managers such as Slurm or Kubernetes, expertise in Linux kernel diagnostics, and familiarity with NVIDIA's DCGM and NVML for GPU monitoring. The position also requires extensive knowledge of distributed systems and hands-on experience with automated RCA pipelines in HPC environments.
Market check
This $184,000–$287,500 range sits above 87% of similar postings on FindRole.
Peer median band
$141,720–$225,000
Median floor and ceiling across peers.
Typical midpoint (25–75%)
$142,437–$233,406
Middle half of comparable postings.
Based on 240 comparable postings.
* 240 is the maximum number of comparable postings sampled.
Employer
Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.
Industry: Semiconductors & AI Computing
Nvidia currently has 801 open roles on FindRole.
Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.