Senior System Architect, Infrastructure Reliability

Nvidia

Hybrid · Actively hiring

Santa Clara, CA · Westford, MA · Austin, TX · Durham, NC · Redmond, WA

Posted 93 days ago

$184,000–$287,500 / year

At a glance

AI generated

TL;DR

NVIDIA is hiring a Senior System Architect to tackle failure attribution at scale within its accelerated computing division. The focus is an automated framework that captures high-fidelity state data from CPU and GPU clusters to identify job failures in real time. The role involves architecting flight recorders for EDA jobs, building diagnostics that correlate hardware faults with system-level events, implementing distributed logging and tracing mechanisms, and developing machine-learning heuristics to classify failure types. The ideal candidate has a deep understanding of CPU architecture, proficiency in C++ and Python, experience with cluster resource managers such as Slurm or Kubernetes, expertise in Linux kernel diagnostics, and familiarity with NVIDIA’s DCGM and NVML for GPU monitoring. The position also requires extensive knowledge of distributed systems and hands-on experience with automated RCA pipelines in HPC environments.

Skills

Python · C++ · Kubernetes · Slurm · NVIDIA DCGM · NVML · Linux kernel · CUDA · Prometheus · Grafana · CI/CD · Machine Learning · Docker · CRIU · Tracing tools · PostgreSQL · Redis · Mesos · Hadoop

What you'll do

Architect a scalable "flight recorder" for EDA jobs that captures high-fidelity state across CPU, GPU, and fabric at the moment of failure.
Build automated diagnostics that correlate GPU XID errors with system-level events such as OOM kills or NUMA-related hangs.
Implement low-overhead tracing mechanisms that expose job execution data in multi-node Slurm or Kubernetes clusters.
Develop machine-learning heuristics and models to classify failures as hardware faults, software bugs, or environment issues.
Work with hardware and infrastructure teams to define signals of impending failure for proactive job migration or checkpointing.

What we're looking for

6+ years of experience in systems programming or equivalent education in Computer Science/Electrical Engineering
Expertise in building automated RCA pipelines for HPC or cloud-scale environments
Deep knowledge of x86/ARM CPU architecture metrics and Linux kernel error reporting interfaces
Proficiency in C++ and Python for developing high-performance system monitoring daemons
Experience with cluster resource managers such as Slurm and Kubernetes, including their job lifecycle management
Expertise in NVIDIA DCGM and NVML for GPU health monitoring and state-dump capture
Familiarity with non-intrusive application health monitoring tools and checkpoint/restore technologies

Market check

Salary context

This $184,000–$287,500 range sits above 87% of similar postings on FindRole.

Peer median band

$141,720–$225,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,437–$233,406

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

Similar roles

Senior Software and System Architect

Nvidia

Remote (Santa Clara, CA, US) · 114 days ago · $152,000–$241,500

C · Python · Linux · Docker · Kubernetes · DPDK · OVS · OVN

Senior Principal Reliability Engineer

Medtronic

USA-MN Mounds View Central, US · 14 days ago · $163,200–$244,800

ISO 14971 · Post-Market Surveillance · Health Risk Assessments · Root Cause Analysis · Statistical Analysis · Technical Writing · Regulatory Compliance · Quality Management Systems · Mentorship · Risk Management · Field Corrective Action · Cross-Functional Collaboration · Executive Communication

Senior Core Infrastructure Engineer

Highnote

US · 83 days ago · $170,000–$230,000

GCP · AWS · Kubernetes · Istio · Python · Java · CI/CD · Prometheus · Grafana · Spanner · BigQuery · Dataflow · Pub/Sub

Senior/Staff System Architect

Qualcomm

Novi, MI, US · 51 days ago · $100,200–$150,200

SysML · MagicDraw · Enterprise Architect · ISO 26262 · ISO 21434 · ISO 21448 · C++ · Python · Scrum · Kanban · CI/CD

Senior Reliability Engineer

JLL (Jones Lang LaSalle)

Remote (USA-Client Jersey City NJ-Goldman Sachs - 30 Hudson, US) · 45 days ago · $140,000–$160,000

Excel · CMMS · EAM · ISO 9001 · ISO 55001 · RCM · CbM · PdM · BAS · PLC · SCADA · SQL · Python · R · Tableau · Building Automation Systems · Energy Management Platforms · Vibration Analysis · Oil Analysis · Infrared Thermography · Ultrasound · Motor Current Analysis · CMMS/EAM systems · Microsoft Excel

Senior Software Architect (Systems)

Boeing

Remote (Tukwila, WA, US) · 14 days ago · $190,400–$257,600

Python · Kubernetes · Docker · CI/CD · MBSE · CAMEO · Agile · Model-Based Systems Engineering · Open Mission Systems · Groovy · Terraform · AWS · Git · Jenkins
