Senior System Architect, Infrastructure Reliability

Nvidia

Hybrid · Actively hiring
Santa Clara, CA · Westford, MA · Austin, TX · Durham, NC · Redmond, WA Posted 93 days ago $184,000–$287,500 / year

At a glance

AI generated

TL;DR

NVIDIA is hiring a Senior System Architect to address the challenge of Failure Attribution at Scale within its accelerated computing division, focusing on developing an automated framework that captures high-fidelity state data from CPU and GPU clusters to identify job failures in real time. This role involves architecting flight recorders for EDA jobs, building diagnostics to correlate hardware faults with system-level events, implementing distributed logging and tracing mechanisms, and creating machine-learning-based heuristics to classify failure types. The ideal candidate will have a deep understanding of CPU architecture, proficiency in C++ and Python, experience with cluster resource managers like Slurm or Kubernetes, expertise in Linux kernel diagnostics, and familiarity with NVIDIA’s DCGM and NVML for GPU monitoring. This position requires extensive knowledge of distributed systems and hands-on experience with automated RCA pipelines in HPC environments.

Skills

Python · C++ · Kubernetes · Slurm · NVIDIA DCGM · NVML · Linux kernel · CUDA · Prometheus · Grafana · CI/CD · Machine Learning · Docker · CRIU · Tracing tools · PostgreSQL · Redis · Mesos · Hadoop

What you'll do

  • Architect a scalable "flight recorder" for EDA jobs that captures high-fidelity state across CPU, GPU, and fabric at the moment of failure.
  • Build automated diagnostics correlating GPU XID errors with system-level events like OOM kills or NUMA-related hangs.
  • Implement low-overhead tracing mechanisms providing access to job execution data in multi-node Slurm or Kubernetes clusters.
  • Develop heuristics and models using machine learning to classify failures as hardware faults, software bugs, or environment issues.
  • Work with hardware and infrastructure teams to define signals of impending failure for proactive job migration or checkpointing.

What we're looking for

  • 6+ years of experience in systems programming or equivalent education in Computer Science/Electrical Engineering
  • Expertise in building automated RCA pipelines for HPC or cloud-scale environments
  • Deep knowledge of x86/ARM CPU architecture metrics and Linux kernel error reporting interfaces
  • Proficiency in C++ and Python for developing high-performance system monitoring daemons
  • Experience with cluster resource managers like Slurm or Kubernetes and their job lifecycle management
  • Expertise in NVIDIA DCGM and NVML for GPU health monitoring and state-dump capture
  • Familiarity with non-intrusive application health monitoring tools and checkpoint/restore technologies

Market check

Salary context

This $184,000–$287,500 range sits above 87% of similar postings on FindRole.

Peer median band

$141,720–$225,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,437–$233,406

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Principal Reliability Engineer

Medtronic

Usa-Mn Mounds View Central, US 14 days ago $163,200–$244,800
ISO 14971 · Post-Market Surveillance · Health Risk Assessments · Root Cause Analysis · Statistical Analysis · Technical Writing · Regulatory Compliance · Quality Management Systems · Mentorship · Risk Management · Field Corrective Action · Cross-Functional Collaboration · Executive Communication

Senior Core Infrastructure Engineer

Highnote

US 83 days ago $170,000–$230,000
GCP · AWS · Kubernetes · Istio · Python · Java · CI/CD · Prometheus · Grafana · Spanner · BigQuery · Dataflow · Pub/Sub

Senior/Staff System Architect

Qualcomm

Novi, MI, US 51 days ago $100,200–$150,200
SysML · MagicDraw · Enterprise_Architect · ISO26262 · ISO21434 · ISO21448 · C++ · Python · SCRUM · Kanban · CI/CD

Senior Reliability Engineer

JLL (Jones Lang LaSalle)

Remote (Usa-Client Jersey City Nj-Goldman Sachs - 30 Hudson, US) 45 days ago $140,000–$160,000
Excel · CMMS · EAM · ISO9001 · ISO55001 · RCM · CbM · PdM · BAS · PLC · SCADA · SQL · Python · R · Tableau · Building Automation Systems · Energy Management Platforms · Vibration Analysis · Oil Analysis · Infrared Thermography · Ultrasound · Motor Current Analysis · CMMS/EAM systems · Microsoft Excel
Remote

Senior Software Architect (Systems)

Boeing

Remote (Usa - Tukwila, Wa, US) 14 days ago $190,400–$257,600
Python · Kubernetes · Docker · CI/CD · MBSE · CAMEO · Agile · Model-Based Systems Engineering · Open Mission Systems · Groovy · Terraform · AWS · Git · Jenkins
Remote