Distinguished Resiliency and Safety Architect, GPU Diagnostics

Nvidia

Quick summary

Work type
On-site
Location
Santa Clara, CA
Salary
$320,000–$488,750 / yr
Posted
103 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Pay comparison chart: similar roles at about $206k, this role at about $404k, on a scale from $130k to $527k; the shaded band marks where most similar roles pay.]

This role pays more than 99% of similar roles. Most pay $177,250–$235,750 — the shaded band above. At the midpoint, this role pays about $404k versus about $206k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Distinguished Resiliency and Safety Architect, GPU Diagnostics

As a Distinguished Resiliency and Safety Architect for GPU Diagnostics at NVIDIA, you will develop diagnostics software that stress-tests GPUs and SOCs to identify hardware defects across a range of configurations. Day to day, you will enhance diagnostic suites so that failures are detected more reliably and test time is shorter, and you will build tests that comply with ISO 26262 in automotive functional safety contexts. You will work closely with architecture, RTL, and verification teams to keep diagnostics robust and correct across GPU generations. Requirements include a Master’s or PhD degree in Computer Science or a related field, at least 15 years of relevant experience, proficiency in C/C++ and CUDA programming, and scripting in Python. Familiarity with GPU architectures, machine learning concepts, and silent data corruption mechanisms is highly desirable. The role calls for expertise in high-performance computing systems and the ability to debug complex system-level issues, and it suits candidates interested in real-time computing platforms for autonomous vehicles and industrial robotics.

What you'll do

  • Design and develop diagnostic software for stress testing NVIDIA GPUs and SOCs to identify hardware defects.
  • Enhance diagnostics to make detected failures more repeatable and to optimize test time based on observed silicon failures.
  • Create tests for GPUs in automotive functional safety contexts, adhering to ISO 26262 standards.
  • Investigate silent data corruption and intermittent faults in field returns to establish root causes and improve detection.
  • Support deployment of diagnostics in pre-production qualification environments and in large-scale production use.

What we're looking for

  • At least 15 years of relevant experience in high-performance computing systems.
  • Master’s or PhD degree in Computer Science, Engineering, or related field.
  • Proficiency in C/C++, CUDA programming, and scripting with Python.
  • In-depth understanding of GPU/SOC architectures and hardware failure mechanisms.
  • Ability to reason across hardware/software boundaries for complex debugging.
  • Experience in embedded software development and stress testing for reliability.
  • Familiarity with ISO 26262 standards for automotive functional safety.

More like this

Similar roles

Principal System Architect, GPU

Nvidia

Remote (Santa Clara, CA) · 23 days ago · $272,000–$431,250
SoC, GPU, AI, RTL, Verification, Physical design, Firmware, Software, Memory architecture, Power management, High-level programming languages, Silicon bring-up, Debugging, Documentation, Networking, Multi-GPU systems, Packaging technologies, AI workload characteristics, CI/CD