Principal Software Engineer, At-Scale Reliability and Fleet Intelligence

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA
Salary: $272,000–$431,250 / yr
Posted: 3 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: midpoint pay is about $207k for similar roles and about $352k for this role, on a scale from $123k to $464k. A shaded band marks where most similar roles pay.]

This role pays more than 99% of similar roles. Most pay $184,000–$230,160 — the shaded band above. At the midpoint, this role pays about $352k versus about $207k for comparable roles.

Based on 239 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 966 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 955 roles with salary data.


At a glance

TL;DR · Principal Software Engineer, At-Scale Reliability and Fleet Intelligence


As a Principal Software Engineer on the CSP Engagements team, you will focus on enhancing fleet-scale reliability by working closely with engineering teams at key cloud service providers. Your daily tasks include driving work streams to build shared understanding of reliability architecture, gathering and synthesizing fleet data to identify systemic issues, and defining consistent MTBI measurement methodologies across different environments. You will also conduct failure pattern analysis using statistical methods, integrate health monitoring systems, and collaborate on burn-in test criteria with quality teams. Essential skills include deep expertise in multi-NUMA system software, fleet-level telemetry, hardware failure modes, and predictive maintenance approaches. This role requires 15+ years of experience in large-scale reliability engineering or datacenter systems software, along with a strong background in statistical analysis and customer-focused problem-solving.

Skills

Pareto, Weibull, time-series databases, anomaly detection, health scoring, event correlation, NVIDIA GPU error taxonomy, Xid errors, NVLink error counters, thermal events, CPER records, predictive failure models, fleet reliability, MTBF, MTBI, burn-in testing, stress testing, certification frameworks, hardware health telemetry pipelines

What you'll do

Drive reliability work streams with CSP engineering teams to ensure shared understanding of MTBI methodology.
Gather and synthesize CSP fleet reliability data to identify cross-customer failure patterns.
Define consistent MTBI measurement methodologies that adapt across different CSP environments.
Conduct statistical analysis on fleet-scale failures to classify issues as systemic or environmental.
Drive integration of NVIDIA's health monitoring systems with CSP operational workflows.
Collaborate with quality teams to define and validate burn-in reliability test criteria.
Develop predictive failure models using fleet telemetry data and validate their effectiveness.

What we're looking for

15+ years of experience in datacenter-scale systems software or reliability engineering.
BS or MS in Computer Science, Electrical Engineering, Statistics, or related field.
Expertise in multi-NUMA, rack-scale system software, firmware, and statistical failure analysis.
Experience with fleet-level telemetry, observability systems, and hardware failure modes.
Background in defining burn-in, stress testing, and certification frameworks for complex systems.
Strong communication skills to present reliability findings to technical and executive audiences.
Customer obsession and passion for translating fleet reliability challenges into engineering priorities.

Similar roles

Principal Software Engineer, Rack-Scale System Software

Nvidia

Remote (Santa Clara, CA) +1 · 3 days ago · $272,000–$431,250

NVIDIA NVSwitch, GPU fabric management, cluster management, fleet orchestration, health monitoring APIs, telemetry systems, firmware update orchestration, error handling frameworks, system-level orchestration, fabric management software, multi-component coordination, distributed systems error handling, health scoring, event correlation, API design, GPU drivers, device management, power management, NVIDIA NVOS, CI/CD


Principal Software Engineer, E2E Performance and Goodput

Nvidia

Remote (Santa Clara, CA) +2 · 3 days ago · $272,000–$431,250

Nsight, Nsight Compute, DCGM metrics, Python, pandas, CI/CD, NVIDIA DGX, NVIDIA HGX, NVLink, Megatron-LM, DeepSpeed, FSDP, TensorRT, vLLM, SGLang, AWS, Azure, GCP, Kubernetes, Prometheus, Grafana


Principal Software Engineer, Rack Scale Systems Infrastructure

Nvidia

Remote (Santa Clara, CA) +3 · 6 days ago · $272,000–$431,250

Kubernetes, Rust, Go, C++, Linux, InfiniBand, Ethernet, RDMA, BMCs, Redfish, IPMI, Raspberry Pi OS, Git, API design, Documentation, AI-assisted development tools, CI/CD, Docker, Networking protocols, Firmware lifecycle, Security protocols, Open source software


Principal Software Engineer, GPU Firmware and GPU System Software

Nvidia

Remote (Santa Clara, CA) +2 · 3 days ago · $272,000–$431,250

NVIDIA GPU VBIOS, InfoROM, microcontroller firmware, firmware update lifecycle management, multi-GPU fabric architectures, NVLink, GPU driver stack, Xid errors, thermal events, power events, ECC counters, secure boot chain, code signing, attestation, debug authentication, multi-tenancy isolation, GPU power management architecture, CI/CD


Leader, Software Engineering

Cisco

Milpitas, CA · 14 days ago · $183,800–$263,600

Python, PyTest, SONiC, BGP, ECMP, VXLAN, L2/L3 networking, RDMA, HPC networks, RoCE, InfiniBand, CI/CD

Hybrid


Senior Principal Software Engineer, Cloud Infrastructure Security and Distributed Systems

Oracle

Nashville, TN +1 · 31 days ago · $96,800–$306,400

Python, Go, Java, C++, Docker, Kubernetes, Terraform, Linux, PostgreSQL, CI/CD, AWS, Oracle, Zero Trust, Consensus Protocols, Formal Verification methods
