Principal Software Engineer, Rack-Scale System Software

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA; Austin, TX
Salary: $272,000–$431,250 / yr
Posted: 3 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

Pay comparison chart: the scale runs from $119k to $465k, with similar roles at about $209k and this role at about $352k.

This role pays more than 99% of similar roles. Most similar roles pay $187,475–$231,000. At the midpoint, this role pays about $352k versus about $209k for comparable roles.

Based on 239 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 966 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 955 roles with salary data.

At a glance

TL;DR · Principal Software Engineer, Rack-Scale System Software

As a Principal Software Engineer on the CSP Engagements team at NVIDIA, you will serve as the technical liaison for rack-scale system software and firmware, collaborating with engineering teams to ensure reliable deployment and operation of systems at fleet scale. Your daily tasks include driving architecture alignment across engagements, managing technical work streams with CSPs, capturing operational feedback, and integrating customer requirements into development processes. You will also identify cross-CSP patterns in SW/FW issues and drive improvements in documentation and tooling. The ideal candidate has over 15 years of experience in system software or large-scale distributed systems engineering, with expertise in fabric management, cluster orchestration, health monitoring APIs, and GPU system software. Familiarity with NVIDIA NVSwitch and GPU fabric management is a plus. This role requires deep technical knowledge and strong communication skills to mentor customer teams effectively.

What you'll do

  • Drive alignment on rack-scale SW/FW architecture across CSP engagements.
  • Lead technical work streams with CSP teams to ensure deep understanding of system software.
  • Capture and synthesize CSP engineering feedback for NVIDIA's architecture decisions.
  • Identify cross-CSP patterns in rack-scale issues and drive improvements in documentation.
  • Collaborate on left-shift strategy to identify and complete SW/FW integration early.

What we're looking for

  • 15+ years of experience in system software, platform firmware, or large-scale distributed systems engineering.
  • Deep understanding of rack-scale system challenges including multi-component coordination and health monitoring.
  • Experience with fabric management, cluster management, and system-level orchestration frameworks.
  • Understanding of error handling and recovery design patterns in distributed systems.
  • Familiarity with health monitoring and telemetry systems for fleet-level observability.
  • Proven technical leadership across organizational boundaries and strong communication skills.
  • Background in system software for large-scale clusters at a hyperscaler.

More like this

Similar roles

Principal Software Engineer, At-Scale Reliability and Fleet Intelligence

Nvidia

Santa Clara, CA · 3 days ago · $272,000–$431,250
Pareto, Weibull, time-series databases, anomaly detection, health scoring, event correlation, NVIDIA GPU error taxonomy, Xid errors, NVLink error counters, thermal events, CPER records, predictive failure models, fleet reliability, MTBF, MTBI, burn-in testing, stress testing, certification frameworks, hardware health telemetry pipelines

Principal Software Engineer, Rack Scale Systems Infrastructure

Nvidia

Remote (Santa Clara, CA) +3 · 6 days ago · $272,000–$431,250
Kubernetes, Rust, Go, C++, Linux, InfiniBand, Ethernet, RDMA, BMCs, Redfish, IPMI, Raspberry Pi OS, Git, API design, Documentation, AI-assisted development tools, CI/CD, Docker, Networking protocols, Firmware lifecycle, Security protocols, Open source software

Principal Software Engineer, E2E Performance and Goodput

Nvidia

Remote (Santa Clara, CA) +2 · 3 days ago · $272,000–$431,250
nsight, nsight_compute, DCGM_metrics, Python, pandas, CI/CD, NVIDIA_DGX, NVIDIA_HGX, NVLink, Megatron_LM, DeepSpeed, FSDP, TensorRT, vLLM, SGLang, AWS, Azure, GCP, Kubernetes, Prometheus, Grafana

Principal Software Engineer, GPU Firmware and GPU System Software

Nvidia

Remote (Santa Clara, CA) +2 · 3 days ago · $272,000–$431,250
NVIDIA GPU, VBIOS, InfoROM, microcontroller firmware, firmware update lifecycle management, multi-GPU fabric architectures, NVLink, GPU driver stack, Xid errors, thermal events, power events, ECC counters, secure boot chain, code signing, attestation, debug authentication, multi-tenancy isolation, GPU power management architecture, CI/CD

Senior Software Engineer, NVLink Rack Scale Stability and Reliability

Nvidia

Remote (Santa Clara, CA) · 6 days ago · $152,000–$241,500
Python, C/C++, Bash, Shell scripting, TCP/IP, Ethernet, InfiniBand, RDMA, RoCE, routing, switching, NVIDIA GPU systems, NVLink, NVSwitch, CUDA, PCIe, memory hierarchy, DMA, high-speed interconnects, distributed training, server management technologies, data center operations, cluster provisioning, fleet monitoring, CI/CD pipelines, diagnostics, automation, dashboards