Principal Software Engineer, E2E Performance and Goodput

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA; Austin, TX; OR
Salary: $272,000–$431,250 / yr
Posted: 3 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

Pay scale for similar roles: $115k to $465k. Similar roles: $203k. This role: $352k.

This role pays more than 99% of similar roles, most of which pay $182,200–$223,700. At the midpoint, this role pays about $352k versus about $203k for comparable roles.

Based on 239 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 966 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 955 roles with salary data.

At a glance

TL;DR · Principal Software Engineer, E2E Performance and Goodput

The Principal Engineer role on the CSP Engagements team focuses on ensuring that key cloud service providers (CSPs) achieve their performance targets on NVIDIA platforms. This senior-level position drives performance characterization work streams with customer engineering teams to build a shared understanding of platform capabilities and to gather workload-specific feedback for NVIDIA’s optimization efforts. Day-to-day work includes keeping open-source performance tools up to date, collaborating on internal and external tooling improvements, conducting cross-CSP performance analysis, and defining test strategies for validation. Ideal candidates have more than 15 years of experience in systems performance engineering, with expertise in GPU profiling, distributed training dynamics, statistical methods, and data visualization using Python and related tools. They should also have a deep understanding of the software stack’s impact on performance and be skilled at communicating technical findings to both engineers and executives.

What you'll do

  • Drive performance characterization work streams with CSP engineering teams.
  • Gather and synthesize CSP feedback to identify gaps in expected throughput.
  • Ensure open-source performance tools are updated for NVIDIA's latest systems.
  • Work with CSPs to update their tooling to reflect new GPU capabilities.
  • Conduct cross-CSP performance analysis to identify systemic improvement patterns.
  • Define test strategies and tooling requirements for performance validation.

What we're looking for

  • 15+ years of experience in GPU/HPC/ML infrastructure performance engineering.
  • Proficient in GPU workload profiling tools and techniques.
  • Expertise in distributed training performance optimization at large scale.
  • Strong command of statistical methods for analyzing and visualizing performance data.
  • Deep understanding of the software stack's impact on system performance.
  • Experience with NVIDIA platforms, including DGX and HGX systems.
  • Ability to influence multiple engineering teams to prioritize performance improvements.

More like this

Similar roles

Principal Software Engineer, CSP Engagements

Nvidia

Santa Clara, CA · 4 days ago · $272,000–$431,250
Linux CUDA ARM x86 CXL Out-of-Band Management In-band Management System Software Design Performance Analysis Complex System-Level Debugging Test Design Linux Kernel Device Drivers Memory Fabric GPU Computing Deep Learning Workloads

Principal Software Engineer, At-Scale Reliability and Fleet Intelligence

Nvidia

Santa Clara, CA · 3 days ago · $272,000–$431,250
Pareto Weibull time-series databases anomaly detection health scoring event correlation NVIDIA GPU error taxonomy Xid errors NVLink error counters thermal events CPER records predictive failure models fleet reliability MTBF MTBI burn-in testing stress testing certification frameworks hardware health telemetry pipelines

Principal Software Engineer, Performance Tooling

Microsoft

Redmond, WA +1 · 9 days ago · $165,600–$296,400
Python C++ PyTorch TensorFlow ONNX Runtime CUDA ROCm Triton Distributed Systems GPU Architecture HPC LLMs Profiling Tools Tracing Tools Observability Tools CI/CD

Principal Software Engineer, Rack-Scale System Software

Nvidia

Remote (Santa Clara, CA) +1 · 3 days ago · $272,000–$431,250
NVIDIA NVSwitch, GPU fabric management, cluster management, fleet orchestration, health monitoring APIs, telemetry systems, firmware update orchestration, error handling frameworks, system level orchestration, fabric management software, multi component coordination, distributed systems error handling, health scoring, event correlation, API design, GPU drivers, device management, power management, NVIDIA NVOS, CI/CD

Principal Software Engineer, Performance

Microsoft

Mountain View, CA · 19 days ago · $142,800–$274,800
Python, C++, CUDA, ROCm, PyTorch, TensorFlow, ONNX Runtime, NVIDIA GPUs, AMD GPUs, Maia silicon, Performance Benchmarking, GPU Profiling Tools, CI/CD, Azure, Linux