Principal Software Engineer, DGX Cloud Production Engineering

Nvidia

Remote, US · Santa Clara, CA · Posted 11 days ago · $272,000–$431,250 / year

At a glance

AI generated

TL;DR

NVIDIA’s DGX Cloud team seeks Principal Software Engineers to lead technical direction in Kubernetes-based operations, automation, and reliability for large-scale GPU clusters across internal and cloud partner environments. This senior role involves defining the architecture for cluster lifecycle management, validation, repair, upgrades, observability, and readiness, while establishing patterns for Kubernetes-based GPU cluster operations. Key responsibilities include reducing operational overhead through software and automation, setting technical standards for production readiness, mentoring engineers, and influencing cross-functional teams. Ideal candidates have over 15 years of experience in building and operating large-scale distributed systems or cloud infrastructure, with expertise in Kubernetes, Linux, Go, Python, and production operations. Experience with GPU clusters, AI/ML infrastructure, GitOps, and multi-cloud fleet operations is a plus.

Skills

Kubernetes, Go, Python, GitOps, Linux, Docker, Terraform, CI/CD, Prometheus, Grafana, PostgreSQL, AWS, Azure, Google Cloud Platform, GPU, AI, ML, SLOs, observability, incident response, automation, BMaaS, VMaaS

What you'll do

  • Define and execute technical strategy for DGX Cloud cluster operations.
  • Lead design and implementation of systems for GPU cluster lifecycle management.
  • Establish Kubernetes-based operational patterns across diverse environments.
  • Identify and automate processes to reduce operational overhead in large-scale clusters.
  • Set technical standards for production readiness, SLOs, and incident response.

What we're looking for

  • 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure.
  • Deep expertise with Kubernetes, Linux, infrastructure automation, and production operations.
  • Strong programming skills in Go, Python, or similar languages.
  • Proven leadership in complex cross-organizational technical initiatives.
  • Experience designing reliable systems with clear SLOs, observability, incident response, and automation.
  • BS/MS in Computer Science or equivalent practical experience.

Market check

Salary context

This $272,000–$431,250 range sits above 100% of similar postings on FindRole.

Peer median band

$138,060–$226,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$151,875–$215,850

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Principal Software Engineer - DGX Cloud

Nvidia

US, CA, Santa Clara, US · 30 days ago · $272,000–$431,250
Python, Kubernetes, Go, AWS, Prometheus, Grafana, OpenTelemetry, Docker, CI/CD, Java, CUDA, cuDNN

Engineering Manager, DGX Cloud Production Engineering

Nvidia

Remote (US, CA, Remote, US) · 11 days ago · $224,000–$356,500
Kubernetes, GitOps, CI/CD, Docker, Terraform, AWS, GCP, Azure, Prometheus, Grafana, Python, Go, Bash, PostgreSQL, Redis, GitHub, Jenkins, Ansible, Nagios, Zabbix

Principal Software Engineer - Compute Infrastructure

Nvidia

Remote (US, CA, Santa Clara, US) · 16 days ago · $248,000–$391,000
Kubernetes, OpenShift, Terraform, Go, Python, GitOps, ArgoCD, AWS, GCP, NFSv4, NVMe/TCP, Hyperconverged storage, CI/CD, Microservices, Self-service architecture, SLAs