Senior Technical Program Manager, DGX Cloud Software Products and Services

Nvidia

Santa Clara, US Posted 25 days ago $168,000–$258,750 / year

At a glance

AI generated

TL;DR

NVIDIA's DGX Cloud team is seeking an experienced Technical Program Manager (IC5) to lead strategic programs that improve resilience, reliability, and operational scale for AI workloads. The role involves collaborating with cross-functional teams, including engineering, SRE, operations, and researchers, to develop scalable resilience strategies, improve service stability, and build modular software components. The TPM will drive adoption of resilience reference stacks and operational standards while improving observability and failure detection. Key responsibilities include defining metrics for goodput and fleet-wide performance and working with cloud infrastructure and distributed systems to ensure high-availability training environments at scale. Ideal candidates have an MS in EE or CS, 8+ years of program management experience on complex technical projects, strong analytical skills, and proficiency with tools such as Jira and Git. Experience with AI infrastructure, large-scale compute platforms, and distributed training frameworks is essential, as is a deep understanding of reliability engineering and service performance metrics.

Skills

Jira, Aha!, Confluence, Git, Distributed version control systems, Reliability engineering, Resilience development, Service performance metrics, Goodput, Efficiency, Utilization, Distributed training frameworks, Checkpointing, NCCL, Slurm, AI infrastructure, Large-scale compute platforms, CI/CD

What you'll do

  • Lead cross-functional programs to enhance resilience and reliability across DGX Cloud infrastructure.
  • Identify systemic risks and resolve dependencies to improve end-to-end service stability.
  • Drive adoption of resilience reference stacks and operational standards for service readiness.
  • Develop open, modular software components with engineering teams for scalable resilience.
  • Build tooling to improve observability and root cause analysis in failure scenarios.
  • Define metrics and dashboards to track program health and reliability posture continuously.
  • Improve fleet-wide goodput using data-driven insights to enhance customer outcomes at scale.

What we're looking for

  • 8+ years of program management experience in large-scale software or infrastructure projects
  • Proven track record leading complex cross-functional programs in cloud or platform environments
  • Strong analytical skills to assess issues across infrastructure, software, and operational layers
  • Solid understanding of reliability engineering, resilience development, and service performance metrics
  • Experience working with engineering, SRE, operations, and technical collaborators in ambiguous environments
  • Outstanding communication and presentation skills for diverse audiences with strong problem-solving abilities
  • Background in computer science, machine learning, deep learning, or GPU technology

Market check

Salary context

This $168,000–$258,750 range sits above 62% of similar postings on FindRole.

Peer median band

$140,270–$234,875

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$151,475–$235,750

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Manager, DGX Cloud Technical Program Management

Nvidia

Santa Clara, CA, US 25 days ago $240,000–$379,500
Grafana, Prometheus, Kubernetes, AWS, Azure, CI/CD, Docker, Python, PostgreSQL, Terraform, GitLab, Jenkins, Ansible, NVIDIA GPU, AI/ML platforms, observability, telemetry, cloud infrastructure, distributed systems, security, compliance

Senior Technical Program Manager, DGX Cloud - Trust Services

Nvidia

Santa Clara, CA, US 25 days ago $200,000–$322,000
Jira, Confluence, CI/CD, GPU Firmware Security, Confidential Computing, Device Trust, Hardware/Software Trust Models, Cloud Platforms, Automation, Telemetry, Dashboards, PostgreSQL, Kubernetes, AWS, Azure, Grafana, Prometheus

Senior Technical Program Manager, Cloud Infrastructure

Nvidia

Santa Clara, CA, US 21 days ago $200,000–$322,000
Jira, Kubernetes, Terraform, API integration, Python, CI/CD, Prometheus, Grafana, NVIDIA GPU products, AWS, Azure, Google Cloud Platform, PostgreSQL, Docker, Git, Scrum, Agile methodologies

Senior Technical Program Manager, Cloud Infrastructure

Nvidia

Santa Clara, CA, US 28 days ago $168,000–$258,750
Jira, Kubernetes, Terraform, API integration, CI/CD, NVIDIA GPU products, Cloud Service Providers, PostgreSQL, Python, Docker, AWS, Azure, Grafana, Prometheus, Scrum, DevOps