Engineering Manager, DGX Cloud Production Engineering

Nvidia

Remote Actively hiring Verified listing
Remote, US · CA · TX · WA Posted 11 days ago $224,000–$356,500 / year

At a glance

AI generated

TL;DR

As an Engineering Manager at NVIDIA DGX Cloud, you will lead a team responsible for building and operating GPU infrastructure across various environments. Your day-to-day responsibilities include driving execution in Kubernetes operability, automation, GitOps, observability, and incident response while partnering with cross-functional teams to enhance production readiness. You will define priorities, roadmap, staffing, and operational ownership, coaching engineers and fostering a culture of learning and improvement. Ideal candidates have 8+ years of industry experience, including 2+ years in leadership roles, with expertise in reliability engineering, automation, observability, and Kubernetes environments. Strong communication skills and the ability to influence without direct authority are crucial, as is a background in GPU infrastructure and multi-cloud operations.

Skills

Kubernetes GitOps CI/CD Docker Terraform AWS GCP Azure Prometheus Grafana Python Go Bash PostgreSQL Redis GitHub Jenkins Ansible Nagios Zabbix

What you'll do

  • Lead a team building and operating DGX Cloud infrastructure in various environments.
  • Drive execution for cluster operations, Kubernetes operability, automation, and observability.
  • Define team priorities, roadmap, staffing, and operational ownership.
  • Partner with cross-functional teams to enhance production readiness and reliability.
  • Build an on-call culture focused on learning, ownership, and durable incident fixes.

What we're looking for

  • 8+ years of industry experience, including experience leading engineering teams.
  • Proven track record in building and operating production infrastructure and Kubernetes environments.
  • Strong understanding of reliability engineering, automation, observability, and incident response.
  • Ability to influence cross-functional teams without direct authority.
  • Clear communication skills with a focus on prioritization and sound judgment.
  • BS/MS in Computer Science or equivalent experience required.
  • Experience leading SRE, production engineering, or platform teams preferred.

Market check

Salary context

This $224,000–$356,500 range sits above 93% of similar postings on FindRole.

Peer median band

$137,200–$225,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,400–$224,650

Middle half of comparable postings.

Based on 239 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Principal Software Engineer, DGX Cloud Production Engineering

Nvidia

Remote (US, CA, Santa Clara, US) 11 days ago $272,000–$431,250
Kubernetes Go Python GitOps Linux Docker Terraform CI/CD Prometheus Grafana PostgreSQL AWS Azure Google Cloud Platform GPU AI ML SLOs observability incident response automation BMaaS VMaaS

Senior Manager, DGX Cloud Technical Program Management

Nvidia

US, CA, Santa Clara, US 25 days ago $240,000–$379,500
Grafana Prometheus Kubernetes AWS Azure CI/CD Docker Python PostgreSQL Terraform GitLab Jenkins Ansible NVIDIA GPU AI/ML platforms observability telemetry cloud infrastructure distributed systems security compliance

Principal Software Engineer - DGX Cloud

Nvidia

US, CA, Santa Clara, US 30 days ago $272,000–$431,250
Python Kubernetes Go AWS Prometheus Grafana OpenTelemetry Docker CI/CD Java CUDA cuDNN

Manager, Cloud Engineering

The OCC

US 45 days ago $136,200–$188,200
Terraform Kubernetes Jenkins GitHub Puppet Chef Ansible AWS Azure Python Ruby Go Java Docker CI/CD PostgreSQL CIS NIST Agile Scrum

Senior Production Engineer - DGX Cloud

Nvidia

Remote (US, CA, Remote, US) 11 days ago $168,000–$270,250
Kubernetes Python Go Docker CI/CD Prometheus Grafana Terraform AWS Azure Slurm Bright_Cluster_Manager PostgreSQL Redis Git Jenkins Ansible Zabbix Nagios Fluentd