Senior Production Engineer - DGX Cloud

Nvidia

Remote, US | Posted 11 days ago | $168,000–$270,250 / year

At a glance

AI generated

TL;DR

NVIDIA seeks a Senior Production Engineer to join its DGX Cloud team and scale GPU clusters for AI workloads. The role involves developing custom software for GPU asset provisioning and lifecycle management across cloud providers, implementing monitoring systems for reliability and scalability, and collaborating with cross-functional teams to keep production AI clusters performing well. Ideal candidates have extensive experience in Production Engineering or DevOps roles, a strong background in programming languages such as Go or Python, and deep expertise in cluster management systems for large-scale distributed environments, such as Kubernetes and Slurm. The position requires a BS in Computer Science or a related field, 8+ years of relevant experience, and a proven ability to maintain reliable AI infrastructure at scale.

Skills

Kubernetes, Python, Go, Docker, CI/CD, Prometheus, Grafana, Terraform, AWS, Azure, Slurm, Bright Cluster Manager, PostgreSQL, Redis, Git, Jenkins, Ansible, Zabbix, Nagios, Fluentd

What you'll do

  • Design and implement custom software for GPU asset provisioning and lifecycle management across cloud providers.
  • Develop monitoring and health management capabilities to ensure high reliability and scalability of GPU clusters.
  • Evaluate system failures using a well-defined incident management process and improve services accordingly.
  • Work with cross-functional teams to maintain reliable and consistent performance in production AI clusters.
  • Automate large-scale distributed systems and operate cluster management tools such as Kubernetes and Slurm.

What we're looking for

  • Significant experience in Production Engineering/DevOps/SRE roles with large-scale systems.
  • Demonstrated ability to implement monitoring and health management for GPU clusters.
  • Strong background in managing and automating distributed systems across cloud providers.
  • Deep understanding of cluster management systems such as Kubernetes, Slurm, and Bright Cluster Manager.
  • Proven track record of operational excellence in maintaining reliable AI infrastructure.
  • BS in Computer Science, Engineering, Physics, Mathematics or equivalent technical degree.

Market check

Salary context

This $168,000–$270,250 range sits above 77% of similar postings on FindRole.

Peer median band

$135,500–$213,480

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,400–$217,725

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Full Stack Software Engineer - DGX Cloud

Nvidia

Remote (NC, US) | 11 days ago | $224,000–$356,500
React, TypeScript, JavaScript, Golang, PostgreSQL, Kubernetes, SQL, CI/CD, Bazel, Temporal, Slurm, Docker, Prometheus, Git, Linux, Python, GraphQL

Engineering Manager, DGX Cloud Production Engineering

Nvidia

Remote (CA, US) | 11 days ago | $224,000–$356,500
Kubernetes, GitOps, CI/CD, Docker, Terraform, AWS, GCP, Azure, Prometheus, Grafana, Python, Go, Bash, PostgreSQL, Redis, GitHub, Jenkins, Ansible, Nagios, Zabbix

Principal Software Engineer, DGX Cloud Production Engineering

Nvidia

Remote (Santa Clara, CA, US) | 11 days ago | $272,000–$431,250
Kubernetes, Go, Python, GitOps, Linux, Docker, Terraform, CI/CD, Prometheus, Grafana, PostgreSQL, AWS, Azure, Google Cloud Platform, GPU, AI, ML, SLOs, observability, incident response, automation, BMaaS, VMaaS

Senior Production Support Engineer

CIBC

IL-70 W Madison St, 10th Fl, US | 21 days ago | $115,000–$150,900
SAS Viya, PostgreSQL, Linux, Microsoft Azure, RHEL, SQL, Cloud Computing, ITIL, AutoSys, ServiceNow, CI/CD, Python, Azure CLI, Jira, Confluence