Director, Site Reliability and Software Engineering - DGX Cloud

Nvidia

Remote · Actively hiring

Remote, US · Santa Clara, CA · Posted 19 days ago · $320,000–$488,750 / year

At a glance

AI generated

TL;DR

As a Site Reliability and Software Engineering leader in NVIDIA's DGX Cloud Reliability organization, you will manage a team of engineers responsible for the software, automation, and operations of multi-colo distributed GPU cloud clusters. You will contribute to product strategy, grow your team, and drive operational excellence through scalable SDLC practices and modern methodologies. You will work closely with project management teams to drive technical projects and provide leadership in an innovative environment, with a focus on delivering reliable systems both internally and externally. The ideal candidate has over 12 years of engineering experience, including at least five years in leadership roles, with expertise in designing large-scale distributed systems and managing DevOps teams. Strong knowledge of Unix/Linux, containerization, virtualization, and cluster solutions is essential, along with the ability to influence cross-functional partners and mentor team members effectively.

Skills

Kubernetes Docker CI/CD Unix/Linux Python PostgreSQL AWS GCP Azure Prometheus Grafana Terraform GitLab Jenkins

What you'll do

Manage a team of Software and Site Reliability engineers, overseeing program development and code reviews.
Define and execute the team’s strategic roadmap for scalable SDLC practices in NVIDIA’s DGX Cloud Computing environment.
Lead technical projects and drive operational excellence in an innovative and fast-paced engineering setting.
Collaborate with project management teams to ensure high-quality product development processes.
Interact with key stakeholders to provide financial clarity on technical spend and support executive reporting initiatives.

What we're looking for

Over 12 years of engineering experience, including at least 5 years in leadership roles.
Bachelor's or Master's degree in Computer Science or equivalent practical experience.
Proven expertise in designing and implementing large-scale distributed systems.
Strong background in Unix/Linux environments, containers/virtualization, and cluster solutions.
Demonstrated ability to mentor and coach team members effectively.
Experience managing technical support/DevOps teams and delivering projects under tight deadlines.
Ability to influence cross-functional groups and establish strong relationships with various IT functional teams.

Market check

Salary context

This $320,000–$488,750 range sits above 99% of similar postings on FindRole.

Peer median band

$162,900–$257,250

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$170,000–$244,000

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

Similar roles

Director, Software Engineering (Site Reliability Engineering)

Affirm

Remote (US) 31 days ago $300,000–$360,000

Kubernetes Docker CI/CD Python PostgreSQL AWS Azure Google Cloud Platform Terraform Prometheus Grafana

Remote

Site Reliability Engineer - Data, Cloud & Developer Experience

Blackstone Inc

New York 601 Lex, US 105 days ago $140,000–$225,000

AWS Terraform Python Docker Grafana Prometheus CI/CD Kubernetes ECS EKS Puppet GitLab Splunk

Lead Director, Site Reliability Engineering - Client Experience

CVS Health

Remote (Richardson-1300 E Campbell Rd, US) 15 days ago $144,200–$288,400

Azure GCP Kubernetes CI/CD SLOs SLIs Terraform Docker Prometheus Grafana PostgreSQL Python Go AWS OpenShift AI‑Ops observability microservices APIs

Remote

Director, Site Reliability Engineering

McDonald’s Corporation

Chicago, Illinois, US 28 days ago $178,121–$222,651

AWS Azure GCP Site Reliability Engineering Agile Methodologies CI/CD Vendor Management Cloud Infrastructure PaaS IaaS Data Analytics Financial Forecasting Chargeback Management Global Vendor Relationships High-Performance Team Building

Distinguished Site Reliability Engineer - Cloud

Nvidia

Remote (WA, US) 12 days ago $320,000–$488,750

Kubernetes Python Go Docker Linux Networking Containers CI/CD Terraform AWS OpenStack Prometheus Grafana PostgreSQL GitOps

Remote

Principal Software Engineer - DGX Cloud

Nvidia

Santa Clara, CA, US 30 days ago $272,000–$431,250

Python Kubernetes Go AWS Prometheus Grafana OpenTelemetry Docker CI/CD Java CUDA cuDNN