Director, Engineering Operations and Site Reliability Engineering, Datacenter Server Systems

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA
Salary: $292,000–$442,750 / yr
Posted: 3 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $115k to $478k, with similar roles marked at $193k and this role at $367k; a shaded band marks where most similar roles pay.]

This role pays more than 98% of similar roles. Most pay $155,900–$230,500 — the shaded band above. At the midpoint, this role pays about $367k versus about $193k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 928 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 916 roles with salary data.

At a glance

TL;DR · Director, Engineering Operations and Site Reliability Engineering, Datacenter Server Systems

NVIDIA seeks a seasoned technology leader for its Engineering Operations and Site Reliability Engineering team, focusing on the reliability and scalability of next-generation datacenter server systems. This technical leadership role involves ensuring internal rack-scale systems are healthy, observable, and highly available through execution excellence in fleet operations, incident response, and automation-first practices. The ideal candidate will lead teams to build automation, telemetry, alerting, and dashboards that enhance visibility and resolve issues efficiently while collaborating with hardware, firmware, software, networking, validation, and infrastructure teams. They must have a strong background in server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure, along with experience managing complex distributed systems and building cohesive technical teams. This role requires expertise in GPU, AI, HPC, cloud, or hyperscale datacenter infrastructure, as well as the ability to mentor and develop technical leaders who prioritize reliability and execution excellence.

What you'll do

  • Lead teams to ensure internal rack-scale server systems remain available, healthy, and reliable.
  • Drive execution for fleet operations, incident response, roadmap planning, and change management.
  • Build automation and telemetry systems to improve visibility and issue resolution speed.
  • Partner with hardware, firmware, software, networking, validation, and infrastructure teams on complex deployments.
  • Create feedback loops into NPI (new product introduction) and sustaining teams to enhance product quality and development velocity.

What we're looking for

  • BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience).
  • 12+ years of experience in infrastructure, systems engineering, reliability, datacenter operations, or distributed systems, including 7+ years of people management.
  • Strong understanding of server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure.
  • Proven track record of driving execution across multiple teams, with experience operating complex systems.
  • Experience building cohesive technical teams and developing leaders in reliability and automation-first practices.
  • Clear communication skills for executive-level reporting on operational health, risks, and priorities.

More like this

Similar roles

Site Reliability Engineer, HPC & Automation

SpaceX

Redmond, WA · 1 day ago · $125,000–$150,000
Python, Bash, Linux, Docker, Kubernetes, PostgreSQL, MySQL, Terraform, Ansible, Puppet, Grafana, Prometheus, Jenkins, Slurm, NFS, REST, NetApp ONTAP, Cadence, Synopsys, Ansys, Siemens, CI/CD