Director, Engineering Operations and Site Reliability Engineering, Datacenter Server Systems

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA
Salary: $292,000–$442,750 / yr
Posted: 3 days ago
Nearby: 99+ roles within 25 mi

Market check

Salary context

Above market

How this pay compares to similar roles

Chart: midpoint pay for similar roles ($193k) versus this role ($367k), on a scale from $115k to $478k, with a shaded band marking where most similar roles pay.

This role pays more than 98% of similar roles. Most similar roles pay $155,900–$230,500 (the shaded band in the chart above). At the midpoint, this role pays about $367k versus about $193k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 928 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 916 roles with salary data.

At a glance

TL;DR · Director, Engineering Operations and Site Reliability Engineering, Datacenter Server Systems

NVIDIA seeks a seasoned technology leader for its Engineering Operations and Site Reliability Engineering team, focusing on the reliability and scalability of next-generation datacenter server systems. This technical leadership role involves ensuring internal rack-scale systems are healthy, observable, and highly available through execution excellence in fleet operations, incident response, and automation-first practices. The ideal candidate will lead teams to build automation, telemetry, alerting, and dashboards that enhance visibility and resolve issues efficiently while collaborating with hardware, firmware, software, networking, validation, and infrastructure teams. They must have a strong background in server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure, along with experience managing complex distributed systems and building cohesive technical teams. This role requires expertise in GPU, AI, HPC, cloud, or hyperscale datacenter infrastructure, as well as the ability to mentor and develop technical leaders who prioritize reliability and execution excellence.

Skills

Linux, Kubernetes, Docker, Terraform, AWS, CI/CD, Prometheus, Grafana, PostgreSQL, Python, Go, Rack-scale systems, High-speed networking, GPU, AI, HPC, Cloud infrastructure, Hyperscale datacenter, Server management, Networking, Storage, Power, Thermal, RAS concepts, Automation-first execution

What you'll do

Lead teams to ensure internal rack-scale server systems remain available, healthy, and reliable.
Drive execution for fleet operations, incident response, roadmap planning, and change management.
Build automation and telemetry systems to improve visibility and issue resolution speed.
Partner with hardware, firmware, software, networking, validation, and infrastructure teams on complex deployments.
Create feedback loops into NPI and sustaining teams to enhance product quality and development velocity.

What we're looking for

BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience).
12+ years of experience in infrastructure, systems engineering, reliability, datacenter operations, or distributed systems, including 7+ years of people management.
Strong understanding of server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure.
Proven track record of driving execution across multiple teams, with experience operating complex systems.
Experience building cohesive technical teams and developing leaders in reliability and automation-first practices.
Clear communication skills for executive-level reporting on operational health, risks, and priorities.

Similar roles

Senior Datacenter Technical Program Manager, At-Scale AI Clusters

Nvidia

Remote (Santa Clara, CA) · 25 days ago · $168,000–$258,750

Prometheus, Grafana, Splunk, Modbus, BACnet, Kubernetes, Terraform, AWS, PostgreSQL, CI/CD, Python, Docker, High-Performance Computing, GPU Clusters, Datacenter Design, Power and Cooling Technologies

Senior Director of Site Reliability Engineering

JPMorgan Chase

Palo Alto, CA · 6 days ago · $232,750–$325,000

AI, Python, Kubernetes, CI/CD, PostgreSQL, AWS, Docker, Prometheus, Grafana, DevOps, Scrum, Agile, Git, Linux, JSON/Web API

Site Reliability Engineer, HPC & Automation

SpaceX

Redmond, WA · 1 day ago · $125,000–$150,000

Python, Bash, Linux, Docker, Kubernetes, PostgreSQL, MySQL, Terraform, Ansible, Puppet, Grafana, Prometheus, Jenkins, Slurm, NFS, REST, NetApp ONTAP, Cadence, Synopsys, Ansys, Siemens, CI/CD

Site Reliability Engineer, Enterprise Technology Services

Apple Inc

Sunnyvale, CA · 65 days ago · $150,400–$277,600

Java, Python, Bash, Oracle, MongoDB, Prometheus, Splunk, Grafana, Linux, Git, CI/CD, Kubernetes, AWS, GCP, Nginx, Envoy, TLS, SSL, DNS, Load Balancers, Network Security, Cryptography

Site Reliability Engineer, Enterprise Technology Services

Apple Inc

Sunnyvale, CA · 49 days ago · $184,700–$277,600

Python, Java, Bash, Oracle, MongoDB, Prometheus, Splunk, Grafana, Linux, Git, CI/CD, Kubernetes, AWS, GCP, Nginx, Envoy, TLS, SSL, DNS, Load Balancers, WebMethods, NetScaler, SRE, SLA, SLO, SLI

Site Reliability Engineer, Enterprise Technology Services

Apple Inc

Sunnyvale, CA · 90 days ago · $216,200–$324,800

Java, Python, Go, Prometheus, Grafana, CI/CD, Kafka, RabbitMQ, Git, Helm, OpenTelemetry, Docker, AWS, Kubernetes, Terraform, ISO-27001, PCI, DevOps, MLOps
