Senior Site Reliability Engineer - HPC

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA · Austin, TX · Durham, NC
Salary: $152,000–$241,500 / yr
Posted: 101 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

Similar $183k

This role $197k

Chart range: $126k to $254k (shaded band: where most similar roles pay)

This role pays more than 68% of similar roles. Most pay $149,920–$216,250 — the shaded band above. At the midpoint, this role pays about $197k versus about $183k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior Site Reliability Engineer - HPC

Join NVIDIA’s Compute Farm team as a Senior SRE to drive the development of next-generation global services platforms. You will own end-to-end solutions, integrate them with HPC schedulers and multi-cloud environments, and maintain high uptime through robust operational practices. You will use IaC and configuration management to standardize automation across on-premises and cloud infrastructure, and collaborate closely with cross-functional teams to deliver projects efficiently. Essential skills include experience with large-scale HPC clusters managed with Slurm or Kubernetes, modern CI/CD techniques, and proficiency in Python, Go, Perl, or Ruby. You should also have a track record of mentoring engineers, contributing to technical documentation, and maintaining open-source components at scale. The role demands expertise in monitoring tools, container management, and data-driven operations to tackle complex reliability challenges in a fast-paced environment.

Skills

AWS GCP OCI Kubernetes Slurm LSF CI/CD Terraform Python Go Perl Ruby Prometheus Grafana Docker Ansible GitOps AIOps PostgreSQL MySQL

What you'll do

Own SRE solutions from design to continuous improvement, integrating with HPC schedulers and network fabrics.
Use IaC and configuration management to standardize and automate provisioning across multi-cloud environments.
Design systems for failure with redundancy, progressive delivery, and strict change control measures.
Ensure high uptime and Quality of Service (QoS) through operational excellence and capacity planning.
Detect performance issues and recommend solutions to maintain world-class service quality.
Participate in on-call rotations, incident reviews, and root cause analysis to improve system reliability.

What we're looking for

B.S. degree in Computer Science (or equivalent experience) and 5+ years of professional SRE experience.
Experience supporting large-scale HPC clusters using Slurm, LSF, or Kubernetes.
Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC) for service management.
Strong background in designing large-scale infrastructure platforms for automated host lifecycle management.
Expertise in monitoring tools, metrics, container management, and log collection systems.
5+ years of coding/scripting experience in Python, Go, Perl, or Ruby.
A record of published technical write-ups or talks on reliability, observability, or HPC/SRE solutions.

Similar roles

Senior Site Reliability Engineer

Adobe

San Jose · 59 days ago · $208,300–$301,600

AWS Kubernetes Terraform Python Go CI/CD Infrastructure as Code Docker PostgreSQL Security hardening AI-enabled platforms Cross-team leadership Developer experience optimization

Senior Site Reliability Engineer

Carta

San Francisco, California · 63 days ago · $181,688–$213,750

AWS Terraform Python Kubernetes Docker Postgres Prometheus Grafana CI/CD gRPC Ansible ELK Stack Datadog GraphQL

Hybrid

Senior Site Reliability Engineer

Oracle

Nashville, TN · 23 days ago · $79,100–$158,200

AWS Azure GCP OCI Major Incident Management Agile Terraform Docker CI/CD RESTful APIs Jenkins Chef Ansible Prometheus Grafana Python Go

Senior Site Reliability Engineer

Oracle

US · 22 days ago · $79,100–$158,200

Oracle Cloud Infrastructure Kubernetes Python Go Bash CI/CD Terraform Prometheus Grafana Linux Networking Docker SRE Incident Response SLIs/SLOs Resilience Engineering FedRAMP 3PAO

Senior Site Reliability Engineer

The Federal Reserve

Boston, MA · 10 days ago · $140,000–$210,900

AWS Terraform Python Docker EKS RDS Aurora S3 Route53 ELB IAM CloudWatch OpenSearch Grafana Prometheus CI/CD Kubernetes Ansible Linux Shell scripting EC2 EBS Observability

Senior Site Reliability Engineer

Anduril Industries

Costa Mesa, CA · 2 days ago · $166,000–$220,000

Linux Python Terraform Kubernetes Docker Ansible Networking Security CI/CD Monitoring Splunk AWS Azure GCP
