Senior Site Reliability Engineer - HPC

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA · Austin, TX · Durham, NC
Salary: $152,000–$241,500 / yr
Posted: 101 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: salary scale from $126k to $254k, with a shaded band marking where most similar roles pay. Similar roles: $183k. This role: $197k.]

This role pays more than 68% of similar roles. Most pay $149,920–$216,250 — the shaded band above. At the midpoint, this role pays about $197k versus about $183k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior Site Reliability Engineer - HPC

Join NVIDIA’s Compute Farm team as a Senior SRE to drive the development of next-generation global services platforms. You will own solutions end to end, integrate them with HPC schedulers and multi-cloud environments, and maintain high uptime through sound operational practices. You will use Infrastructure as Code (IaC) and configuration management to standardize automation across on-premises and cloud infrastructure, and collaborate with cross-functional teams to deliver projects efficiently. Required skills include experience supporting large-scale HPC clusters with Slurm, LSF, or Kubernetes, modern CI/CD techniques, and proficiency in Python, Go, Perl, or Ruby. A track record of mentoring engineers, contributing to technical documentation, and maintaining open-source components at scale is also expected. The role calls for expertise in monitoring tools, container management, and data-driven operations to address complex reliability challenges in a fast-paced environment.

What you'll do

  • Own SRE solutions from design to continuous improvement, integrating with HPC schedulers and network fabrics.
  • Use IaC and configuration management to standardize and automate provisioning across multi-cloud environments.
  • Design systems for failure with redundancy, progressive delivery, and strict change control measures.
  • Ensure high uptime and Quality of Service (QoS) through operational excellence and capacity planning.
  • Detect performance issues and recommend solutions to maintain world-class service quality.
  • Participate in on-call rotations, incident reviews, and root cause analysis to improve system reliability.

What we're looking for

  • B.S. in Computer Science (or equivalent experience), plus 5+ years of professional SRE experience.
  • Experience supporting large-scale HPC clusters using Slurm, LSF, or Kubernetes.
  • Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC) for service management.
  • Strong background in designing large-scale infrastructure platforms for automated host lifecycle management.
  • Expertise in monitoring tools, metrics, container management, and log collection systems.
  • 5+ years of coding/scripting experience in Python, Go, Perl, or Ruby.
  • A record of published technical write-ups or talks on reliability, observability, or HPC/SRE solutions.

More like this

Similar roles

Senior Site Reliability Engineer

Adobe

San Jose · 59 days ago · $208,300–$301,600
AWS Kubernetes Terraform Python Go CI/CD Infrastructure as Code Docker PostgreSQL Security hardening AI-enabled platforms Cross-team leadership Developer experience optimization

Senior Site Reliability Engineer

Carta

San Francisco, California · 63 days ago · $181,688–$213,750
AWS Terraform Python Kubernetes Docker Postgres Prometheus Grafana CI/CD gRPC Ansible ELK Stack Datadog GraphQL
Hybrid

Senior Site Reliability Engineer

Oracle

Nashville, TN · 23 days ago · $79,100–$158,200
AWS Azure GCP OCI Major Incident Management Agile Terraform Docker CI/CD RESTful APIs Jenkins Chef Ansible Prometheus Grafana Python Go

Senior Site Reliability Engineer

Oracle

US · 22 days ago · $79,100–$158,200
Oracle Cloud Infrastructure Kubernetes Python Go Bash CI/CD Terraform Prometheus Grafana Linux Networking Docker SRE Incident Response SLIs/SLOs Resilience Engineering FedRAMP 3PAO

Senior Site Reliability Engineer

The Federal Reserve

Boston, MA · 10 days ago · $140,000–$210,900
AWS Terraform Python Docker EKS RDS Aurora S3 Route53 ELB IAM CloudWatch OpenSearch Grafana Prometheus CI/CD Kubernetes Ansible Linux Shell scripting EC2 EBS Observability

Senior Site Reliability Engineer

Anduril Industries

Costa Mesa, CA · 2 days ago · $166,000–$220,000
Linux Python Terraform Kubernetes Docker Ansible Networking Security CI/CD Monitoring Splunk AWS Azure GCP