Site Reliability Engineer - Hardware Infrastructure

Nvidia

Actively hiring
Santa Clara, US Posted 21 days ago $184,000–$287,500 / year

At a glance

AI generated

TL;DR

At NVIDIA, the Site Reliability Engineer role involves collaborating within a supportive team to maintain large-scale production systems, focusing on incident management, root cause analysis, and developing reliable monitoring solutions. You will define metrics for system reliability, automate routine tasks with Generative AI, and guide teams in implementing efficient operational standards. The ideal candidate has over 8 years of experience in SRE or DevOps, a strong grasp of SRE principles, and hands-on experience with automation tools and observability platforms like Prometheus and Grafana. Additionally, proficiency in Python, Go, Perl, or Ruby is required, along with the ability to communicate technical concepts effectively across diverse teams.

Skills

SRE, DevOps, Python, Go, Perl, Ruby, Prometheus, Grafana, CI/CD, Kubernetes, AWS, Terraform, Docker, LLM, Generative AI, Agentic solutions

What you'll do

  • Develop and support guidelines for incident management and blameless postmortems.
  • Assist teams in responding to severe incidents by driving root cause analysis and corrective actions.
  • Define reliability metrics, Service Level Objectives, and error budgets for efficient system function.
  • Drive adoption of customer-centric monitoring and alerting systems to enhance service quality.
  • Apply automation and Generative AI solutions to minimize manual tasks and boost support efficiency.

What we're looking for

  • 8+ years of experience in SRE, DevOps, or Production Engineering.
  • Strong understanding of SRE principles including incident management, error budgets, SLOs, and SLAs.
  • Experience crafting and deploying fault-tolerant, performant, and supportable systems.
  • Hands-on experience with observability platforms like Prometheus and Grafana.
  • Expertise in automation and Generative AI/Agentic solutions for minimizing manual tasks.
  • Background in running critical services in production with tight SLAs.

Market check

Salary context

This $184,000–$287,500 range sits above 89% of similar postings on FindRole.

Peer median band

$127,332–$200,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$135,625–$203,300

Middle half of comparable postings.

Based on 239 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Site Reliability Engineer

Booz Allen Hamilton

Herndon, Virginia, US 32 days ago $86,800–$198,000
Java, Spring Boot, CI/CD, Agile, Bitbucket, GitLab, Kubernetes, NiFi, Kafka, MongoDB, Elasticsearch, ArgoCD

Site Reliability Engineer III

CME Group

Chicago - 20 S. Wacker, US 115 days ago $100,700–$167,800
GCP, Docker, Kubernetes, Python, Java, Oracle, Postgres, BigQuery, SLO, SLI, SLA, OpenTelemetry, Splunk, Prometheus, Grafana, CI/CD, Bamboo, JIRA, Git

Site Reliability Engineer

The Walt Disney Company

Remote (USA - FL - Disney's Hollywood Studios - Feature Animation Building, US) 50 days ago
Akamai, Splunk, AppDynamics, GitHub, Ansible, Chef, AWS, Azure, GCP, CI/CD, RESTful APIs, Microservices, Cloud computing, Python, JavaScript, Kubernetes, Terraform, Prometheus, Grafana

Site Reliability Engineer

Equifax

USA - Missouri - St. Louis - Lackland, US 44 days ago
AWS, GCP, Terraform, Jenkins, Python, Bash, Docker, Kubernetes, CI/CD, Prometheus, PostgreSQL, Linux, Windows, Ansible, Chef

Site Reliability Engineer

Shopify

US 28 days ago
Kubernetes, Docker, CI/CD, Python, Go, PostgreSQL, AWS, GCP, Prometheus, Grafana, Terraform, GitOps