Distinguished Site Reliability Engineer - Cloud

Nvidia

Remote, USA Posted 12 days ago $320,000–$488,750 / year

At a glance

AI generated

TL;DR

As an SRE at NVIDIA, you will join a specialized team focused on maintaining high reliability and uptime for both internal and external GPU cloud services. Your day-to-day responsibilities include designing and implementing operational aspects of large-scale Kubernetes clusters, ensuring real-time monitoring, logging, and alerting systems are in place. You’ll engage in the entire lifecycle of services, from initial design to deployment and ongoing maintenance, supporting capacity management and launch reviews. Key tasks involve measuring system health, scaling sustainably through automation, and participating in on-call rotations for production support. The role demands expertise in Linux, networking, containers, and experience with Python, Go, Perl, or Ruby. You’ll work within a culture that values diversity, problem-solving, and continuous improvement, tackling complex challenges in large-scale cloud environments based on Kubernetes and OpenStack.

Skills

Kubernetes Python Go Docker Linux Networking Containers CI/CD Terraform AWS OpenStack Prometheus Grafana PostgreSQL GitOps

What you'll do

Lead the design and implementation of operational aspects for large-scale Kubernetes clusters.
Improve service lifecycle from inception through deployment, operation, and refinement.
Maintain live services by monitoring availability, latency, and system health metrics.
Scale systems sustainably using automation to enhance reliability and velocity.
Participate in on-call rotations to support production systems and conduct blameless postmortems.

What we're looking for

16+ years of experience in infrastructure automation and distributed systems design.
BS degree in Computer Science or related technical field with coding emphasis.
Proficiency in Python, Go, Perl, or Ruby for system development.
In-depth knowledge of Linux, networking, and container technologies.
Experience with Kubernetes, OpenStack, and Docker in large-scale cloud environments.
Ability to debug, optimize code, and automate routine tasks efficiently.
Systematic problem-solving skills and strong communication abilities.

Market check

Salary context

This $320,000–$488,750 range sits above 99% of similar postings on FindRole.

Peer median band

$122,550–$210,900

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,400–$202,000

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

Similar roles

Site Reliability Engineer - Hardware Infrastructure

Nvidia

Santa Clara, CA, US 21 days ago $184,000–$287,500

SRE DevOps Python Go Perl Ruby Prometheus Grafana CI/CD Kubernetes AWS Terraform Docker LLM Generative AI Agentic solutions

Site Reliability Engineer

Equifax

USA - Missouri - St. Louis - Lackland, US 44 days ago

AWS GCP Terraform Jenkins Python Bash Docker Kubernetes CI/CD Prometheus PostgreSQL Linux Windows Ansible Chef

Site Reliability Engineer - Data, Cloud & Developer Experience

Blackstone Inc

New York 601 Lex, US 105 days ago $140,000–$225,000

AWS Terraform Python Docker Grafana Prometheus CI/CD Kubernetes ECS EKS Puppet Gitlab Splunk

Site Reliability Engineer (SRE) - AI Platform & Cloud

Morgan Stanley

Alpharetta GA 1 Edison, US 38 days ago

Kubernetes AWS Azure Google Cloud Python Docker Terraform Prometheus Grafana REST framework API Gateway CI/CD PostgreSQL Redis Snowflake OpenTelemetry Slurm ModelOps MLOps LLM Op Chaos Engineering

Site Reliability Engineer, Teamcenter, Enterprise Technology Services

Apple Inc

Austin, Texas, US 9 days ago

Python Go Java Splunk Oracle Cassandra SOLR Kafka Linux TLS SSL DNS Load_Balancers Docker Jenkins CI/CD Teamcenter Siemens_Teamcenter

Cloud Infrastructure Engineer

Clover Health

Remote (US) 29 days ago $115,000–$175,000

AWS Kubernetes Docker IaC CI/CD Python Go Shell Script GCP Prometheus Grafana IAM Terraform