Senior Site Reliability Engineer - Observability and Telemetry Platform

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $168,000–$270,250 / yr
Posted: 19 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

Midpoint pay: similar roles $178k, this role $219k (chart scale $124k to $286k).

This role pays more than 80% of similar roles, most of which pay $145,500–$210,100. At the midpoint, this role pays about $219k versus about $178k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units that power gaming, professional visualization, data center, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior Site Reliability Engineer - Observability and Telemetry Platform

As a Senior Site Reliability Engineer at NVIDIA, you will join a specialized team responsible for ensuring the reliability and efficiency of large-scale GPU cloud services. Your day-to-day responsibilities include designing and implementing observability platforms with real-time monitoring, logging, and alerting capabilities, while also engaging in system design consulting and capacity management to support service launches. You will maintain these systems post-launch by continuously measuring performance and health metrics, scaling them sustainably through automation, and practicing blameless incident response. The role requires expertise in Linux, networking, containers, and experience with Kubernetes, OpenStack, Docker, Grafana, Prometheus, and OpenTelemetry. Ideal candidates have a background in computer science or related fields, 8+ years of infrastructure automation and distributed systems design experience, and strong problem-solving and communication skills to thrive in this dynamic environment focused on system reliability and performance optimization.

Skills

Kubernetes, Python, Go, Docker, Prometheus, Grafana, OpenTelemetry, Linux, Networking, Containers, CI/CD, Terraform, AWS, Azure, Git, Ansible, PostgreSQL, Redis, Zabbix, Nginx

What you'll do

Design and implement large-scale observability and telemetry platforms focusing on real-time monitoring and alerting.
Engage in the full lifecycle of services from inception to deployment and refinement, ensuring high availability.
Maintain live systems by measuring performance metrics like latency and system health.
Scale production systems sustainably through automation and evolve them for improved reliability and efficiency.
Participate in an on-call rotation to support and respond to incidents in production environments.
Conduct blameless postmortems to identify and mitigate potential outages proactively.

What we're looking for

8+ years of experience in infrastructure automation and distributed systems design
5+ years delivering foundational infrastructure and observability platforms
Proficiency in Python, Go, Perl, or Ruby
In-depth knowledge of Linux, networking, and containers
Experience with Kubernetes, OpenStack, Docker, Grafana, Prometheus, and OpenTelemetry
Ability to debug, optimize code, and automate routine tasks
Systematic problem-solving skills and strong communication abilities

Similar roles

Principal Site Reliability Engineer - Observability and Telemetry Platform

Nvidia

Remote (Santa Clara, CA) · 14 days ago · $248,000–$396,750

Kubernetes, Python, Go, Docker, Grafana, OpenTelemetry, Prometheus, Linux, Networking, Containers, CI/CD, Terraform, AWS, Azure, Google Cloud Platform, PostgreSQL, MySQL, Ansible, SaltStack, Bash, Git, Jenkins

Remote

Senior Site Reliability Engineer

Adobe

San Jose · 59 days ago · $208,300–$301,600

AWS, Kubernetes, Terraform, Python, Go, CI/CD, Infrastructure as Code, Docker, PostgreSQL, Security hardening, AI-enabled platforms, Cross-team leadership, Developer experience optimization

Site Reliability Engineer - Hardware Infrastructure

Nvidia

Santa Clara, CA · 8 days ago · $184,000–$287,500

SRE, DevOps, Python, Go, Perl, Ruby, Prometheus, Grafana, CI/CD, Kubernetes, AWS, Terraform, Docker, LLM, Generative AI, Agentic solutions

Senior Site Reliability Engineer

Carta

San Francisco, California · 63 days ago · $181,688–$213,750

AWS, Terraform, Python, Kubernetes, Docker, Postgres, Prometheus, Grafana, CI/CD, gRPC, Ansible, ELK Stack, Datadog, GraphQL

Hybrid

Senior Site Reliability Engineer

Oracle

Nashville, TN · 23 days ago · $79,100–$158,200

AWS, Azure, GCP, OCI, Major Incident Management, Agile, Terraform, Docker, CI/CD, RESTful APIs, Jenkins, Chef, Ansible, Prometheus, Grafana, Python, Go

Senior Site Reliability Engineer

Oracle

US · 22 days ago · $79,100–$158,200

Oracle Cloud Infrastructure, Kubernetes, Python, Go, Bash, CI/CD, Terraform, Prometheus, Grafana, Linux, Networking, Docker, SRE, Incident Response, SLIs/SLOs, Resilience Engineering, FedRAMP 3PAO
