Senior Site Reliability Engineer - Observability and Telemetry Platform

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $168,000–$270,250 / yr
Posted: 19 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $124k to $286k, marking similar roles at $178k and this role at $219k, with a shaded band where most similar roles pay.]

This role pays more than 80% of similar roles. Most pay $145,500–$210,100 — the shaded band above. At the midpoint, this role pays about $219k versus about $178k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior Site Reliability Engineer - Observability and Telemetry Platform

As a Senior Site Reliability Engineer at NVIDIA, you will join a specialized team responsible for ensuring the reliability and efficiency of large-scale GPU cloud services. Your day-to-day responsibilities include designing and implementing observability platforms with real-time monitoring, logging, and alerting capabilities, while also engaging in system design consulting and capacity management to support service launches. You will maintain these systems post-launch by continuously measuring performance and health metrics, scaling them sustainably through automation, and practicing blameless incident response. The role requires expertise in Linux, networking, containers, and experience with Kubernetes, OpenStack, Docker, Grafana, Prometheus, and OpenTelemetry. Ideal candidates have a background in computer science or related fields, 8+ years of infrastructure automation and distributed systems design experience, and strong problem-solving and communication skills to thrive in this dynamic environment focused on system reliability and performance optimization.

What you'll do

  • Design and implement large-scale observability and telemetry platforms focusing on real-time monitoring and alerting.
  • Engage in the full lifecycle of services from inception to deployment and refinement, ensuring high availability.
  • Maintain live systems by measuring performance metrics like latency and system health.
  • Scale production systems sustainably through automation and evolve them for improved reliability and efficiency.
  • Participate in an on-call rotation to support and respond to incidents in production environments.
  • Conduct blameless postmortems and use the findings to prevent future outages.

What we're looking for

  • 8+ years of experience in infrastructure automation and distributed systems design
  • 5+ years delivering foundational infrastructure and observability platforms
  • Proficiency in Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, networking, and containers
  • Experience with Kubernetes, OpenStack, Docker, Grafana, Prometheus, and OpenTelemetry
  • Ability to debug, optimize code, and automate routine tasks
  • Systematic problem-solving skills and strong communication abilities

More like this

Similar roles

Senior Site Reliability Engineer

Adobe

San Jose · 59 days ago · $208,300–$301,600
AWS, Kubernetes, Terraform, Python, Go, CI/CD, Infrastructure as Code, Docker, PostgreSQL, Security hardening, AI-enabled platforms, Cross-team leadership, Developer experience optimization

Senior Site Reliability Engineer

Carta

San Francisco, California · 63 days ago · $181,688–$213,750
AWS, Terraform, Python, Kubernetes, Docker, Postgres, Prometheus, Grafana, CI/CD, gRPC, Ansible, ELK Stack, Datadog, GraphQL
Hybrid

Senior Site Reliability Engineer

Oracle

Nashville, TN · 23 days ago · $79,100–$158,200
AWS, Azure, GCP, OCI, Major Incident Management, Agile, Terraform, Docker, CI/CD, RESTful APIs, Jenkins, Chef, Ansible, Prometheus, Grafana, Python, Go

Senior Site Reliability Engineer

Oracle

US · 22 days ago · $79,100–$158,200
Oracle Cloud Infrastructure, Kubernetes, Python, Go, Bash, CI/CD, Terraform, Prometheus, Grafana, Linux, Networking, Docker, SRE, Incident Response, SLIs/SLOs, Resilience Engineering, FedRAMP 3PAO