Principal Site Reliability Engineer - Observability and Telemetry Platform

Nvidia

Remote, USA · Santa Clara, CA · Posted 11 days ago · $248,000–$396,750 / year

At a glance

AI generated

TL;DR

Join NVIDIA’s Site Reliability Engineering (SRE) team as a senior specialist responsible for ensuring the reliability and uptime of GPU cloud services. You will design, implement, and support large-scale observability and telemetry platforms, focusing on real-time monitoring, logging, and alerting. Your day-to-day involves engaging in all stages of service lifecycle management, from initial design to deployment and ongoing maintenance, while also practicing sustainable incident response and conducting blameless postmortems. Key skills include extensive experience with infrastructure automation, distributed systems, and cloud platforms like Kubernetes and OpenStack, along with proficiency in Python, Go, Perl, or Ruby, and deep knowledge of Linux, networking, and containers. This role demands a systematic problem-solving approach and the ability to automate routine tasks, contributing to the continuous improvement of production systems at scale.

Skills

Kubernetes, Python, Go, Docker, Grafana, OpenTelemetry, Prometheus, Linux, Networking, Containers, CI/CD, Terraform, AWS, Azure, Google Cloud Platform, PostgreSQL, MySQL, Ansible, SaltStack, Bash, Git, Jenkins

What you'll do

  • Design and implement large-scale Observability & Telemetry platforms focusing on real-time monitoring and alerting.
  • Engage in the full lifecycle of services from inception to deployment and refinement.
  • Maintain live services by measuring availability, latency, and system health.
  • Scale systems sustainably through automation and improve reliability and velocity.
  • Participate in on-call rotations to support production systems.
  • Practice sustainable incident response and conduct blameless postmortems.

What we're looking for

  • BS degree in Computer Science or a related technical field, with coding experience
  • 15+ years of infrastructure automation and distributed systems design experience
  • 8+ years delivering foundational infrastructure and observability platforms
  • Proficiency in Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, networking, and containers
  • Experience using Kubernetes, OpenStack, Docker, Grafana, Prometheus, and OpenTelemetry
  • Strong problem-solving skills and ability to debug/optimize code and automate tasks

Market check

Salary context

This $248,000–$396,750 range sits above 98% of similar postings on FindRole.

Peer median band

$127,666–$199,750

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$137,000–$196,750

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Site Reliability Engineer

Equifax

USA - Missouri - St. Louis - Lackland, US · 44 days ago
AWS, GCP, Terraform, Jenkins, Python, Bash, Docker, Kubernetes, CI/CD, Prometheus, PostgreSQL, Linux, Windows, Ansible, Chef

Principal Site Reliability Engineer, Infrastructure Observability

T. Rowe Price

Owings Mills, MD - Building 3, US · 71 days ago · $159,000–$272,000
AWS, Python, PostgreSQL, CI/CD, Prometheus, Grafana, Terraform, Ansible, New Relic, SolarWinds DPA, Elastic Stack, Splunk, DevOps, SRE, Chaos Engineering, SQL Server, Node.js, .Net Core, Java, Go

Principal Site Reliability Engineer

The Walt Disney Company

Remote (USA - FL - Disney's Hollywood Studios - Feature Animation Building, US) · 49 days ago
AWS, Azure, GCP, Terraform, CloudFormation, Ansible, Chef, CI/CD, Docker, Kubernetes, Prometheus, Grafana, Python, Linux, Windows, AI, LLM, PCI, DevOps, SRE, SLI, SLO, SLA

Principal Site Reliability Engineer

The Walt Disney Company

Remote (USA - FL - Disney's Hollywood Studios - Feature Animation Building, US) · 42 days ago
Akamai Kona Site Defender, WAF, Bot Manager, DevOps, CI/CD, Python, Go, Docker, Terraform, AWS, Azure, Google Cloud, PostgreSQL, MongoDB, Redis, Prometheus, Grafana, Kubernetes, Ansible, Jenkins, GitLab, GitHub

Sr Principal Site Reliability Engineer

The Walt Disney Company

Remote (USA - CA - Market St, US) · 52 days ago · $250,500–$335,900
Kubernetes, AWS, CI/CD, Docker, Prometheus, Grafana, Python, PostgreSQL, Terraform, Ansible, GitOps, CDN integration, media streaming technologies, content delivery strategies