Principal Site Reliability Engineer - Observability and Telemetry Platform
Nvidia
Quick summary
Market check
How this pay compares to similar roles
This role pays more than 80% of similar roles. Most comparable roles pay $145,500–$210,100. At the midpoint, this role pays about $219k versus about $178k for comparable roles.
Based on 240 similar postings.
Employer
Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.
Industry: Semiconductors & AI Computing
Nvidia currently has 563 open roles on FindRole.
Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.
At a glance
As a Senior Site Reliability Engineer at NVIDIA, you will join a specialized team responsible for the reliability and efficiency of large-scale GPU cloud services. Day to day, you will design and implement observability platforms with real-time monitoring, logging, and alerting, and you will support service launches through system design consulting and capacity management. After launch, you will maintain these systems by continuously measuring performance and health metrics, scaling them sustainably through automation, and practicing blameless incident response. The role requires expertise in Linux, networking, and containers, plus experience with Kubernetes, OpenStack, Docker, Grafana, Prometheus, and OpenTelemetry. Ideal candidates have a background in computer science or a related field, 8+ years of experience in infrastructure automation and distributed systems design, and strong problem-solving and communication skills.