Site Reliability Engineer - Hardware Infrastructure
Nvidia
At a glance
AI generated
As an SRE at NVIDIA, you will join a specialized team focused on maintaining high reliability and uptime for both internal and external GPU cloud services. Your day-to-day responsibilities include designing and implementing operational aspects of large-scale Kubernetes clusters, ensuring real-time monitoring, logging, and alerting systems are in place. You’ll engage in the entire lifecycle of services, from initial design to deployment and ongoing maintenance, supporting capacity management and launch reviews. Key tasks involve measuring system health, scaling sustainably through automation, and participating in on-call rotations for production support. The role demands expertise in Linux, networking, containers, and experience with Python, Go, Perl, or Ruby. You’ll work within a culture that values diversity, problem-solving, and continuous improvement, tackling complex challenges in large-scale cloud environments based on Kubernetes and OpenStack.
Market check
This $320,000–$488,750 range sits above 99% of similar postings on FindRole.
Peer median band
$122,550–$210,900
Median floor and ceiling across peers.
Typical midpoint (25–75%)
$142,400–$202,000
Middle half of comparable postings.
Based on 240 comparable postings.
* 240 is the maximum number of comparable postings sampled.
Employer
Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.
Industry: Semiconductors & AI Computing
Nvidia currently has 801 open roles on FindRole.
Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.