Senior Site Reliability Engineer

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA
Salary: $148,000–$235,750 / yr
Posted: 7 days ago

Market check

Salary context

Competitive pay

How this pay compares to similar roles

Similar roles: about $170k
This role: about $192k
Chart scale: $128k to $247k, with a shaded band marking where most similar roles pay

This role pays more than 62% of similar roles. Most pay $139,831–$200,000 — the shaded band above. At the midpoint, this role pays about $192k versus about $170k for comparable roles.

Based on 239 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 980 open roles on FindRole.

Listed pay typically runs $168,000–$270,250 across 966 roles with salary data.

At a glance

TL;DR · Senior Site Reliability Engineer

NVIDIA seeks a Senior Site Reliability Engineer (SRE) to join its Infrastructure, Planning and Processes team, focusing on maintaining and enhancing the company’s internal Jenkins-based CI/CD pipeline for GPUs and Tegra systems. The role involves managing on-prem infrastructure across multiple data centers and ensuring high availability and reliability through robust monitoring and incident response protocols. Key responsibilities include deploying and configuring Kubernetes clusters, implementing logging and alerting solutions like Prometheus and Grafana, and driving automation to optimize system health and performance. The ideal candidate has extensive experience with cloud infrastructure, BMC interfaces, OpenStack, databases, networking principles, data analytics tools, and automation technologies such as Jenkins and Ansible. Knowledge of NVIDIA hardware and a strong background in Kubernetes, Docker, and virtualization are essential for this role, which demands exceptional problem-solving skills and the ability to thrive in a fast-paced environment.

What you'll do

  • Manage on-prem infrastructure to ensure uptime and reliability across multiple data centers.
  • Implement monitoring and alerting systems to maintain service level agreements (SLAs).
  • Deploy and manage applications on Kubernetes clusters with logging and monitoring solutions.
  • Participate in capacity planning and optimization efforts for efficient resource utilization.
  • Resolve user-reported issues, monitor alerts, and participate in incident response activities.
  • Drive automation of monitoring to gain deeper insights into application and system health.
  • Use AI techniques to extract useful signals from data generated by machines and jobs.

What we're looking for

  • Experience maintaining cloud infrastructure and highly available production environments.
  • Hands-on proficiency with BMC interfaces (Redfish), KVM, IPMI tools, and OpenStack architecture.
  • Solid understanding of networking principles and protocols, including TCP/IP, DNS, DHCP, VLANs.
  • Practical experience with data analytics and visualization tools like Kibana, Grafana, Splunk.
  • Proficiency in automation tools (Jenkins/Temporal) and configuration management tools (Ansible).
  • Advanced knowledge of Kubernetes, Docker, and virtualization technologies for production environments.
  • 5+ years of demonstrable experience in a relevant technical role.

More like this

Similar roles

Senior Site Reliability Engineer

Adobe

San Jose · 69 days ago · $208,300–$301,600
AWS Kubernetes Terraform Python Go CI/CD Infrastructure as Code Docker PostgreSQL Security hardening AI-enabled platforms Cross-team leadership Developer experience optimization

Senior Site Reliability Engineer

Carta

San Francisco, California +2 · 73 days ago · $181,688–$213,750
AWS Terraform Python Kubernetes Docker Postgres Prometheus Grafana CI/CD gRPC Ansible ELK Stack Datadog GraphQL
Hybrid

Senior Site Reliability Engineer

Oracle

Reston, VA +2 · 38 days ago
Oracle Linux Ansible Terraform Python Bash Prometheus Grafana GlusterFS Active Directory LDAP Kerberos CI/CD PostgreSQL Docker Kubernetes Git Jenkins

Senior Site Reliability Engineer

Oracle

Nashville, TN +1 · 33 days ago · $79,100–$158,200
AWS Azure GCP OCI Major Incident Management Agile Terraform Docker CI/CD RESTful APIs Jenkins Chef Ansible Prometheus Grafana Python Go

Senior Site Reliability Engineer

The Federal Reserve

Boston, MA · 20 days ago · $140,000–$210,900
AWS Terraform Python Docker EKS RDS Aurora S3 Route53 ELB IAM CloudWatch OpenSearch Grafana Prometheus CI/CD Kubernetes Ansible Linux Shell scripting EC2 EBS Observability

Senior Site Reliability Engineer

Anduril Industries

Costa Mesa, CA · 12 days ago · $166,000–$220,000
Linux Python Terraform Kubernetes Docker Ansible Networking Security CI/CD Monitoring Splunk AWS Azure GCP