Senior Site Reliability Engineer, AIOps

Nvidia

Actively hiring
Santa Clara, US Posted 17 days ago $148,000–$235,750 / year

At a glance

AI generated

TL;DR

As a Senior Site Reliability Engineer on our AI Data Center AIOps platform team, you will ensure the reliability and performance of our telemetry ingestion, processing, storage, and API/dashboard layers. Your daily tasks include monitoring platform health through dashboards and logs, automating checks for Kubernetes deployments, leading incident triage, and maintaining runbooks to improve automation. You'll work closely with Software Engineering and Systems Engineering teams to translate signals into actionable alerts and drive continuous improvement. Ideal candidates have a BS/MS in CS/CE or equivalent experience, 5+ years operating production distributed systems as an SRE/DevOps engineer, deep Kubernetes expertise, and solid scripting skills in Python/Bash. You should also be proficient with CI/CD, infrastructure-as-code (Terraform + Helm), and observability tools like Prometheus and Grafana, and have a strong background in Linux, networking, and distributed systems operations.

Skills

Kubernetes, Terraform, Python, Helm, CI/CD, Docker, Prometheus, Grafana, Bash, Linux, Networking, Apache Kafka, Pulsar, Flink, Spark, ClickHouse, Elasticsearch, TimescaleDB, AWS, Azure, Google Cloud Platform

What you'll do

  • Monitor platform health through dashboards/logs/metrics and automate checks to ensure reliability and efficiency.
  • Manage Kubernetes deployments including runbooks, canary checks, and post-deploy validation with end-to-end ownership.
  • Lead first-level incident triage by collecting diagnostics and identifying root causes for clear handoff to engineering teams.
  • Build and maintain runbooks/SOPs/checklists while pushing continuous improvement through automation practices.
  • Ensure deployment infrastructure scalability and consistency using Helm and Terraform/IaC for reproducible environments.

What we're looking for

  • BS/MS in CS/CE or equivalent experience, plus 5+ years operating production distributed systems as an SRE/DevOps engineer.
  • Proven ownership of reliability for observability/AIOps platforms including SLOs/SLIs and incident response.
  • Deep Kubernetes and container deployment, debugging, and scaling expertise for telemetry-heavy microservices.
  • Automation-first approach with scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform + Helm).
  • Strong Linux fundamentals, networking skills, and experience in distributed systems operations.
  • Experience building safe automation including canary releases and automated rollback criteria.
  • Proven programming experience in Python or similar languages for operational tools and services.

Market check

Salary context

This $148,000–$235,750 range sits above 70% of similar postings on FindRole.

Peer median band

$124,563–$198,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$135,000–$200,096

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Principal AIOps Engineer

CVS Health

Remote (Hartford-Farmington Ave Atrium, US) 29 days ago $144,200–$288,400
Python ServiceNow Prometheus Grafana OpenTelemetry ELK Splunk Datadog REST webhooks event pipelines runbook automation machine learning LLM-based approaches anomaly detection time-series analysis correlation approaches agentic AI frameworks Linux networking fundamentals TCP/IP DNS TLS load balancing

Senior Site Reliability Engineer

Adobe

San Jose, US 51 days ago $208,300–$301,600
AWS Kubernetes Terraform Python Go CI/CD Infrastructure as Code Docker PostgreSQL Security hardening AI-enabled platforms Cross-team leadership Developer experience optimization

Senior Site Reliability Engineer

CoStar Group

US 11 days ago
AWS Kubernetes Docker Terraform CloudFormation Python Java C# NodeJS Bash PCI compliance REST API Microservices CDN PostgreSQL MySQL Azure Google Cloud CI/CD

Senior Site Reliability Engineer

The Federal Reserve

Boston, MA, US 45 days ago $140,000–$210,900
AWS Terraform Python Go Docker CI/CD Kubernetes EKS RDS Aurora S3 Route53 ELB IAM Consul Vault Ansible Linux Shell Scripting CloudWatch OpenSearch Grafana Prometheus

Senior Site Reliability Engineer

Carta

Seattle, Washington, US 55 days ago $181,688–$213,750
AWS Terraform Python Kubernetes Docker Postgres Prometheus Grafana CI/CD gRPC Ansible ELK Stack Datadog GraphQL

Senior Site Reliability Engineer

Oracle

Reston, Virginia, US 20 days ago
Oracle Linux Ansible Terraform Python Bash Prometheus Grafana Kubernetes CI/CD Git Active Directory LDAP Kerberos GlusterFS PostgreSQL Docker AWS Azure Google Cloud Platform Nginx Apache HTTP Server