Senior Site Reliability Engineer, GeForce NOW

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $168,000–$270,250 / yr
Posted: 4 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Pay comparison chart: scale from $103k to $288k, with a shaded band where most similar roles pay; markers at about $170k (similar roles) and about $219k (this role)]

This role pays more than 81% of similar roles. Most similar roles pay $139,100–$200,727 (the shaded band above). At the midpoint, this role pays about $219k, versus about $170k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 855 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 843 roles with salary data.

At a glance

TL;DR · Senior Site Reliability Engineer, GeForce NOW

NVIDIA seeks a Senior Site Reliability Engineer to join its GeForce NOW team, with a focus on high reliability and uptime for GPU cloud gaming services. The role involves building tools that improve SRE observability, migrating systems to Kubernetes with VMI setup, debugging incidents quickly, and automating daily tasks. The ideal candidate has extensive experience in large-scale distributed microservices environments, strong Kubernetes skills, and proficiency in monitoring systems such as Datadog and Prometheus. They should also have managed multi-region cloud deployments on hyperscalers, designed deployment pipelines using GitHub Actions or GitLab CI, and be able to code in Go, Python, or Bash. The position requires a deep understanding of production-grade automation and tooling, as well as experience with AI tools for anomaly detection and LLM-assisted debugging.

What you'll do

  • Develop tools to enhance SRE observability for better system monitoring.
  • Migrate systems to Kubernetes, setting up VMI configurations and solving related issues.
  • Debug and triage incidents and user-reported issues promptly and effectively.
  • Automate daily tasks by scripting and building tooling for the team's new and existing processes.
  • Support services pre-launch through design consulting, capacity management, and reviews.
  • Participate in on-call rotations to handle production system alerts and service degradations.

What we're looking for

  • 8+ years of site reliability engineering experience with large-scale distributed microservices.
  • Strong Kubernetes background, including complex VMI setup on K8s.
  • Experience managing multi-region cloud deployments on AWS, GCP, or Azure.
  • Proficiency in production-grade coding with Go, Python, or Bash scripting.
  • Expertise in monitoring systems like Datadog, Prometheus, and Alertmanager.
  • Leadership in production improvements through change management and automation.
  • Production on-call experience responding to high-severity infrastructure alerts.
