Senior Platform and EngOps Engineer - Cluster Operations

Nvidia

Actively hiring
Santa Clara, US Posted 15 days ago $176,000–$276,000 / year

At a glance

AI generated

TL;DR

As an EngOps or Platform Engineer on our High Performance Computing team, you will develop automated tools to deploy and maintain large GPU clusters interconnected via NVLink and InfiniBand, and keep them running smoothly through automated software updates and monitoring. You’ll troubleshoot daily cluster failures, coordinate with cross-functional teams across time zones, and roll out critical software and firmware updates with minimal disruption. The ideal candidate has 8+ years of experience deploying and administering large-scale infrastructure, expertise in Ansible, Python, and shell scripting, and a deep understanding of operating systems and high-performance applications. Familiarity with resource scheduling managers such as Slurm, GPU-focused hardware such as DGX systems, and emergency response practices is essential, as is proficiency in Linux fundamentals and in designing robust metrics collection infrastructure.

Skills

Python Ansible Shell Linux Slurm Prometheus Grafana Docker Kubernetes Terraform AWS CI/CD NVLink InfiniBand PostgreSQL Git Jenkins

What you'll do

  • Develop automated tools to deploy and maintain extensive GPU clusters interconnected via NVLink and InfiniBand.
  • Implement DevOps tools to automate software updates and monitor cluster availability for seamless operations.
  • Troubleshoot daily cluster failures promptly to ensure optimal performance and minimal downtime.
  • Manage the rollout and rollback of cluster software and firmware updates with minimal disruptions.
  • Design robust metrics collection and alerting infrastructure for efficient monitoring and response.

What we're looking for

  • BS or MS in Computer Science or a related field, with 8+ years of cluster administration experience.
  • Expertise in automation tools like Ansible, Python scripting, and shell scripting.
  • Deep knowledge of operating systems, computer networks, and high-performance applications.
  • Experience managing large GPU clusters interconnected via NVLink and InfiniBand.
  • Proficiency in Linux fundamentals, resource scheduling managers (e.g., Slurm), and alerting tools.
  • Ability to collaborate effectively with cross-functional teams across multiple time zones.

Market check

Salary context

This $176,000–$276,000 range sits above 85% of similar postings on FindRole.

Peer median band

$125,000–$200,250

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$126,912–$200,312

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Engineer - Partner Platform

GEICO

Remote (CA Palo Alto Office, US) 53 days ago $105,000–$215,000
Java C# .NET Python Docker Kubernetes Azure AWS GCP SQL NoSQL RESTful Web Services Angular React HTML-5 JavaScript TypeScript XML JSON GraphQL CI/CD

Senior Platform Engineer

Arm Holdings

Austin, Texas, US 10 days ago $161,500–$218,500
Kubernetes Terraform Python Go CI/CD GitOps MCP Model Gateway RAG Systems LLM Observability Service Mesh Policy-as-Code Workload Identity Sandboxing Secure Runtime Environments Multi-Tenant Platform Designs Linux Cloud Infrastructure-as-Code Incident Management Demand Forecasting Production Readiness Practices Security Fundamentals Identity Secrets Management Access Control Network Segmentation Vulnerability Management Audit Logging

Senior Full-Stack Engineer, Enterprise GenAI

Adobe

San Jose, US 71 days ago $208,300–$301,600
JavaScript React Node.js REST APIs CI/CD AWS Kubernetes Docker Python PostgreSQL Git Swagger/OpenAPI OAuth JWT GraphQL Redis MongoDB Nginx GCP Azure Adobe I/O Runtime Adobe Experience Platform

Kubernetes Platform Engineer (IT Engineer Senior)

Qualcomm

San Diego, CA, US 30 days ago
Kubernetes Rancher RKE2 GKE EKS AKS Cilium Docker ContainerD git Github Python Go bash JIRA CKAD CKA CKS Portworx MetalLB Github Actions CI/CD

Senior Platform Engineering Lead

Citi

6400 Las Colinas Blvd, Irving, US 11 days ago $138,720–$208,080
Openshift API engineering CI/CD Python Java JavaScript SQL PostgreSQL Docker Kubernetes AWS Git Jenkins Prometheus Grafana DevOps Agile Scrum

Senior Atlassian Engineer

Leidos

1887 Alexandria VA, US 17 days ago $107,900–$195,050
Atlassian Jira Confluence Bitbucket DevSecOps CI/CD Python Groovy REST APIs Agile SAFe AWS Azure GCP ServiceNow AI ML Docker Kubernetes