Senior Systems Software Engineer, Data Center Infrastructure Management - EngOps

Nvidia

Remote · Austin, TX · Santa Clara, CA · Actively hiring · Posted 32 days ago · $152,000–$241,500 / year

At a glance

AI generated

TL;DR

As an EngOps Engineer on our advanced infrastructure software team, you will take ownership of daily cluster failures and issues to keep clusters performant and available. You will work closely with developers to support deployment and to debug hardware and Infrastructure Manager solutions in data center environments. Key responsibilities include managing updates to site controller management nodes, rolling out and rolling back cluster software and firmware updates, and deploying services in Kubernetes. Ideal candidates have 5+ years of experience deploying and administering clusters, servers, switches, and related infrastructure, expertise in hardware management protocols such as Redfish and IPMI, and familiarity with OpenStack and Foreman. The role also requires a deep understanding of server, rack, and network topologies, proficiency in scripting for automation, and proficiency in observability tools such as Grafana.

Skills

Kubernetes · Grafana · Redfish · IPMI · BMC · OpenStack · Foreman · Python · Shell scripting · Docker · CI/CD · Terraform · AWS · PostgreSQL · Prometheus · NVIDIA DGX systems · High Performance Computing · Deep Learning · GPU management

What you'll do

  • Troubleshoot and resolve daily cluster failures to ensure optimal performance.
  • Manage updates for site controller management nodes efficiently.
  • Roll out and roll back software and firmware updates to minimize disruptions.
  • Collaborate with the Infrastructure Service team on deployment and debugging support.
  • Configure and debug complex data center networks for high performance.
  • Develop scripts to automate recovery actions for management controllers.

What we're looking for

  • 5+ years of experience deploying and administering clusters, servers, and switches.
  • BS or MS in Computer Science/Engineering or equivalent practical experience.
  • Experience with Kubernetes deployment and complex data center networks.
  • Proficiency with hardware management protocols (Redfish, IPMI) and baseboard management controllers (BMCs).
  • Expertise in GPU-focused hardware and software solutions like DGX systems.
  • Understanding of server, rack, and network topologies and their interactions.

Market check

Salary context

This $152,000–$241,500 range sits above 67% of similar postings on FindRole.

Peer median band

$140,000–$234,700

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$149,750–$219,200

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Software Engineer - Datacenter Systems

Nvidia

Remote (Santa Clara, CA, US) · 9 days ago · $184,000–$287,500
Python · Rust · C++ · Shell · Kubernetes · Jenkins · GitLab · Ansible · GitOps · Prometheus · Grafana · CI/CD · Linux · Slurm · NVIDIA DGX systems · Docker · Terraform · AWS · Azure · Google Cloud Platform

Senior Software Architect - Data Center Systems

Nvidia

Remote (Santa Clara, CA, US) · 15 days ago · $224,000–$356,500
Deep Learning · HPC · Redfish · IPMI · MCTP · PLDM · RDE · Kubernetes · Docker · CI/CD · AWS · Azure · Google Cloud Platform · PostgreSQL · Python · C++ · Networking Technologies · Storage Technologies · Terraform · Prometheus · Grafana

Senior Cloud Infrastructure Engineer

Abbott

US · 35 days ago · $78,000–$156,000
Microsoft Azure · Kubernetes · GitOps · Helm · Flux CD · Agile methodologies · Infrastructure as Code (IaC) · GitHub Actions · Docker · CI/CD · Azure Kubernetes Service (AKS) · ADF · Storage · SFTP · Prometheus · Grafana · Shell scripting · GitHub · Jenkins · Artifactory · Jira · Confluence