HPC Operations Engineer

Nvidia

Hybrid

Quick summary

Work type: Hybrid
Location: Santa Clara, CA
Salary: $124,000–$195,500 / yr
Posted: 59 days ago

Market check

Salary context

Below market

How this pay compares to similar roles

Similar $180k

This role $160k

$112k most similar roles pay here $237k

This role pays less than 67% of similar roles. Most pay $144,550–$214,562 — the shaded band above. At the midpoint, this role pays about $160k versus about $180k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

Most-posted roles

View all roles at Nvidia

At a glance

TL;DR · HPC Operations Engineer

Apply Now Log in to save

As a HPC Operations Engineer on the Hardware Infrastructure Farm team at NVIDIA, you will lead the design and implementation of advanced compute clusters that power all silicon development. Your role involves ensuring high reliability, efficiency, and performance in these clusters while driving foundational improvements through automation to enhance engineer productivity. Day-to-day responsibilities include troubleshooting support requests, enhancing deployment automation, managing OS configurations, and collaborating with specialist teams to optimize infrastructure for chip development. You will work with a diverse team that values intellectual curiosity and problem-solving, using tools like Docker, Python, bash scripting, and Ansible in a Red Hat Linux environment. Additionally, familiarity with job schedulers, high-speed networking, and distributed storage systems is essential for this role, which plays a critical part in advancing NVIDIA’s next-generation chip development process.

Skills

Centos RHEL Docker Python bash Ansible NFS LDAP DNS TCP/IP SLURM FlexLM Perl InfiniBand RDMA RoCE Lustre GPFS

What you'll do

Troubleshoot incoming support requests in a large-scale HPC environment.
Contribute to deployment automation and operational monitoring improvements.
Ensure correct OS configuration on compute servers for reliability.
Collaborate with specialist teams to resolve complex issues efficiently.
Improve chip development process infrastructure utilization through expertise.

What we're looking for

Proficient in administering CentOS/RHEL Linux distributions and understanding container technologies like Docker.
Excellent problem-solving skills with experience analyzing complex systems and implementing scalable solutions.
BS in Computer Science or equivalent degree with 2+ years of relevant post-degree experience.
Solid understanding of cluster configuration management tools such as Ansible and key Linux technologies.
Proficiency in Python, UNIX scripting languages (bash), Perl for maintaining legacy scripts.
Familiarity with job scheduler administration and building/operating large-scale compute infrastructure.
Knowledge of high-speed networking and fast distributed storage systems.

Similar roles

HPC User Support Engineer

Argonne National Laboratory

Remote (Lemont, IL) 10 days ago $69,750–$108,810

Python C/C++ FORTRAN UNIX PBSPro Git Jenkins Docker MPI OpenMP PostgreSQL HPC CI/CD

Remote

Save

Senior HPC Cluster Engineer

Nvidia

Santa Clara, CA 87 days ago $152,000–$241,500

Slurm Kubernetes Python Bash Docker Enroot Prometheus Grafana Linux RHEL Ubuntu MPI NCCL CUDA NVIDIA_GPUs InfiniBand RDMA RoCE Lustre GPFS Ansible MLPerf

Save

Senior HPC Storage Architect & Engineer

Lam Research

Fremont, CA 144 days ago $114,000–$253,000

Lustre GPFS/Spectrum Scale VAST Data WEKA NetApp ONTAP FlexCache AWS Azure GCP InfiniBand RoCE NVMe-over-Fabrics SLURM xCAT Warewulf Ansible Terraform Python YAML Kubernetes CSI S3 IaC CI/CD

Hybrid

Save