HPC Operations Engineer

Nvidia

Hybrid

Quick summary

Work type: Hybrid
Location: Santa Clara, CA
Salary: $124,000–$195,500 / yr
Posted: 59 days ago

Market check

Salary context

Below market

How this pay compares to similar roles

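Chart: midpoint pay for this role (about $160k) versus similar roles (about $180k), on a scale from $112k to $237k, with a shaded band marking where most similar roles pay.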

This role's pay is lower than that of 67% of similar roles. Most pay $144,550–$214,562, the shaded band above. At the midpoint, this role pays about $160k versus about $180k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · HPC Operations Engineer

As an HPC Operations Engineer on the Hardware Infrastructure Farm team at NVIDIA, you will lead the design and implementation of advanced compute clusters that power all silicon development. You will keep these clusters reliable, efficient, and performant while driving foundational improvements through automation to make engineers more productive. Day-to-day responsibilities include troubleshooting support requests, enhancing deployment automation, managing OS configurations, and collaborating with specialist teams to optimize infrastructure for chip development. You will work with a diverse team that values intellectual curiosity and problem-solving, using tools like Docker, Python, bash scripting, and Ansible in a Red Hat Linux environment. Familiarity with job schedulers, high-speed networking, and distributed storage systems is also essential for this role, which plays a critical part in advancing NVIDIA’s next-generation chip development process.

What you'll do

  • Troubleshoot incoming support requests in a large-scale HPC environment.
  • Contribute to deployment automation and operational monitoring improvements.
  • Ensure correct OS configuration on compute servers for reliability.
  • Collaborate with specialist teams to resolve complex issues efficiently.
  • Apply your expertise to improve infrastructure utilization across the chip development process.

What we're looking for

  • Proficiency in administering CentOS/RHEL Linux distributions and an understanding of container technologies like Docker.
  • Excellent problem-solving skills with experience analyzing complex systems and implementing scalable solutions.
  • BS in Computer Science or equivalent degree with 2+ years of relevant post-degree experience.
  • Solid understanding of cluster configuration management tools such as Ansible and key Linux technologies.
  • Proficiency in Python and UNIX scripting languages (bash), plus Perl for maintaining legacy scripts.
  • Familiarity with job scheduler administration and building/operating large-scale compute infrastructure.
  • Knowledge of high-speed networking and fast distributed storage systems.

More like this

Similar roles

HPC User Support Engineer

Argonne National Laboratory

Remote (Lemont, IL) 10 days ago $69,750–$108,810
Python C/C++ FORTRAN UNIX PBSPro Git Jenkins Docker MPI OpenMP PostgreSQL HPC CI/CD
Remote

Senior HPC Cluster Engineer

Nvidia

Santa Clara, CA 87 days ago $152,000–$241,500
Slurm Kubernetes Python Bash Docker Enroot Prometheus Grafana Linux RHEL Ubuntu MPI NCCL CUDA NVIDIA_GPUs InfiniBand RDMA RoCE Lustre GPFS Ansible MLPerf

Senior HPC Storage Architect & Engineer

Lam Research

Fremont, CA 144 days ago $114,000–$253,000
Lustre GPFS/Spectrum Scale VAST Data WEKA NetApp ONTAP FlexCache AWS Azure GCP InfiniBand RoCE NVMe-over-Fabrics SLURM xCAT Warewulf Ansible Terraform Python YAML Kubernetes CSI S3 IaC CI/CD
Hybrid

Solutions Architect, HPC Systems Engineer

Nvidia

Santa Clara, CA 134 days ago $184,000–$287,500
NVIDIA CUDA Docker Kubernetes Linux C/C++ InfiniBand RoCE DPU ARM Ethernet DevOps MLOps Python Go Terraform AWS CI/CD Prometheus Grafana
Hybrid

HPC Systems Administration Specialist

Argonne National Laboratory

Lemont, IL 166 days ago $69,750–$108,810
Linux Spack Lmod Singularity Python CI pipelines Make CMake Autotools GCC Intel Compilers LLVM YAML Podman Git

HPC Systems Administration Specialist

Argonne National Laboratory

Lemont, IL 129 days ago $69,750–$108,810
Linux Spack Lmod Singularity Version control systems Compilers GCC Intel LLVM Make CMake Autotools Python CI pipelines YAML Podman MPI CUDA BLAS FFTW