Senior HPC Cluster Engineer

Nvidia

Actively hiring
Us, Ca, Santa Clara, US Posted 78 days ago $152,000$241,500 / year

At a glance

AI generated

TL;DR

Join our engineering team as an HPC Cluster Engineer, where you will design, deploy, and manage GPU Compute Clusters for EDA and high-performance computing workloads. Your day-to-day responsibilities include developing scalable automation solutions, improving infrastructure provisioning through automation, and providing technical leadership for large-scale HPC systems. You will collaborate with researchers to optimize their EDA workloads and build innovative tooling to enhance performance at scale. The ideal candidate has a Bachelor’s degree in Computer Science or Electrical Engineering and 5+ years of experience with cluster configuration management tools like BCM or Ansible, AI/HPC job schedulers such as Slurm, and container technologies including Enroot and Docker. Proficiency in Python, Bash, and Linux distributions is essential, along with expertise in analyzing and tuning performance for EDA workloads. Familiarity with NVIDIA GPUs, CUDA Programming, and high-speed networking technologies like InfiniBand and RDMA is a plus.

Skills

Slurm Kubernetes Python Bash Docker Enroot Prometheus Grafana Linux RHEL Ubuntu MPI NCCL CUDA NVIDIA_GPUs InfiniBand RDMA RoCE Lustre GPFS Ansible MLPerf

What you'll do

  • Develop and enhance GPU-accelerated computing ecosystem with scalable automation solutions.
  • Improve infrastructure provisioning, management, and observability through continuous automation.
  • Provide technical leadership for managing large-scale HPC systems, including deployment of compute resources.
  • Support researchers in running EDA workloads by conducting performance analysis and optimizations.
  • Conduct root cause analysis to suggest corrective actions and proactively fix issues.
  • Build innovative tooling to accelerate researchers' velocity and enhance software performance at scale.

What we're looking for

  • Minimum 5 years experience crafting and operating large-scale compute infrastructure.
  • Expertise with AI/HPC job schedulers and orchestrators such as Slurm or K8s.
  • Proficiency in Linux distributions like Rocky/Centos/RHEL, Ubuntu, Docker, Enroot.
  • Strong skills in Python and Bash scripting for automation and management tasks.
  • Experience analyzing and tuning performance for EDA workloads and complex systems.
  • Excellent communication and collaboration skills across various teams and individuals.
  • Passion for continual learning and staying ahead of new technologies in HPC fields.

Market check

Salary context

This $152,000–$241,500 range sits above 62% of similar postings on FindRole.

Peer median band

$148,000$235,175

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,400$235,750

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 802 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 798 roles with salary data.

Most-posted roles

View all roles at Nvidia

More like this

Similar roles

Senior HPC Performance Engineer - AI for Science at Scale

Nvidia

Us, Ca, Santa Clara, US 100 days ago $184,000$287,500
CUDA Python C++ PyTorch JAX Warp HPC Distributed Learning Atomistic Modeling CI/CD Git Linux NVIDIA DGX Systems GPU Programming Parallel Computing Data Structures Algorithm Design Machine Learning Frameworks Scientific AI Codebases Computational Chemistry Digital Biology

Senior HPC Performance Engineer

Nvidia

Remote (Us, Or, Remote, US) 41 days ago $184,000$287,500
Fortran C C++ OpenACC OpenMP MPI CUDA Performance_analysis Parallel_programming Linear_algebra Numerical_methods Assembly_language Debugging Porting
Remote

Senior HPC Storage Engineer

Nvidia

Us, Ca, Santa Clara, US 67 days ago $184,000$287,500
Python Docker Ceph Weka.io Vast Lustre GPFS CUDA NCCL PyTorch TensorFlow Bash CentOS RHEL Ubuntu SDN MLPerf NVIDIA GPUs HDDs SSDs NVMe

Senior HPC Storage Architect & Engineer

Lam Research

Fremont, Ca,Us, US 135 days ago $114,000$253,000
Lustre GPFS/Spectrum Scale VAST Data WEKA NetApp ONTAP FlexCache AWS Azure GCP InfiniBand RoCE NVMe-over-Fabrics SLURM xCAT Warewulf Ansible Terraform Python YAML Kubernetes CSI S3 IaC CI/CD

Senior HPC and LSF Operations Engineer

Nvidia

Us, Ca, Santa Clara, US 78 days ago $152,000$241,500
LSF Slurm Linux CentOS RHEL Docker Singularity Podman HPC Reliability Engineering Metrics Collection Monitoring Pipelines Alerting Strategies Performance Dashboards Container Technologies Job Scheduling Systems

Senior HPC Solutions Architect

Nvidia

Remote (Us, Ca, Santa Clara, US) 48 days ago $184,000$287,500
Python C++ CUDA SLURM Linux BMC PCIe Network_Adapters InfiniBand DPU RoCE ARM Linux_Kernel Drivers SDN C
Remote