Senior AI and ML HPC Cluster Engineer

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA · Austin, TX
Salary: $152,000–$241,500 / yr
Posted: 43 days ago

Market check

Salary context

Competitive pay

How this pay compares to similar roles

[Chart: midpoint pay for this role ($197k) and for similar roles ($215k), on a scale from $139k to $273k, with the range most similar roles pay shaded.]

65% of similar roles pay more than this one. Most similar roles pay $184,612–$246,150. At the midpoint, this role pays about $197k, versus about $215k for comparable roles.

Based on 239 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior AI and ML HPC Cluster Engineer

Join our GPU AI/HPC Infrastructure team as a senior technical leader responsible for designing and implementing GPU compute clusters for deep learning, high-performance computing, and other compute-intensive workloads. You will take on the strategic challenges of compute, networking, and storage design in large-scale environments, optimize resource utilization across heterogeneous systems, and evolve our private/public cloud strategy.

Day to day, you will deploy and manage HPC systems, develop scalable automation for GPU-accelerated computing, build AI and ML clusters both on-premises and in the cloud, and support researchers with performance analysis and optimization.

Essential skills include experience with job schedulers such as Slurm or K8s, Linux distributions such as CentOS/RHEL and Ubuntu, cluster configuration tools such as Ansible, and container technologies including Docker and Singularity; Python programming, bash scripting, and MPI workflows; and familiarity with NVIDIA GPUs, CUDA, NCCL, MLPerf, InfiniBand, distributed storage systems, and deep learning frameworks.

What you'll do

  • Lead the design and implementation of GPU compute clusters for demanding workloads.
  • Develop scalable automation solutions to enhance GPU-accelerated computing ecosystems.
  • Build and maintain heterogeneous AI/ML clusters both on-premises and in the cloud.
  • Conduct performance analysis and optimizations for researchers' workloads.
  • Proactively identify and resolve issues before they impact system operations.

What we're looking for

  • 5+ years of experience designing and operating large-scale compute infrastructure.
  • Expertise in AI/HPC job schedulers (Slurm, K8s, PBS) and container technologies (Docker, Singularity).
  • Proficiency in Linux administration (CentOS/RHEL, Ubuntu), Python programming, and bash scripting.
  • Solid understanding of cluster configuration management tools such as Ansible, Puppet, and Salt.
  • Experience analyzing and tuning performance for AI/HPC workloads using MPI and deep learning frameworks.

More like this

Similar roles

Senior HPC Performance Engineer - AI for Science at Scale

Nvidia

Santa Clara, CA · 109 days ago · $184,000–$287,500
CUDA, Python, C++, PyTorch, JAX, Warp, HPC, Distributed Learning, Atomistic Modeling, CI/CD, Git, Linux, NVIDIA DGX Systems, GPU Programming, Parallel Computing, Data Structures, Algorithm Design, Machine Learning Frameworks, Scientific AI Codebases, Computational Chemistry, Digital Biology

Senior HPC Cluster Engineer

Nvidia

Santa Clara, CA · 87 days ago · $152,000–$241,500
Slurm, Kubernetes, Python, Bash, Docker, Enroot, Prometheus, Grafana, Linux, RHEL, Ubuntu, MPI, NCCL, CUDA, NVIDIA GPUs, InfiniBand, RDMA, RoCE, Lustre, GPFS, Ansible, MLPerf

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA) · 54 days ago · $148,000–$235,750
Linux, Bash, Python, Ansible, SLURM, LSF, UGE, Kubernetes, HPL, NCCL, MLPerf, InfiniBand, MPI, Lustre, GPFS, BCM, Terraform, CI/CD
Remote

Senior AI/ML Engineer

General Motors (GM)

Remote (Mountain View, CA) · 4 days ago · $170,600–$261,300
Python, Transformers, Generative AI, Multimodal Systems, AutoML, Quantization, Model Distillation, Architecture Search, CVPR, ICML, NeurIPS, IJCAI, KDD, Robotics Conference Papers, AV/ADAS Experience
Remote Hybrid