Senior AI and ML HPC Cluster Engineer

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA · Austin, TX
Salary: $152,000–$241,500 / yr
Posted: 43 days ago

Market check

Salary context

Competitive pay

How this pay compares to similar roles

Pay scale from $139k to $273k: this role $197k, similar roles $215k.

This role pays less than 65% of similar roles do. Most similar roles pay $184,612–$246,150. At the midpoint, this role pays about $197k, versus about $215k for comparable roles.

Based on 239 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior AI and ML HPC Cluster Engineer

Join our GPU AI/HPC Infrastructure team as a senior technical leader responsible for designing and implementing GPU compute clusters for deep learning, high-performance computing, and other computationally intensive workloads. You will lead compute, networking, and storage design for large-scale environments, optimize resource utilization across heterogeneous systems, and help evolve our private/public cloud strategy. Day to day, you will deploy and manage HPC systems, develop scalable automation for GPU-accelerated computing, build AI and ML clusters both on-premises and in the cloud, and support researchers with performance analysis and optimization. Essential skills include experience with job schedulers such as Slurm or Kubernetes (K8s), Linux distributions such as CentOS/RHEL and Ubuntu, cluster configuration tools such as Ansible, container technologies including Docker and Singularity, Python programming, bash scripting, and MPI workflows, along with familiarity with NVIDIA GPUs, CUDA, NCCL, MLPerf, InfiniBand, distributed storage systems, and deep learning frameworks.

Skills

Slurm, Kubernetes, Docker, Ansible, Python, Bash, MPI, NVIDIA GPUs, CUDA, NCCL, PyTorch, TensorFlow, Lustre, InfiniBand, IPoIB, RDMA, Puppet, Salt, Singularity, Podman, Charliecloud

What you'll do

Lead the design and implementation of GPU compute clusters for demanding workloads.
Develop scalable automation solutions to enhance GPU-accelerated computing ecosystems.
Build and maintain AI/ML heterogeneous clusters both on-premises and in the cloud.
Conduct performance analysis and optimizations for researchers' workloads.
Proactively identify and resolve issues before they impact system operations.

What we're looking for

5+ years of experience designing and operating large-scale compute infrastructure.
Expertise in AI/HPC job schedulers (Slurm, K8s, PBS) and container technologies (Docker, Singularity).
Proficiency in Linux administration (CentOS/RHEL, Ubuntu), Python programming, and bash scripting.
Solid understanding of cluster configuration management tools such as Ansible, Puppet, and Salt.
Experience analyzing and tuning performance for AI/HPC workloads using MPI and deep learning frameworks.

Similar roles

Senior HPC Performance Engineer - AI for Science at Scale

Nvidia

Santa Clara, CA · 109 days ago · $184,000–$287,500

CUDA, Python, C++, PyTorch, JAX, Warp, HPC, Distributed Learning, Atomistic Modeling, CI/CD, Git, Linux, NVIDIA DGX Systems, GPU Programming, Parallel Computing, Data Structures, Algorithm Design, Machine Learning Frameworks, Scientific AI Codebases, Computational Chemistry, Digital Biology

Senior HPC Cluster Engineer

Nvidia

Santa Clara, CA · 87 days ago · $152,000–$241,500

Slurm, Kubernetes, Python, Bash, Docker, Enroot, Prometheus, Grafana, Linux, RHEL, Ubuntu, MPI, NCCL, CUDA, NVIDIA GPUs, InfiniBand, RDMA, RoCE, Lustre, GPFS, Ansible, MLPerf

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA) · 54 days ago · $148,000–$235,750

Linux, Bash, Python, Ansible, SLURM, LSF, UGE, Kubernetes, HPL, NCCL, MLPerf, InfiniBand, MPI, Lustre, GPFS, BCM, Terraform, CI/CD

Remote

Senior AI and HPC Observability Engineer

Nvidia

Santa Clara, CA · 96 days ago · $152,000–$241,500

Python, Go, Java, Kubernetes, OpenTelemetry, Prometheus, Kafka, Spark, Flink, PromQL, Docker, CI/CD, Git, Linux, AWS, GCP, Azure

Senior AI/ML Engineer

General Motors (GM)

Remote (Mountain View, CA) · 4 days ago · $170,600–$261,300

Python, Transformers, Generative AI, Multimodal Systems, AutoML, Quantization, Model Distillation, Architecture Search, CVPR, ICML, NeurIPS, IJCAI, KDD, Robotics Conference Papers, AV ADAS Experience

Remote Hybrid

Senior Software Architect - Deep Learning and HPC Communications

Nvidia

Remote (Santa Clara, CA) · 9 days ago · $184,000–$287,500

C/C++, MPI, NCCL, NVSHMEM, UCX, CUDA, Linux, InfiniBand, RoCE, NVLink, PyTorch, TensorFlow, HPC, Networking, Simulation, Quantitative Modeling, SHMEM, Parallel Programming, Deep Learning Pods

Remote
