Manager, Next-Gen AI Cluster Validation

Nvidia

Remote · Actively hiring

Remote, USA · Santa Clara, CA · Posted 11 days ago · $224,000–$356,500 / year

At a glance

AI generated

TL;DR

Join NVIDIA as a senior technical leader responsible for developing next-generation AI supercomputing systems. You will lead a distributed team in early deployments of new compute and networking technologies, collaborating with external partners to plan large-scale rollouts. Day-to-day work includes leading next-generation system design, building automation platforms, creating tooling and documentation for large-scale supercomputing deployments, and working closely with internal teams on cluster architecture and integration. Ideal candidates have a BS (Master's or PhD preferred) in applied science or engineering, 8+ years of experience in HPC or machine learning, a record of leading distributed engineering teams, proficiency in software development and system automation with Go, Python, or Ansible, expertise in multi-GPU and multi-node deep learning workloads, knowledge of high-performance datacenter networking such as InfiniBand and RoCE, and familiarity with open-source monitoring tools such as Prometheus and Grafana.

Skills

Go, Python, Ansible, Prometheus, Grafana, InfiniBand, RoCE, HPC, AI, GPU, CI/CD, Linux, Networking, Storage, Supercomputing, Machine Learning, Deep Learning, Kubernetes, Terraform, AWS, Azure

What you'll do

Lead the development of next-generation system designs integrating new compute, networking, storage, and software systems.
Build and support platforms for software development, systems automation, and performance engineering in AI and HPC.
Develop tooling and documentation to aid large-scale supercomputing systems deployment both internally and externally.
Collaborate with internal teams on cluster architecture, bring-up, and integration of new technologies and products at scale.
Work closely with partners and customers to support the deployment and validation of clusters based on NVIDIA reference architectures.

What we're looking for

BS (Master's or PhD preferred) in Applied Science or Engineering with 8+ years of experience in HPC or machine learning fields.
Proven ability to lead high-performing engineering teams across distributed groups.
Proficiency in software development and system automation using Go, Python, or Ansible.
Experience leading the build of large-scale HPC compute and storage systems.
Expertise in deep learning applications with multi-GPU and multi-node workloads.
Knowledge of high-performance datacenter networking technologies like InfiniBand and RoCE.
Familiarity with open-source monitoring tools such as Prometheus and Grafana.

Market check

Salary context

This $224,000–$356,500 range sits above 90% of similar postings on FindRole.

Peer median band

$181,700–$262,400

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$184,587–$249,490

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

Similar roles

Senior Datacenter Technical Program Manager, At-Scale AI Clusters

Nvidia

Remote (Santa Clara, CA, US) · 14 days ago · $168,000–$258,750

Prometheus, Grafana, Splunk, Modbus, BACnet, Kubernetes, Terraform, AWS, PostgreSQL, CI/CD, Python, Docker, Git, Jenkins

Manager, Generative AI Advisory and Oversight

Capital One Financial

McLean, VA, US · 31 days ago · $197,300–$225,100

AWS, Azure, Google Cloud, NIST AI Risk Management Framework, ISO 42001, OWASP Top 10 for LLM, MITRE ATLAS, CI/CD, Python, PostgreSQL, Kubernetes, Terraform, Docker, Prometheus, Grafana

AI & GenAI Data Scientist – Senior Manager

PwC

New York - 300 Madison Avenue, US · 109 days ago · $124,000–$280,000

Python, Pandas, NLTK, LangChain, Semantic Kernel, SQL, NoSQL, Azure, AWS, Google Cloud, Git, CI/CD, RAG, Vectorization, Embedding, Prompt Engineering, GenAI, LLM Development, Unit Testing, Integration Testing, End-to-End Testing

Sr. Manager, Applied Science - Generative AI Data Research

Adobe

San Jose, US · 60 days ago · $242,600–$351,225

Python, TensorFlow, PyTorch, Kubernetes, AWS, Google Cloud, Azure, CI/CD, Docker, PostgreSQL, MongoDB, Apache Hadoop, Apache Spark, Git, Jenkins, Prometheus, Grafana, MLOps

Manager, Deep Learning Algorithms

Nvidia

Santa Clara, CA, US · 140 days ago · $224,000–$356,500

Python, TensorFlow, PyTorch, Large Language Models (LLMs), Large Visual-Language Models (VLMs), TensorRT-LLM, vLLM, SGLang, JIRA, Microsoft Project, Git, GitHub, CI/CD, Docker, Kubernetes, AWS, NVIDIA GPU, Deep Learning, Performance Tuning, Inference Optimization

Manager, Deep Learning Algorithms

Nvidia

Santa Clara, CA, US · 122 days ago · $184,000–$287,500

Python, C++, TensorFlow, PyTorch, Large Language Models (LLMs), Large Visual-Language Models (VLMs), Inference platforms, TensorRT-LLM, vLLM, SGLang, JIRA, Microsoft Project, CI/CD, Git, GitHub, Docker, Kubernetes, AWS, GCP, Azure