Manager, Next-Gen AI Cluster Validation

Nvidia

Remote Actively hiring
Remote, USA · Santa Clara, CA Posted 11 days ago $224,000$356,500 / year

At a glance

AI generated

TL;DR

Join NVIDIA as a senior technical leader responsible for developing next-generation AI supercomputing systems. You will lead a distributed team in early deployments of new compute and networking technologies, collaborating with external partners to plan large-scale rollouts. Your daily tasks include leading system design development, building automation platforms, creating tooling and documentation for large-scale supercomputing, and working closely with internal teams on cluster architecture and integration. Ideal candidates have an advanced degree or equivalent experience in applied science or engineering, 8+ years of industry experience including technical leadership roles, proficiency in languages like Go, Python, and Ansible, and expertise in HPC, deep learning applications, high-performance datacenter networking, and open-source monitoring technologies.

Skills

Go Python Ansible Prometheus Grafana InfiniBand RoCE HPC AI GPU CI/CD Linux Networking Storage Supercomputing Machine_Learning Deep_Learning Kubernetes Terraform AWS Azure

What you'll do

  • Lead the development of next generation system designs integrating new compute, networking, storage, and software systems.
  • Build and support platforms for software development, systems automation, and performance engineering in AI and HPC.
  • Develop tooling and documentation to aid large-scale supercomputing systems deployment both internally and externally.
  • Collaborate with internal teams on cluster architecture, bringup, and integration of new technologies and products at scale.
  • Work closely with partners and customers to support the deployment and validation of clusters based on NVIDIA reference architectures.

What we're looking for

  • BS (Masters or PhD preferred) in Applied Science or Engineering with 8+ years experience in HPC or machine learning fields.
  • Proven ability to lead high-performing engineering teams across distributed groups.
  • Proficiency in software development and system automation using Go, Python, or Ansible.
  • Experience leading the build of large-scale HPC compute and storage systems.
  • Expertise in deep learning applications with multi-GPU and multi-node workloads.
  • Knowledge of high-performance datacenter networking technologies like InfiniBand and RoCE.
  • Familiarity with open-source monitoring tools such as Prometheus and Grafana.

Market check

Salary context

This $224,000–$356,500 range sits above 90% of similar postings on FindRole.

Peer median band

$181,700$262,400

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$184,587$249,490

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

Most-posted roles

View all roles at Nvidia

More like this

Similar roles

Manager, Generative AI Advisory and Oversight

Capital One Financial

Mclean, Va, US 31 days ago $197,300$225,100
AWS Azure Google Cloud NIST AI Risk Management Framework ISO 42001 OWASP Top 10 for LLM MITRE ATLAS CI/CD Python PostgreSQL Kubernetes Terraform Docker Prometheus Grafana

AI & GenAI Data Scientist – Senior Manager

PWC

New York - 300 Madison Avenue, US 109 days ago $124,000$280,000
Python Pandas NLTK Langchain Semantic Kernel SQL NoSQL Azure AWS Google Cloud Git CI/CD RAG Vectorization Embedding Prompt Engineering GenAI LLM Development Unit Testing Integration Testing End-to-End Testing

Manager, Deep Learning Algorithms

Nvidia

Us, Ca, Santa Clara, US 140 days ago $224,000$356,500
Python TensorFlow PyTorch Large Language Models (LLMs) Large Visual-Language Models (VLMs) TensorRT-LLM vLLM SGLang JIRA Microsoft Project Git GitHub CI/CD Docker Kubernetes AWS NVIDIA GPU Deep Learning Performance Tuning Inference Optimization

Manager, Deep Learning Algorithms

Nvidia

Us, Ca, Santa Clara, US 122 days ago $184,000$287,500
Python C++ TensorFlow PyTorch Large Language Models (LLMs) Large Visual-Language Models (VLMs) Inference platforms TensorRT-LLM vLLM SGLang JIRA Microsoft Project CI/CD Git GitHub Docker Kubernetes AWS GCP Azure