Principal AI and ML Infra Software Engineer, GPU Clusters

Nvidia

Quick summary

Work type
On-site
Location
Santa Clara, CA; Redmond, WA
Salary
$272,000–$431,250 / yr
Posted
50 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

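[Chart: pay range for similar roles on an axis from $140k to $462k, with a shaded band marking where most similar roles pay. Midpoints: similar roles $210k, this role $352k.]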

This role pays more than 99% of similar roles. Most pay $173,150–$246,150 — the shaded band above. At the midpoint, this role pays about $352k versus about $210k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 980 open roles on FindRole.

Listed pay typically runs $168,000–$270,250 across 966 roles with salary data.

At a glance

TL;DR · Principal AI and ML Infra Software Engineer, GPU Clusters

Join NVIDIA’s Hardware Infrastructure team as a Principal AI and ML Infra Software Engineer. You will work closely with researchers to identify and fix infrastructure gaps in GPU clusters and drive the development of scalable solutions. Day to day, you will optimize infrastructure performance for high availability and efficient resource utilization, and work with teams such as data engineering and DevOps to build an integrated ecosystem. The role calls for extensive experience in AI/ML and HPC systems, including hands-on knowledge of accelerated computing, storage, scheduling, orchestration, networking, and container technologies. Proficiency in Python, Go, and Bash and familiarity with cloud platforms are essential, as is expertise in distributed training operations using frameworks like PyTorch and NeMo. The role also demands a commitment to continuous learning and to keeping up with the latest AI/ML advances.

What you'll do

  • Identify and address infrastructure deficiencies by working closely with AI/ML research teams.
  • Lead initiatives to improve researcher efficiency and develop long-term roadmaps.
  • Monitor and optimize the performance of GPU clusters for high availability and scalability.
  • Define and enhance measures of AI researcher efficiency to ensure measurable results.
  • Develop a cohesive AI/ML infrastructure ecosystem by collaborating with various technical teams.

What we're looking for

  • 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
  • Hands-on experience with HPC infrastructure including GPUs, storage solutions, scheduling tools, high-speed networking, and containers.
  • Ability to oversee distributed training operations using PyTorch, NeMo, or JAX, and an understanding of AI/ML workflows.
  • Proficiency in Python, Go, Bash, and familiarity with major cloud computing platforms.
  • Dedication to continuous learning and staying updated on new technologies in the AI/ML infrastructure sector.
  • Excellent communication and collaboration skills for effective teamwork across diverse backgrounds.

More like this

Similar roles

Senior AI and ML HPC Cluster Engineer

Nvidia

Remote (Santa Clara, CA) +1 · 53 days ago · $152,000–$241,500
Slurm, Kubernetes, Docker, Ansible, Python, Bash, MPI, NVIDIA GPUs, CUDA, NCCL, PyTorch, TensorFlow, Lustre, InfiniBand, IPoIB, RDMA, Puppet, Salt, Singularity, Podman, Charliecloud
Remote

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA) · 64 days ago · $148,000–$235,750
Linux, Bash, Python, Ansible, SLURM, LSF, UGE, Kubernetes, HPL, NCCL, MLPerf, InfiniBand, MPI, Lustre, GPFS, BCM, Terraform, CI/CD
Remote

GPU Software Architecture Engineer, Graphics, Games, & ML

Apple Inc

Cupertino, CA · 60 days ago · $181,100–$318,400
CUDA, ROCm, C/C++, InfiniBand, RDMA, NCCL, PyTorch, JAX, TensorFlow, Distributed Systems, Parallel Computing, Performance Profiling, Pipeline Parallelism, Expert Parallelism, System Programming, ML Infrastructure, Python

Principal AI/ML Engineer, AV ML Infra

General Motors (GM)

Mountain View, California · 34 days ago · $275,800–$340,500
Python, Golang, C++, Google Cloud Platform, Microsoft Azure, Kubernetes, Kubeflow, Flyte, Airflow, RayServe, vLLM, Triton, PostgreSQL, NoSQL, CI/CD
Hybrid

GPU AI Compiler Engineer

Qualcomm

San Diego, CA · 46 days ago · $141,600–$212,400
C/C++, LLVM, SYCL, CUDA, OpenCL, MLIR, GPU architecture, Machine Learning, Graph Compiler, Data structures, Algorithms, Object-oriented programming