Principal AI and ML Infra Software Engineer, GPU Clusters

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA; Redmond, WA
Salary: $272,000–$431,250 / yr
Posted: 50 days ago
Nearby: 99+ roles within 25 mi

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $140k to $462k, with a shaded band marking where most similar roles pay. Similar roles: about $210k. This role: about $352k.]

This role pays more than 99% of similar roles. Most similar roles pay $173,150–$246,150 (the shaded band in the chart above). At the midpoint, this role pays about $352k versus about $210k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 980 open roles on FindRole.

Listed pay typically runs $168,000–$270,250 across 966 roles with salary data.

At a glance

TL;DR · Principal AI and ML Infra Software Engineer, GPU Clusters

Join NVIDIA’s Hardware Infrastructure team as a Principal AI and ML Infra Software Engineer. You will work closely with researchers to identify and fix infrastructure deficiencies in GPU clusters and drive the development of scalable solutions. Day to day, you will optimize infrastructure performance for high availability and efficient resource utilization, and work with teams such as data engineering and DevOps to build an integrated ecosystem. The role calls for extensive experience in AI/ML and HPC systems, including hands-on knowledge of accelerated computing, storage, scheduling, orchestration, networking, and container technologies. Proficiency in Python, Go, and Bash and familiarity with cloud platforms are essential, as is expertise in distributed training operations using frameworks like PyTorch and NeMo. The role also demands a commitment to continuous learning and to staying current with AI/ML advancements.

Skills

Python, Go, Docker, Kubernetes, Slurm, PyTorch, NeMo, JAX, AWS, GCP, Azure, Lustre, GPFS, BeeGFS, InfiniBand, RoCE, Amazon EFA, HPC, GPU, CI/CD

What you'll do

Identify and address infrastructure deficiencies by working closely with AI/ML research teams.
Lead initiatives to improve researcher efficiency and develop long-term roadmaps.
Monitor and optimize the performance of GPU clusters for high availability and scalability.
Define and enhance measures of AI researcher efficiency to ensure measurable results.
Develop a cohesive AI/ML infrastructure ecosystem by collaborating with various technical teams.

What we're looking for

15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
Hands-on experience with HPC infrastructure including GPUs, storage solutions, scheduling tools, high-speed networking, and containers.
Ability to supervise distributed training operations using PyTorch, NeMo, or JAX, and an understanding of AI/ML workflows.
Proficiency in Python, Go, Bash, and familiarity with major cloud computing platforms.
Dedication to continuous learning and staying updated on new technologies in the AI/ML infrastructure sector.
Excellent communication and collaboration skills for effective teamwork across diverse backgrounds.

Similar roles

Senior AI and ML HPC Cluster Engineer

Nvidia

Remote (Santa Clara, CA) +1 · 53 days ago · $152,000–$241,500

Slurm, Kubernetes, Docker, Ansible, Python, Bash, MPI, NVIDIA GPUs, CUDA, NCCL, PyTorch, TensorFlow, Lustre, InfiniBand, IPoIB, RDMA, Puppet, Salt, Singularity, Podman, Charliecloud

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA) · 64 days ago · $148,000–$235,750

Linux, Bash, Python, Ansible, SLURM, LSF, UGE, Kubernetes, HPL, NCCL, MLPerf, InfiniBand, MPI, Lustre, GPFS, BCM, Terraform, CI/CD

GPU Software Architecture Engineer, Graphics, Games, & ML

Apple Inc

Cupertino, CA · 60 days ago · $181,100–$318,400

CUDA, ROCm, C/C++, InfiniBand, RDMA, NCCL, PyTorch, JAX, TensorFlow, Distributed Systems, Parallel Computing, Performance Profiling, Pipeline Parallelism, Expert Parallelism, System Programming, ML Infrastructure, Python

Sr Engineer, Machine Learning Engineering (Heterogenous SW, Adreno GPU)

Qualcomm

San Diego, CA · 42 days ago · $140,800–$211,200

Python, PyTorch, C++, CI/CD, Linux, Git, Docker, OpenCL, CUDA, Android, Windows, CPU, GPU, NPU, Model Quantization, Profiling, SDK Development, Heterogeneous Platforms, Agile Methodology

Principal AI/ML Engineer, AV ML Infra

General Motors (GM)

Mountain View, California · 34 days ago · $275,800–$340,500

Python, Golang, C++, Google Cloud Platform, Microsoft Azure, Kubernetes, Kubeflow, Flyte, Airflow, RayServe, vLLM, Triton, PostgreSQL, NoSQL, CI/CD

Hybrid

GPU AI Compiler Engineer

Qualcomm

San Diego, CA · 46 days ago · $141,600–$212,400

C/C++, LLVM, SYCL, CUDA, OpenCL, MLIR, GPU architecture, Machine Learning, Graph Compiler, Data structures, Algorithms, Object-oriented programming
