Senior Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA · Austin, TX · OR · Redmond, WA
Salary: $184,000–$287,500 / yr
Posted: 2 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: midpoint pay of $214k for similar roles versus $236k for this role, shown on a $165k–$301k range with a shaded band marking where most similar roles pay.]

This role pays more than 71% of similar roles, most of which pay $182,125–$246,150 (the shaded band in the chart). At the midpoint, this role pays about $236k versus about $214k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior Software Engineer, DGX Cloud AI Infrastructure

NVIDIA seeks a Senior Software Engineer to lead the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across its GPU platforms at massive scale. This hands-on role involves setting technical direction for communication libraries, model frameworks, and stacks to ensure efficient and reliable operation of large language models. Responsibilities include deep performance and reliability investigations on multi-GPU and multi-node deployments, defining benchmarking criteria, building resilience capabilities, and conducting root-cause analysis for complex failures. The ideal candidate has extensive experience in developing software infrastructure for large-scale AI or HPC systems, expertise with NCCL and CUDA-aware distributed execution, and proficiency in Python and C/C++. Knowledge of GPU cluster fabrics and RDMA software stacks, along with experience with containerized clusters, is also crucial.

Skills

PyTorch, NVIDIA NeMo, Megatron_TRLM, TensorRT-LLM, Nsight Systems, NCCL, CUDA, RDMA, IB verbs, UCX, libfabric, NVLink, NVSwitch, PCIe, RoCE, InfiniBand, Python, C++, Docker, CI/CD

What you'll do

Lead the bring-up, validation, and debugging of large-scale AI clusters and workloads.
Tune and benchmark AI pre-training, post-training, and inference workloads using NVIDIA software stacks.
Profile and optimize end-to-end workload performance across compute, memory, and networking layers.
Analyze scaling efficiency for distributed LLM workloads and provide tuning guidance.
Own root-cause analysis of complex failures in large distributed environments.
Define and build a resilience and failure-attribution stack for detecting and triaging node failures.

What we're looking for

8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including technical leadership.
Expertise in debugging and triaging AI applications across the full stack, including hardware and application layers.
Deep hands-on experience with NCCL, CUDA-aware distributed execution, and multi-GPU/multi-node workload scaling.
Proven track record of architecting, debugging, and scaling large distributed systems.
Expert-level Python and C/C++ programming skills for AI software development.
Experience building acceptance tests, benchmark harnesses, and qualification tooling for AI platforms.

Similar roles

Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote (Santa Clara, CA) · 3 days ago · $116,000–$189,750

PyTorch, NVIDIA NeMo, Megatron_TRLM, TensorRT-LLM, CUDA, NCCL, RDMA, InfiniBand, RoCE, UCX, libfabric, MLPerf, CI/CD, Docker, Kubernetes, Prometheus, Grafana

Senior Staff AI Platform Engineer

Nvidia

Santa Clara, CA · 75 days ago · $168,000–$270,250

Python, Kubernetes, C++, Go, Rust, MLOps, Hugging Face, Weights & Biases, NVIDIA NIM, Prometheus, Grafana, Docker, CI/CD, AWS, Azure, Google Cloud Platform, PostgreSQL, MySQL, Redis, Git, GitHub, Jenkins, Terraform, Ansible, Knative, OpenTelemetry, FedRAMP, SOC 2
