Senior Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote

Quick summary

Work type
Remote
Location
Santa Clara, CA · Austin, TX · OR · Redmond, WA
Salary
$184,000–$287,500 / yr
Posted
2 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

Similar roles (midpoint): $214k · This role (midpoint): $236k · Chart range: $165k–$301k

This role pays more than 71% of similar roles, most of which pay $182,125–$246,150. At the midpoint, this role pays about $236k, versus about $214k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior Software Engineer, DGX Cloud AI Infrastructure

NVIDIA seeks a Senior Software Engineer to lead the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across its GPU platforms at massive scale. This hands-on role sets technical direction for communication libraries, model frameworks, and software stacks so that large language models run efficiently and reliably. Responsibilities include deep performance and reliability investigations on multi-GPU and multi-node deployments, defining benchmarking criteria, building resilience capabilities, and conducting root-cause analysis of complex failures. The ideal candidate has extensive experience developing software infrastructure for large-scale AI or HPC systems, expertise with NCCL and CUDA-aware distributed execution, and proficiency in Python and C/C++. Knowledge of GPU cluster fabrics and RDMA software stacks, along with experience running containerized clusters, is also important.

What you'll do

  • Lead the bring-up, validation, and debugging of large-scale AI clusters and workloads.
  • Tune and benchmark AI pre-training, post-training, and inference workloads using NVIDIA software stacks.
  • Profile and optimize end-to-end workload performance across compute, memory, and networking layers.
  • Analyze scaling efficiency for distributed LLM workloads and provide tuning guidance.
  • Own root-cause analysis of complex failures in large distributed environments.
  • Define and build a resilience and failure-attribution stack for detecting and triaging node failures.

What we're looking for

  • 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including technical leadership.
  • Expertise in debugging and triaging AI applications across the full stack, including hardware and application layers.
  • Deep hands-on experience with NCCL, CUDA-aware distributed execution, and multi-GPU/multi-node workload scaling.
  • Proven track record of architecting, debugging, and scaling large-scale distributed systems.
  • Expert-level Python and C/C++ programming skills for AI software development.
  • Experience building acceptance tests, benchmark harnesses, and qualification tooling for AI platforms.

More like this

Similar roles

Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote (Santa Clara, CA) · 3 days ago · $116,000–$189,750
PyTorch · NVIDIA NeMo · Megatron_TRLM · TensorRT-LLM · CUDA · NCCL · RDMA · InfiniBand · RoCE · UCX · libfabric · MLPerf · CI/CD · Docker · Kubernetes · Prometheus · Grafana

Senior Staff AI Platform Engineer

Nvidia

Santa Clara, CA · 75 days ago · $168,000–$270,250
Python · Kubernetes · C++ · Go · Rust · MLOps · Hugging Face · Weights & Biases · NVIDIA NIM · Prometheus · Grafana · Docker · CI/CD · AWS · Azure · Google Cloud Platform · PostgreSQL · MySQL · Redis · Git · GitHub · Jenkins · Terraform · Ansible · Knative · OpenTelemetry · FedRAMP · SOC 2

Lead AI Cloud Platform Engineer

Allstate

Remote (IL, US) · 86 days ago · $110,000–$181,025
AWS · Azure · Google Vertex AI · Terraform · Python · GCP · OpenAI · Agentic AI · CI/CD · Infrastructure as Code · Kubernetes · Docker · Prometheus · PostgreSQL