Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA · Austin, TX · OR · WA · Redmond, WA
Salary: $116,000–$189,750 / yr
Posted: 3 days ago

Market check

Salary context

Below market

How this pay compares to similar roles

[Chart: pay scale from $99k to $272k, with the range most similar roles pay shaded. Similar roles midpoint: $215k. This role midpoint: $153k.]

This role pays less than 85% of similar roles. Most similar roles pay $183,643–$246,150 (the shaded band above). At the midpoint, this role pays about $153k versus about $215k for comparable roles.

Based on 240 similar postings.
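
As a rough check on the figures above, the two midpoints can be reproduced from the listed ranges. This is a minimal sketch that assumes the "about $215k" figure is the midpoint of the $183,643–$246,150 band. The listing does not say how it is computed, and the `midpoint` helper and variable names are illustrative only.

```python
def midpoint(low: float, high: float) -> float:
    """Return the arithmetic midpoint of a salary range."""
    return (low + high) / 2


# This role's listed range, from the Quick summary.
role_mid = midpoint(116_000, 189_750)

# Band that most similar roles fall in, from the Market check.
# Assumption: the listing's "about $215k" is the midpoint of this band.
similar_mid = midpoint(183_643, 246_150)

# Gap between the two midpoints, in dollars and as a share of the similar-role midpoint.
gap = similar_mid - role_mid
gap_pct = gap / similar_mid * 100

print(f"This role midpoint:     ${role_mid:,.0f}")
print(f"Similar roles midpoint: ${similar_mid:,.0f}")
print(f"Gap: ${gap:,.0f} ({gap_pct:.1f}% below similar roles)")
```

Both midpoints round to the $153k and $215k quoted above, which suggests the comparison is a plain midpoint-to-midpoint one.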

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Software Engineer, DGX Cloud AI Infrastructure

NVIDIA seeks a Software Engineer to join its AI systems team, focusing on the bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across multi-GPU and multi-node deployments. This hands-on role involves validating large-scale AI clusters and tuning and benchmarking pre-training and post-training workloads using PyTorch, NeMo/Megatron, TensorRT-LLM, and other NVIDIA software stacks, while also contributing to resilience and failure-attribution tooling. The ideal candidate has a background in multi-GPU or multi-node workloads and CUDA-aware distributed execution, plus experience with NCCL, RDMA software stack debugging, and containerized cluster environments. Proficiency in Python and C++ and familiarity with MLPerf benchmarks are essential, as are strong analytical skills and the ability to collaborate effectively across teams.

What you'll do

  • Bring up, validate, and debug large-scale AI clusters and end-to-end workloads.
  • Tune and benchmark AI pre-training and inference workloads using NVIDIA software stacks.
  • Perform root-cause analysis of failures in distributed environments.
  • Contribute to resilience tooling for detecting and attributing node and workload failures.
  • Build repeatable benchmark suites and qualification workflows on new platforms.
  • Deliver data-driven recommendations based on profiling, benchmark results, and cluster characterization.

What we're looking for

  • 3+ years of experience developing software for AI or HPC systems.
  • Hands-on experience with multi-GPU/multi-node workloads and CUDA-aware distributed execution.
  • Background in debugging and scaling distributed systems across the full stack.
  • Experience operating containerized cluster environments for scheduled workloads.
  • Deep familiarity with RDMA software stacks and InfiniBand/RoCE congestion debugging.
  • Experience building acceptance tests, benchmark harnesses, and qualification tooling for AI platforms.

More like this

Similar roles

Senior Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote (Santa Clara, CA) · 2 days ago · $184,000–$287,500
PyTorch NVIDIA_NeMo Megatron_TRLM TensorRT-LLM Nsight_Systems NCCL CUDA RDMA IB_verbs UCX libfabric NVLink NVSwitch PCIe RoCE InfiniBand Python C++ Docker CI/CD
Remote

Lead AI Cloud Platform Engineer

Allstate

Remote (USA - IL (Remote), US) · 86 days ago · $110,000–$181,025
AWS Azure Google Vertex AI Terraform Python GCP OpenAI Agentic AI CI/CD Infrastructure as Code Kubernetes Docker Prometheus PostgreSQL
Remote

AI Platform Engineer (Google Cloud Platform)

The Hartford

Hartford, CT · 8 days ago · $117,200–$175,800
Python Terraform Google Cloud Platform BigQuery Cloud Functions AI Platform API Gateway GKE Docker CI/CD Agile methodologies NoSQL ETL Chatbots HAI IR Vector Embedding Hybrid/Semantic Search HNSW Product Quantization LLM orchestration frameworks Generative AI Guardrails Responsible AI Adversarial Attack Mitigation Natural Language Processing Deep Learning AWS
Hybrid

Software Engineer, AI & Data Platforms

Apple Inc

Austin, TX · 31 days ago
Python Go Docker AWS Azure Google Cloud Kubernetes Terraform VS Code TypeScript Node.js JetBrains IDEs IntelliJ Platform SDK Git CI/CD Prometheus Grafana

Software Engineer (AI/GenAI Platforms)

Allstate

Charlotte Railyard · 71 days ago · $85,000–$145,075
Python AWS Java LangChain Hugging Face OpenAI Amazon SageMaker MongoDB Atlas Amazon DocumentDB Apache Kafka Datadog AWS CloudWatch CI/CD LLMs RAG Vector Search & Embeddings Multimodal AI Prompt Engineering Semantic Models