Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

Nvidia

Remote

Quick summary

Work type: Remote
Location: Remote
Salary: $224,000–$356,500 / yr
Posted: 13 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: similar roles at about $192k and this role at about $290k, on a scale from $126k to $381k]

This role pays more than 98% of similar roles, most of which pay $160,000–$223,750. At the midpoint, this role pays about $290k versus about $192k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Senior Software Engineer, Distributed Systems Engineer - DGX Cloud

NVIDIA seeks experienced software engineers to join its DGX Cloud team, which is responsible for scaling GPU clusters for AI workloads. In this role, you will design and develop a distributed platform that monitors and optimizes GPU performance, ensuring reliability and consistency across large-scale production systems. You’ll collaborate with cross-functional teams to improve system resilience through incident management processes. Ideal candidates have over 12 years of software engineering experience in technical organizations, proficiency in a language such as Go or Python, and expertise in Kubernetes, Slurm, and other cluster management tools. A strong background in managing large-scale distributed systems and event-driven architectures is essential. The role calls for creativity, a passion for GPUs, and a commitment to excellence in AI infrastructure.

What you'll do

  • Design and develop a scalable platform to identify and fix underperforming GPU assets.
  • Ensure AI clusters run reliably with maximum performance across various workloads.
  • Evaluate system failures and implement improvements using an incident management process.
  • Collaborate with cross-functional teams to enhance the reliability of production systems.
  • Manage and automate large-scale distributed systems, including Kubernetes and Slurm.

What we're looking for

  • Significant software engineering experience with cluster operations and GPU resource scheduling.
  • Direct experience in a technical organization with demonstrable impact on large-scale production systems.
  • Proficiency in managing and automating large-scale distributed systems using Kubernetes or similar tools.
  • Strong communication skills for working effectively across multi-functional teams and geographies.
  • Strong programming skills in Go or Python and a solid grasp of data structures.
  • Proven operational excellence in maintaining reliable and performant infrastructure.
  • Experience with asynchronous workflows and event-driven architecture.

More like this

Similar roles

Senior Production Engineer - DGX Cloud

Nvidia

Remote (CA) · 7 days ago · $168,000–$270,250
Kubernetes Python Go Docker CI/CD Prometheus Grafana Terraform AWS Azure Slurm Bright_Cluster_Manager PostgreSQL Redis Git Jenkins Ansible Zabbix Nagios Fluentd

Senior Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote (Santa Clara, CA) · 2 days ago · $184,000–$287,500
PyTorch NVIDIA_NeMo Megatron_TRLM TensorRT-LLM Nsight_Systems NCCL CUDA RDMA IB_verbs UCX libfabric NVLink NVSwitch PCIe RoCE InfiniBand Python C++ Docker CI/CD

Senior Software Engineer - Distributed Systems

Apple Inc

Cupertino, CA · 44 days ago · $147,400–$272,100
Go Rust Scala Kubernetes Docker CI/CD Prometheus Grafana PostgreSQL Redis AWS Azure GoogleCloud Git Jenkins Python JavaScript React Node.js REST GraphQL

Senior Network Engineer - DGX Cloud

Nvidia

Remote (Santa Clara, CA) · 9 days ago · $168,000–$264,500
MP-BGP OSPF ISIS VRF VxLAN EVPN QoS GRE IPSEC DNS MACsec PNI Transit Exchange Passive DWDM Wave circuits Python Shell Arista Cumulus OS Fortinet OS ZTP CI/CD