Principal Software Engineer, Distributed Systems Engineer - DGX Cloud

Nvidia

Durham, US · Posted 16 days ago · $248,000–$396,750 / year

At a glance

AI generated

TL;DR

NVIDIA seeks an experienced Software Engineer with Kubernetes expertise to join its DGX Cloud team, responsible for scaling AI infrastructure through advanced GPU resource management and monitoring. This role involves developing custom software for scheduling GPU resources on Kubernetes, implementing health management capabilities, and ensuring the reliability and scalability of large-scale GPU clusters. You will work closely with cross-functional teams to optimize performance and address system failures using a well-defined incident management process. Ideal candidates have over 15 years of experience in similar roles, extensive knowledge of Kubernetes APIs and frameworks, and proficiency in systems programming languages like Go or Python. The position requires deep technical skills in managing distributed systems, cluster management tools such as Kubernetes and Slurm, and a solid understanding of data structures and algorithms to drive innovation in AI infrastructure solutions.

Skills

Kubernetes, Python, Go, Docker, CI/CD, Prometheus, Grafana, Terraform, AWS, PostgreSQL, Slurm, Bright Cluster Manager, Git, Jenkins, Ansible, Bash, C++, CUDA

What you'll do

  • Develop custom software for scheduling GPU resources on Kubernetes clusters.
  • Implement monitoring and health management capabilities for reliable GPU asset usage.
  • Evaluate system failures to improve services using a well-defined incident process.
  • Work with cross-functional teams to ensure consistent performance in AI clusters.
  • Utilize multiple data streams, including GPU diagnostics and network telemetry.

What we're looking for

  • Significant software engineering experience with Kubernetes in production environments.
  • Expertise in cluster operations, operator development, and GPU resource scheduling.
  • Strong background in monitoring and health management for scalable GPU clusters.
  • Experience implementing custom software related to Kubernetes API and frameworks.
  • BS in Computer Science, Engineering, Physics, Mathematics or equivalent technical degree.
  • Proficiency in systems programming languages like Go or Python and data structures.

Market check

Salary context

This $248,000–$396,750 range sits above 95% of similar postings on FindRole.

Peer median band

$147,030–$258,125

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$171,761–$235,750

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Full Stack Software Engineer - DGX Cloud

Nvidia

Remote (US, NC, Remote, US) · 11 days ago · $224,000–$356,500
React TypeScript JavaScript Golang PostgreSQL Kubernetes SQL CI/CD Bazel Temporal Slurm Docker Prometheus Git Linux Python GraphQL

Principal Software Engineer - Compute Infrastructure

Nvidia

Remote (US, CA, Santa Clara, US) · 16 days ago · $248,000–$391,000
Kubernetes OpenShift Terraform Go Python GitOps ArgoCD AWS GCP NFSv4 NVMe/TCP Hyperconverged storage CI/CD Microservices Self-service architecture SLAs

Principal Software Engineer - Networking Hyperscale Engineering

Nvidia

US, WA, Seattle, US · 14 days ago · $248,000–$391,000
C/C++ Linux DPDK RDMA RoCE NIC firmware DOCA NCCL BGP ECMP EVPN/VXLAN GPU-related networking GPUDirect RDMA High-performance computing Distributed training stacks Hyperscalers Cloud providers Open-source ecosystems

Principal Software Engineer - DGX Cloud

Nvidia

US, CA, Santa Clara, US · 30 days ago · $272,000–$431,250
Python Kubernetes Go AWS Prometheus Grafana OpenTelemetry Docker CI/CD Java CUDA cuDNN

Principal Software Engineer, DGX Cloud Production Engineering

Nvidia

Remote (US, CA, Santa Clara, US) · 11 days ago · $272,000–$431,250
Kubernetes Go Python GitOps Linux Docker Terraform CI/CD Prometheus Grafana PostgreSQL AWS Azure Google Cloud Platform GPU AI ML SLOs observability incident response automation BMaaS VMaaS