Lead Software Engineer, Fleet Management - DGX Cloud

Nvidia

Remote

Quick summary

Work type: Remote
Location: Seattle, WA · Santa Clara, CA
Salary: $224,000–$356,500 / yr
Posted: 47 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $167k to $377k, marking the similar-role midpoint ($209k) and this role's midpoint ($290k), with a shaded band where most similar roles pay.]

This role pays more than 97% of similar roles. Most pay $187,390–$230,400 — the shaded band above. At the midpoint, this role pays about $290k versus about $209k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 985 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 971 roles with salary data.

At a glance

TL;DR · Lead Software Engineer, Fleet Management - DGX Cloud

The Lead Software Engineer role at NVIDIA’s DGX Cloud team involves designing and leading the development of scalable cloud services for high-performance GPU infrastructure in datacenters. Day-to-day responsibilities include technical leadership over a team, creating RESTful APIs to ingest telemetry data, building and managing large-scale data pipelines, and optimizing operational efficiency across global cloud operations. The ideal candidate should have extensive experience with PostgreSQL-compatible databases, proficiency in Go or Python, familiarity with modern JavaScript frameworks like React or Angular, and expertise in cloud infrastructure such as AWS, GCP, Azure, Docker, and Kubernetes. Additionally, the role requires a deep understanding of high-scale distributed systems and strong communication skills to collaborate effectively on complex operational challenges within the fast-growing AI and cloud computing domain.

What you'll do

  • Act as technical lead for designing cloud services backed by databases and data warehouses.
  • Design and develop RESTful APIs to ingest telemetry from AI datacenters.
  • Build scalable cloud services for high-volume ingestion, processing, and storage of large datasets.
  • Build and manage data pipelines for online and offline data storage.
  • Optimize the reliability and efficiency of cloud services and operations.
  • Lead impactful technical projects ensuring quality and scalability at every stage.

What we're looking for

  • 12+ years of industry experience and a Bachelor’s or Master’s degree in a relevant field.
  • Expertise in building scalable REST APIs using Go or Python backed by PostgreSQL-compatible data stores.
  • Proficiency in modern JavaScript frameworks (React, Angular, Next.js) and cloud infrastructure technologies (AWS, GCP, Azure).
  • Deep knowledge of container technologies like Docker and Kubernetes, and high-scale distributed systems architecture.
  • Strong leadership experience delivering cloud services at Internet scale, with a focus on reliability and efficiency.
  • Familiarity with Linux operating systems and hands-on experience operating NVIDIA datacenter GPUs.

More like this

Similar roles

Principal Software Engineer, DGX Cloud Production Engineering

Nvidia

Remote (Santa Clara, CA) · 19 days ago · $272,000–$431,250
Kubernetes, Go, Python, GitOps, Linux, Docker, Terraform, CI/CD, Prometheus, Grafana, PostgreSQL, AWS, Azure, Google Cloud Platform, GPU, AI, ML, SLOs, observability, incident response, automation, BMaaS, VMaaS

Senior Full Stack Software Engineer - DGX Cloud

Nvidia

Remote (Us, Nc, Remote, US) · 14 days ago · $224,000–$356,500
React, TypeScript, JavaScript, Golang, PostgreSQL, Kubernetes, SQL, CI/CD, Bazel, Temporal, Slurm, Docker, Prometheus, Git, Linux, Python, GraphQL

Senior Technical Program Manager, DGX Cloud Software Products and Services

Nvidia

Santa Clara, CA · 33 days ago · $168,000–$258,750
Jira, Aha!, Confluence, Git, Distributed version control systems, Reliability engineering, Resilience development, Service performance metrics, Goodput, Efficiency, Utilization, Distributed training frameworks, Checkpointing, NCCL, Slurm, AI infrastructure, Large-scale compute platforms, CI/CD