Senior DGX Cloud AI Infrastructure Software Engineer

Nvidia

Remote · Actively hiring
Santa Clara, CA · Austin, TX · OR · WA · Redmond, WA · Posted 62 days ago · $184,000–$287,500 / year

At a glance

AI generated

TL;DR

As a senior AI infrastructure software engineer on NVIDIA’s DGX Cloud AI Efficiency Team, you will develop and maintain the tools that improve the efficiency and resiliency of large-scale AI workloads. Responsibilities include building software that keeps AI systems highly available, co-designing APIs with NVIDIA’s resiliency stacks, strengthening the infrastructure underpinning NVIDIA’s AI platforms, and defining reliability metrics to track system stability. Ideal candidates have 8+ years of experience building scalable distributed systems for AI, proficiency in Python, C/C++, and scripting languages, and experience with observability tools such as ELK, Prometheus, and Loki. Familiarity with RDMA software stacks such as NCCL and UCX is beneficial. The role involves work on technologies that drive advances in AI and data science, in a collaborative environment focused on iterative improvement and risk-taking.

Skills

Python · C/C++ · Prometheus · Loki · ELK · CI/CD · Git · PyTorch · TensorFlow · JAX · Ray · NCCL · IB verbs · UCX · libfabric · Docker · Kubernetes · AWS · GCP · Azure

What you'll do

  • Develop infrastructure software and tools for large-scale AI pre-training, post-training, and inference.
  • Optimize tools and libraries to enhance efficiency and resiliency of AI workloads.
  • Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
  • Enhance infrastructure underpinning NVIDIA’s AI platforms to support scalability and reliability.
  • Define and track reliability metrics to improve system and service stability.
  • Analyze failures from the application level down to the hardware level to identify root causes.

What we're looking for

  • 8+ years of experience developing software infrastructure for large-scale AI systems
  • Strong debugging and root cause analysis skills across application to hardware levels
  • Experience with observability platforms like ELK, Prometheus, and Loki
  • Proven track record in building and scaling distributed systems
  • Proficiency in Python, C/C++, and scripting languages
  • Expertise in quality software engineering practices including testing and CI/CD
  • Background in working with large-scale clusters and RDMA software stacks

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: similar roles $199k · this role $236k · scale $143k to $303k, with a shaded band where most similar roles pay]

This role pays more than 77% of similar roles. Most pay $162,000–$235,750 — the shaded band above. At the midpoint, this role pays about $236k versus about $199k for comparable roles.

Based on 239 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 824 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 812 roles with salary data.

More like this

Similar roles

Senior Staff AI Platform Engineer

Nvidia

Santa Clara, CA · 72 days ago · $168,000–$270,250
Python · Kubernetes · C++ · Go · Rust · MLOps · Hugging Face · Weights & Biases · NVIDIA NIM · Prometheus · Grafana · Docker · CI/CD · AWS · Azure · Google Cloud Platform · PostgreSQL · MySQL · Redis · Git · GitHub · Jenkins · Terraform · Ansible · Knative · OpenTelemetry · FedRAMP · SOC 2

Senior AI Cloud Platform Engineer

Allstate

Remote (IL, US) · 83 days ago · $85,000–$145,075
Vertex AI · Azure AI Foundry · AWS Bedrock · OpenAI · Agentic AI · Terraform · Python · Google Cloud Platform · Azure · AWS Cloud Services · Infrastructure as Code · CI/CD · Prompt Engineering · LLM API Consumption

Lead AI Cloud Platform Engineer

Allstate

Remote (IL, US) · 83 days ago · $110,000–$181,025
AWS · Azure · Google Vertex AI · Terraform · Python · GCP · OpenAI · Agentic AI · CI/CD · Infrastructure as Code · Kubernetes · Docker · Prometheus · PostgreSQL

Senior AI Platform Engineer

Adobe

San Jose · 79 days ago · $211,800–$306,625
TypeScript · Python · Java · Go · C++ · LLMs · Terraform · Kubernetes · Docker · CI/CD · Prometheus · Grafana · PostgreSQL · Redis · Elasticsearch · AWS · Azure · Google Cloud Platform · Git · Jenkins · GitHub · Slack · Confluence · Jira · Swagger · OpenAPI · GraphQL · RESTful APIs · Microservices · MVP

Principal Software Engineer, DGX Cloud Production Engineering

Nvidia

Remote (Santa Clara, CA) · 16 days ago · $272,000–$431,250
Kubernetes · Go · Python · GitOps · Linux · Docker · Terraform · CI/CD · Prometheus · Grafana · PostgreSQL · AWS · Azure · Google Cloud Platform · GPU · AI · ML · SLOs · observability · incident response · automation · BMaaS · VMaaS