Senior AI Infrastructure Software Engineer - DGX Cloud

Nvidia

Remote · Actively hiring
Remote, USA · Santa Clara, CA · Redmond, WA · Seattle, WA · Posted 17 days ago · $184,000–$287,500 / year

At a glance

AI generated

TL;DR

As a senior AI infrastructure software engineer on NVIDIA's DGX Cloud Lepton Team, you will design, build, and maintain large-scale AI platforms that enable efficient training, inferencing, and fine-tuning of models. Your day-to-day responsibilities include developing tools to optimize AI/ML workload efficiency, analyzing failures from application to hardware levels, enhancing infrastructure for reliability, and co-designing APIs with NVIDIA's resiliency stacks. You will also define metrics to track system reliability and collaborate in a culture that values learning and iterative improvement. The role requires expertise in Python, C/C++, Kubernetes, observability platforms like ELK and Prometheus, and experience with AI frameworks such as PyTorch and TensorFlow. Additionally, knowledge of NVIDIA GPUs, RDMA networks, and cloud-native infrastructure is essential for this high-impact position at the forefront of AI innovation.

Skills

Kubernetes, Python, C, Prometheus, Loki, ELK, TensorFlow, PyTorch, JAX, Ray, NCCL, RDMA, IB, NVIDIA GPUs, CI/CD

What you'll do

  • Develop platform and tools for large-scale AI, LLM, and GenAI infrastructure.
  • Optimize tools to enhance efficiency and resiliency of AI/ML workloads.
  • Analyze failures from application level down to hardware level for root cause identification.
  • Enhance NVIDIA's AI platforms by co-designing and implementing APIs with resilience stacks.
  • Define reliability metrics to track and improve system and service stability.
  • Build and scale large distributed systems for AI applications.
  • Work on Kubernetes and observability platforms for monitoring and logging AI services.

What we're looking for

  • 8+ years of experience developing large-scale AI software infrastructure.
  • Strong background in debugging AI applications across application and hardware levels.
  • Proven track record in building, scaling, and optimizing distributed systems.
  • Expertise in AI training, inferencing, and data infrastructure services.
  • Proficiency in Kubernetes, observability platforms (ELK, Prometheus), and Python/C++.
  • Deep understanding of NVIDIA GPUs, network technologies, and DL frameworks.

Market check

Salary context

This $184,000–$287,500 range sits above 73% of similar postings on FindRole.

Peer median band

$168,000–$258,750

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$165,852–$246,150

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Software Engineer, AI Networking

Nvidia

Santa Clara, CA, US · 15 days ago · $152,000–$241,500
Python PyTorch TensorFlow JAX CUDA NCCL Reinforcement_Learning Bayesian_Optimization GNNs Docker Kubernetes CI/CD Prometheus Grafana Bash C++ PostgreSQL Redis

Senior AI Cloud Platform Engineer

Allstate

Remote (IL, US) · 78 days ago · $85,000–$145,075
VertexAI AzureAIFoundry AWSBedrock OpenAI AgenticAI Terraform Python GoogleCloudPlatform Azure AWSCloudServices InfrastructureAsCode CI/CD PromptEngineering LLMAPIConsumption

Senior AI Platform Engineer- Data and Systems

Adobe

San Jose, US · 31 days ago · $208,300–$301,600
Apache_Spark Databricks Delta_Lake Kafka Kinesis Flink Python Scala SQL AWS Azure Docker Kubernetes CI/CD MCP LangChain LLMs Feature_Stores RAG Unity_Catalog FAISS Pinecone Weaviate Semantic_layers DataHub OpenMetadata AI-powered_developer_tools

Senior Solution Architect, AI Infrastructure

Nvidia

Remote (DC, US) · 18 days ago · $184,000–$287,500
NVIDIA_GPUs NVIDIA_Networking InfiniBand Ethernet NCCL DCGM UFM Mission_Control Base_Command_Manager AI_solutions High_Performance_Computing Networking Python CI/CD Git AWS Azure Grafana Prometheus