Senior AI Infrastructure Engineer - DGX Cloud

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $152,000–$241,500 / yr
Posted: 2 days ago

Market check

Salary context

Competitive pay

How this pay compares to similar roles

[Chart: salary midpoint comparison. This role: $197k. Similar roles: $207k. Scale runs from $139k to $274k, with the range where most similar roles pay shaded.]

About 60% of similar roles pay more than this one. Most pay $167,187–$246,150 (the shaded band above). At the midpoint, this role pays about $197k versus about $207k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 985 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 971 roles with salary data.

At a glance

TL;DR · Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA’s DGX Cloud group seeks a Senior AI Infrastructure Engineer to design, build, and maintain large-scale production systems for AI training and inferencing platforms. This role involves deploying internal tooling, conducting performance analysis on multi-GPU clusters, and supporting services throughout their lifecycle. The engineer will ensure system reliability through automation, capacity management, and continuous improvement, participating in an on-call rotation to handle production issues. Ideal candidates have a BS degree or equivalent experience in computer science or related fields, with 5+ years of hands-on experience in infrastructure automation, distributed systems architecture, and cloud technologies like Kubernetes and Terraform. Proficiency in Python, Go, C/C++, Java, Linux, networking, storage, and containers is essential, along with a passion for large-scale system management and performance optimization.

What you'll do

  • Design and deploy internal tooling for large-scale AI training and inferencing platforms.
  • Conduct performance analysis on multi-GPU and multi-node clusters.
  • Engage in the full lifecycle of services including design, deployment, and operation.
  • Maintain system health by monitoring availability and latency post-deployment.
  • Scale systems sustainably through automation to enhance reliability and velocity.
  • Participate in an on-call rotation for production system support.

What we're looking for

  • 5+ years of experience in infrastructure automation and distributed systems.
  • BS degree in Computer Science or a related technical field with a coding emphasis, or equivalent experience.
  • Experience with Linux, networking, storage, containers, and public cloud technologies.
  • Proficiency in Python, Go, C/C++, and Java for building large-scale production tools.
  • Expertise in Kubernetes, Terraform, and Infrastructure as Code (IaC) practices.
  • Capability to design, deploy, and maintain high-performance AI training platforms.

More like this

Similar roles

Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote (Santa Clara, CA) · 3 days ago · $116,000–$189,750
PyTorch NVIDIA_NeMo Megatron_TRLM TensorRT-LLM CUDA NCCL RDMA InfiniBand RoCE UCX libfabric MLPerf CI/CD Docker Kubernetes Prometheus Grafana

Senior Software Engineer, DGX Cloud AI Infrastructure

Nvidia

Remote (Santa Clara, CA) · 2 days ago · $184,000–$287,500
PyTorch NVIDIA_NeMo Megatron_TRLM TensorRT-LLM Nsight_Systems NCCL CUDA RDMA IB_verbs UCX libfabric NVLink NVSwitch PCIe RoCE InfiniBand Python C++ Docker CI/CD

Senior AI Cloud Platform Engineer

Allstate

Remote (IL, US) · 86 days ago · $85,000–$145,075
VertexAI AzureAIFoundry AWSBedrock OpenAI AgenticAI Terraform Python GoogleCloudPlatform Azure AWSCloudServices InfrastructureAsCode CI/CD PromptEngineering LLMAPIConsumption

Senior AI Platform Engineer - Data and Systems

Adobe

San Jose · 39 days ago · $208,300–$301,600
Apache_Spark Databricks Delta_Lake Kafka Kinesis Flink Python Scala SQL AWS Azure Docker Kubernetes CI/CD MCP LangChain LLMs Feature_Stores RAG Unity_Catalog FAISS Pinecone Weaviate Semantic_layers DataHub OpenMetadata AI-powered_developer_tools