Senior Solutions Architect - AI Factory Deployment

Nvidia

Remote

Quick summary

Work type: Remote
Location: Austin, TX; Durham, NC; Santa Clara, CA
Salary: $184,000–$287,500 / yr
Posted: 46 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

Chart: similar roles at about $208k and this role at about $236k, on a scale from $156k to $302k, with a shaded band marking where most similar roles pay.

This role pays more than 70% of similar roles. Most pay $170,000–$246,150 — the shaded band above. At the midpoint, this role pays about $236k versus about $208k for comparable roles.

Based on 240 similar postings.
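
The midpoint figures quoted above can be reproduced with simple arithmetic. The sketch below assumes that "midpoint" means the average of a range's low and high ends, and that the roughly $208k comparison figure is the midpoint of the $170,000–$246,150 band; the listing does not state either explicitly. The function name `midpoint` is illustrative only.

```python
def midpoint(low: float, high: float) -> float:
    """Average of the low and high ends of a salary range."""
    return (low + high) / 2


# Range advertised for this role.
role_mid = midpoint(184_000, 287_500)      # 235750.0, about $236k

# Band where most similar roles pay (assumed basis for the ~$208k figure).
similar_mid = midpoint(170_000, 246_150)   # 208075.0, about $208k

# Difference between the two midpoints.
gap = role_mid - similar_mid               # 27675.0
```

Under these assumptions the role's midpoint sits roughly $28k above that of comparable roles.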

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 967 open roles on FindRole.

Listed pay typically runs $168,000–$270,250 across 950 roles with salary data.

At a glance

TL;DR · Senior Solutions Architect - AI Factory Deployment

The Senior Solutions Architect - AI Factory Deployment role within NVIDIA’s Infrastructure Specialists team in Santa Clara involves developing and deploying end-to-end AI factories. Day-to-day responsibilities include setting up and validating multi-GPU and multi-node Linux clusters for AI/LLM workloads, ensuring optimal performance through NCCL and collective communication patterns like AllReduce and AllToAll. The candidate will also build observability tools and automation scripts in Python and Shell to monitor and optimize benchmarks, collaborating with hardware and software teams to prepare AI factories for customer deployment. Essential skills include extensive experience managing Linux-based systems in HPC or distributed settings, proficiency with PyTorch or TensorFlow, and a solid understanding of collective communication patterns in modern ML/LLM training.

What you'll do

  • Set up and verify AI factory environments on multi-GPU Linux clusters.
  • Execute key AI/LLM benchmarks and analyze results for performance optimization.
  • Investigate and resolve issues in training jobs or benchmarks that fail or underperform.
  • Build observability tools to monitor workload behavior and system health.
  • Develop automation scripts for running benchmarks and collecting results.
  • Recommend changes to improve throughput, latency, and scaling efficiency of AI workloads.

What we're looking for

  • Over 6 years of experience managing Linux-based systems in HPC or large-scale AI/ML settings.
  • Hands-on experience with multi-GPU/multi-node clusters and NCCL.
  • Solid understanding of collective communication patterns like AllReduce and AllToAll.
  • Proficiency in Python and Shell/Bash for scripting, automation, and tooling.
  • Experience running benchmarks and interpreting the performance results.
  • Comfortable working with observability data to troubleshoot complex distributed workloads.
  • Strong cross-functional team collaboration and communication skills.
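
For readers unfamiliar with the collective communication patterns named in the requirements, the following is a minimal single-process sketch of what AllReduce and AllToAll compute. It simulates ranks as plain Python lists and does not use NCCL or any GPU library; the function names `all_reduce` and `all_to_all` are illustrative, not a real API.

```python
def all_reduce(rank_buffers):
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(values) for values in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]


def all_to_all(rank_buffers):
    """Rank i sends its j-th chunk to rank j, so the result is a transpose.

    Assumes each rank holds exactly one chunk per rank.
    """
    n = len(rank_buffers)
    return [[rank_buffers[src][dst] for src in range(n)] for dst in range(n)]


# Three simulated ranks, each holding three values.
buffers = [
    [1, 2, 3],        # rank 0
    [10, 20, 30],     # rank 1
    [100, 200, 300],  # rank 2
]

reduced = all_reduce(buffers)    # every rank holds [111, 222, 333]
exchanged = all_to_all(buffers)  # rank 0 holds [1, 10, 100], and so on
```

In training workloads, AllReduce is commonly used to sum gradients across data-parallel workers, while AllToAll is commonly used to shuffle data between workers, for example in expert-parallel layers.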

More like this

Similar roles

Senior AI Solutions Architect

Nvidia

Remote (Santa Clara, CA) · 6 days ago · $152,000–$241,500
Python, C/C++, PyTorch, TensorFlow, Kubernetes, GitHub, NVIDIA CUDA, Docker, Prometheus, Grafana, CI/CD, PostgreSQL, AWS, Azure, MLOps

Senior Solutions Architect, AI Infrastructure

Nvidia

Austin, TX +3 · 18 days ago · $184,000–$287,500
NVIDIA Ethernet, InfiniBand, GPUs, CPUs, PCIe, DPUs, NICs, HCAs, switches, rack-scale design, system hardware architecture, kernel drivers, PCIe devices, AI data centers, CI/CD

Senior Solutions Architect, AI Infrastructure

Nvidia

Austin, TX +3 · 16 days ago · $184,000–$287,500
NVIDIA Ethernet, InfiniBand, GPUs, CPUs, PCIe, DPUs, NICs, HCAs, switches, rack-scale design, large-scale GPU infra deployments, system hardware architecture, kernel drivers, PCIe devices, AI data center networking

Senior Solution Architect, AI Infrastructure

Nvidia

Remote (Us, Dc, Remote, US) · 34 days ago · $184,000–$287,500
NVIDIA GPUs, NVIDIA Networking, InfiniBand, Ethernet, NCCL, DCGM, UFM, Mission Control, Base Command Manager, AI solutions, High Performance Computing, Networking, Python, CI/CD, Git, AWS, Azure, Grafana, Prometheus