Senior Solutions Architect, AI Factory Observability and Visualization

Nvidia

Remote

Quick summary

Work type: Remote
Location: Austin, TX; Durham, NC; Santa Clara, CA
Salary: $184,000–$287,500 / yr
Posted: 4 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $158k to $301k, with a shaded band marking where most similar roles pay; similar roles at $212k, this role at $236k]

This role pays more than 70% of similar roles. Most pay $177,250–$246,150 — the shaded band above. At the midpoint, this role pays about $236k versus about $212k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 950 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 939 roles with salary data.

At a glance

TL;DR · Senior Solutions Architect, AI Factory Observability and Visualization

NVIDIA's Infrastructure Specialists team is seeking a Senior Solutions Architect specializing in AI Factory Observability & Visualization to develop comprehensive visibility for HPC systems and AI factories. This role involves running microbenchmarks and workloads to assess system health, establishing observability metrics, and building telemetry across hardware, fabric, and workload layers. The architect will automate data collection and transformation using Python and Shell scripts, collaborate with cross-functional teams to ensure readiness for customer deployment, and recommend improvements to enhance visibility and performance insights. Candidates should have a strong background in managing Linux-based systems in HPC or AI settings, experience with observability tools like Prometheus and Grafana, and proficiency in GPU and fabric telemetry. The ideal candidate will also possess hands-on expertise in multi-GPU clusters and distributed system architecture, contributing to the seamless operation of large-scale AI infrastructure.

What you'll do

  • Run and interpret microbenchmarks and workloads to assess system health and performance.
  • Establish comprehensive metrics and thresholds for system readiness across hardware and software.
  • Build telemetry systems to collect, transform, store, and present complex data.
  • Serve as the observability expert, identifying gaps in visibility and ensuring that system behavior is represented accurately.
  • Develop automation scripts using Python and Shell for data collection and presentation.

What we're looking for

  • 6+ years managing Linux-based HPC/AI systems.
  • Comprehensive understanding of multi-GPU, multi-node clusters and network architecture.
  • Proficiency in Python and Shell scripting for automation.
  • Experience with observability tools like Prometheus and Grafana.
  • Expertise in transforming telemetry data into actionable insights.
  • Familiarity with GPU and fabric telemetry for performance diagnosis.

More like this

Similar roles

Senior Solutions Architect - AI Factory Deployment

Nvidia

Remote (Austin, TX) +2 · 60 days ago · $184,000–$287,500
Linux, Python, Shell, NCCL, AllReduce, AllToAll, PyTorch, TensorFlow, Bash, Benchmarking, Metrics, Messaging Systems, Logging, Tracing, CI/CD, HPC, GPU Clusters, Distributed Systems, Observability, Automation
Remote

Senior AI Compute Engineer

Nvidia

Remote (Santa Clara, CA) · 76 days ago · $148,000–$235,750
Linux, Bash, Python, Ansible, SLURM, LSF, UGE, Kubernetes, HPL, NCCL, MLPerf, InfiniBand, MPI, Lustre, GPFS, BCM, Terraform, CI/CD
Remote

Senior Solution Architect, AI Infrastructure

Nvidia

Remote (DC) · 48 days ago · $184,000–$287,500
NVIDIA GPUs, NVIDIA Networking, RoCE, InfiniBand, NCCL, DCGM, UFM, Mission Control, Base Command Manager, High Performance Computing, AI Solutions, CI/CD, Python, PostgreSQL, Kubernetes, AWS, Docker, Prometheus, Grafana
Remote

Senior Solutions Architect, AI Infrastructure

Nvidia

Austin, TX +3 · 32 days ago · $184,000–$287,500
NVIDIA Ethernet, InfiniBand, GPUs, CPUs, PCIe, DPUs, NICs, HCAs, switches, rack scale design, system hardware architecture, kernel drivers, PCIe devices, AI data centers, POCs for AI solutions