Senior Solutions Architect, AI Cluster Performance and Telemetry

Nvidia

Quick summary

Work type: On-site
Location: Santa Clara, CA · Austin, TX
Salary: $184,000–$287,500 / yr
Posted: 2 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay comparison on an axis from $167k to $300k, with a shaded band marking where most similar roles pay. Similar roles: $214k. This role: $236k.]

This role pays more than 73% of similar roles. Most similar roles pay $180,906–$246,150, the shaded band in the pay-comparison chart. At the midpoint, this role pays about $236k versus about $214k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 985 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 971 roles with salary data.

At a glance

TL;DR · Senior Solutions Architect, AI Cluster Performance and Telemetry

As a Senior Solutions Architect specializing in Data Center Systems & Performance, you will join our team and serve as a technical expert bridging engineering, field teams, and customers with demanding requirements. Your primary responsibilities include identifying and resolving complex performance bottlenecks across GPU, CPU, and networking systems, maintaining benchmarking suites that stress-test high-performance clusters, and using industry-standard tools to monitor hardware performance counters and extract system telemetry. You will work closely with internal engineering teams and external partners to develop solutions that improve infrastructure performance. The ideal candidate has a strong background in system build, performance analysis, and technical customer-facing roles, along with expertise in CPUs, GPUs, high-speed networking fabrics, and tools such as Perf, eBPF, Prometheus, Grafana, Docker, Kubernetes, and SLURM. You should also have experience optimizing distributed AI training workloads and integrating agentic AI frameworks to automate cluster triage.

What you'll do

  • Analyze and resolve complex performance bottlenecks in GPU, CPU, and networking systems.
  • Develop and maintain benchmarking suites for high-performance clusters.
  • Use industry-standard tools to monitor hardware performance counters and extract system telemetry.
  • Investigate configurations to identify and fix issues impacting peak performance.
  • Collaborate with internal teams and customers to enhance infrastructure performance.

What we're looking for

  • 8+ years of industry experience in system build, performance analysis, and technical customer-facing roles.
  • Strong understanding of CPU-GPU interactions and high-speed networking fabrics in massive clusters.
  • Practical experience with performance tools like Perf, eBPF, Prometheus, and Grafana.
  • Experience working with containers, cloud provisioning, and scheduling tools such as Docker, Kubernetes, and SLURM.
  • Ability to transform raw telemetry into structured time series data and create actionable narratives.
  • Deep knowledge of multi-GPU communication libraries and NVIDIA hardware architectures.
  • Practical experience optimizing distributed AI training workloads and integrating agentic AI frameworks.

More like this

Similar roles

Senior Solutions Architect - Enterprise AI

Nvidia

Remote (Santa Clara, CA) · 9 days ago · $184,000–$287,500
Ethernet, InfiniBand, SDN, Kubernetes, Docker, AWS, Azure, GCP, CI/CD, Prometheus, PostgreSQL, Terraform, Cisco, Arista, Cloud Engineer, Juniper Networks, TCO analysis, Python, Ansible

Senior Solutions Architect, AI Infrastructure

Nvidia

Austin, TX · 10 days ago · $184,000–$287,500
NVIDIA Ethernet, InfiniBand, GPUs, CPUs, PCIe, DPUs, NICs, HCAs, switches, rack-scale design, system hardware architecture, kernel drivers, PCIe devices, AI data centers, CI/CD

Senior Solutions Architect, AI Infrastructure

Nvidia

Austin, TX · 8 days ago · $184,000–$287,500
NVIDIA Ethernet, InfiniBand, GPUs, CPUs, PCIe, DPUs, NICs, HCAs, switches, rack-scale design, large-scale GPU infra deployments, system hardware architecture, kernel drivers, PCIe devices, AI data center networking

Senior Solution Architect, AI Infrastructure

Nvidia

Remote (US, DC) · 26 days ago · $184,000–$287,500
NVIDIA GPUs, NVIDIA Networking, InfiniBand, Ethernet, NCCL, DCGM, UFM, Mission Control, Base Command Manager, AI solutions, High Performance Computing, Networking, Python, CI/CD, Git, AWS, Azure, Grafana, Prometheus

Senior Solutions Architect, AI Hyperscalers

Nvidia

Remote (Canada) · 24 days ago · $184,000–$287,500
Python, CUDA, PyTorch, JAX, Linux, Docker, Kubernetes, HPC, GPU, Distributed Training, Inference Optimization, Vector Databases, RAG Pipelines, Multi-node Clusters, Deep Learning Frameworks