AI Benchmarking and Telemetry Engineer - NVIS

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $184,000–$287,500 / yr
Posted: 105 days ago
Nearby: 99+ roles within 25 mi

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $143k to $303k, with a shaded band where most similar roles pay; similar roles marked at $203k, this role at $236k.]

This role pays more than 72% of similar roles. Most similar roles pay $162,000–$244,050. At the midpoint, this role pays about $236k, versus about $203k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 855 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 843 roles with salary data.

At a glance

TL;DR · AI Benchmarking and Telemetry Engineer - NVIS

Join our team as an experienced HPC/AI Benchmarking and Telemetry Engineer to drive performance insights across advanced computing infrastructure. You’ll develop and execute detailed benchmarking approaches for large-scale GPU clusters, assess metrics to identify optimization opportunities, and build telemetry frameworks capturing system performance data from host-level GPUs through network infrastructure. Collaborate with engineering teams and customers to define requirements, fix bottlenecks, and validate configurations against real-world workloads using tools like Prometheus, Grafana, and NVIDIA’s DCGM. Maintain expertise in Linux administration, GPU architectures, and scripting languages for automation and analysis. Ideal candidates have 8+ years of experience in HPC/AI infrastructure, including cluster deployment and performance analysis, with a background in high-performance networking technologies and parallel filesystems.

Skills

Prometheus, Grafana, NVIDIA DCGM, Linux system administration, Python, Bash, CUDA, Kubernetes, Docker, Slurm, InfiniBand, RoCE, Lustre, GPFS, BeeGFS, Weka, VAST, HPL, HPCG, MLPerf, NCCL

What you'll do

Develop and execute detailed benchmarking methods for large-scale GPU clusters.
Build telemetry frameworks to capture system performance data across multiple levels.
Deploy observability stacks including monitoring tools like Prometheus and Grafana.
Collaborate with teams to define performance requirements and validate configurations.
Maintain knowledge of industry-standard benchmarks in HPC and machine learning fields.
Craft and implement telemetry solutions for large-scale distributed systems.

What we're looking for

8+ years of experience with HPC and AI infrastructure, including deployment and performance analysis.
Deep expertise in Linux system administration and kernel tuning for large-scale systems.
Proven experience crafting telemetry and monitoring solutions using tools like Prometheus and Grafana.
Solid understanding of GPU architectures and CUDA programming principles for HPC/AI workloads.
Proficiency in Python and Bash scripting for automation, data analysis, and workflow orchestration.
Excellent analytical skills to interpret complex performance data and communicate findings effectively.

Similar roles

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA) · 52 days ago · $148,000–$235,750

Linux, Bash, Python, Ansible, SLURM, LSF, UGE, Kubernetes, HPL, NCCL, MLPerf, InfiniBand, MPI, Lustre, GPFS, BCM, Terraform, CI/CD

AI Systems Performance Engineer

Broadcom

San Jose, CA · 45 days ago · $141,300–$226,000

Linux, Python, C++, PyTorch, MLPerf, NCCL, Ethernet, RDMA, RoCEv2, CI/CD, Docker, Kubernetes

Applied AI Engineer

Broadcom

Palo Alto, CA · 23 days ago · $141,300–$226,000

Python, Kubernetes, Terraform, Docker, CI/CD, VMware vSphere, vSAN, NSX, Aria, AWS, GCP, Azure, PostgreSQL, MongoDB, Redis, Prometheus, Grafana, GitLab, Jenkins

Applied AI Engineer

Ramp

Remote (New York City, New York, US) · 143 days ago · $155,000–$339,500

Python, JavaScript, Node.js, Django, Flask, React, PostgreSQL, MongoDB, AWS, GCP, Kubernetes, Terraform, CI/CD, GitOps

Applied AI Engineer

Booz Allen Hamilton

Fort Belvoir, VA · 20 days ago · $99,000–$225,000

Python, FastAPI, Flask, Streamlit, Gradio, React, TypeScript, Kubernetes, CI/CD, Prometheus, Grafana, MLOps, Docker, PostgreSQL, AWS, Azure, Google Cloud Platform

| Microsoft Careers

Microsoft

US · 16 days ago

Python, C#, CI/CD, Terraform, Azure, RBAC, Managed Identities, Secrets Management, Prometheus, Grafana, Docker, Kubernetes, PostgreSQL, LLM, Prompt Engineering, RAG, Responsible AI, Test-Driven Development, Feature Flags, Staged Rollouts
