AI Benchmarking and Telemetry Engineer - NVIS

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $184,000–$287,500 / yr
Posted: 105 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay for similar roles on a scale from $143k to $303k, with markers at $203k (similar roles) and $236k (this role).]

This role pays more than 72% of similar roles. Most similar roles pay $162,000–$244,050. At the midpoint, this role pays about $236k, versus about $203k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 855 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 843 roles with salary data.


At a glance

TL;DR · AI Benchmarking and Telemetry Engineer - NVIS

Join our team as an experienced HPC/AI Benchmarking and Telemetry Engineer to drive performance insights across advanced computing infrastructure. You’ll develop and execute detailed benchmarking approaches for large-scale GPU clusters, assess metrics to identify optimization opportunities, and build telemetry frameworks capturing system performance data from host-level GPUs through network infrastructure. Collaborate with engineering teams and customers to define requirements, fix bottlenecks, and validate configurations against real-world workloads using tools like Prometheus, Grafana, and NVIDIA’s DCGM. Maintain expertise in Linux administration, GPU architectures, and scripting languages for automation and analysis. Ideal candidates have 8+ years of experience in HPC/AI infrastructure, including cluster deployment and performance analysis, with a background in high-performance networking technologies and parallel filesystems.

What you'll do

  • Develop and execute detailed benchmarking methods for large-scale GPU clusters.
  • Build telemetry frameworks to capture system performance data across multiple levels.
  • Deploy observability stacks including monitoring tools like Prometheus and Grafana.
  • Collaborate with teams to define performance requirements and validate configurations.
  • Maintain knowledge of industry-standard benchmarks in HPC and machine learning fields.
  • Craft and implement telemetry solutions for large-scale distributed systems.

What we're looking for

  • 8+ years of experience with HPC and AI infrastructure, including deployment and performance analysis.
  • Deep expertise in Linux system administration and kernel tuning for large-scale systems.
  • Proven experience crafting telemetry and monitoring solutions using tools like Prometheus and Grafana.
  • Solid understanding of GPU architectures and CUDA programming principles for HPC/AI workloads.
  • Proficiency in Python and Bash scripting for automation, data analysis, and workflow orchestration.
  • Excellent analytical skills to interpret complex performance data and communicate findings effectively.

More like this

Similar roles

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA) · 52 days ago · $148,000–$235,750
Linux, Bash, Python, Ansible, SLURM, LSF, UGE, Kubernetes, HPL, NCCL, MLPerf, InfiniBand, MPI, Lustre, GPFS, BCM, Terraform, CI/CD

Applied AI Engineer

Broadcom

Palo Alto, CA · 23 days ago · $141,300–$226,000
Python, Kubernetes, Terraform, Docker, CI/CD, VMware, vSphere, vSAN, NSX, Aria, AWS, GCP, Azure, PostgreSQL, MongoDB, Redis, Prometheus, Grafana, GitLab, Jenkins

Applied AI Engineer

Ramp

Remote (New York City, New York, US) · 143 days ago · $155,000–$339,500
Python, JavaScript, Node.js, Django, Flask, React, PostgreSQL, MongoDB, AWS, GCP, Kubernetes, Terraform, CI/CD, GitOps

Applied AI Engineer

Booz Allen Hamilton

Fort Belvoir, VA · 20 days ago · $99,000–$225,000
Python, FastAPI, Flask, Streamlit, Gradio, React, TypeScript, Kubernetes, CI/CD, Prometheus, Grafana, MLOps, Docker, PostgreSQL, AWS, Azure, Google Cloud Platform

Microsoft Careers

Microsoft

US · 16 days ago
Python, C#, CI/CD, Terraform, Azure, RBAC, Managed Identities, Secrets Management, Prometheus, Grafana, Docker, Kubernetes, PostgreSQL, LLM, Prompt Engineering, RAG, Responsible AI, Test-Driven Development, Feature Flags, Staged Rollouts