Senior AI and HPC Observability Engineer

Nvidia

Actively hiring
Santa Clara, US · Seattle, US · Posted 88 days ago · $152,000–$241,500 / year

At a glance

AI generated

TL;DR

As a Senior AI and HPC Observability Engineer on NVIDIA’s Managed AI Superclusters (MARS) team, you will design and scale observability platforms that handle high-volume metrics, logs, and traces across distributed environments. Day-to-day work includes building high-performance backend services for telemetry ingestion, processing, and routing, and developing OpenTelemetry collectors and instrumentation libraries. You will also optimize metrics pipelines built on large-scale time-series storage systems and collaborate with platform engineering teams to ensure operational excellence. The role requires strong skills in Python, Go, or Java; hands-on experience with modern observability architectures (metrics, logs, and traces); solid experience with PromQL and time-series systems such as Prometheus; experience with distributed data pipelines using Kafka, Spark, or Flink; and familiarity with Kubernetes and cloud-native infrastructure. Ideal candidates have a background in data engineering and real-time performance tuning for AI/ML pipelines, GPU workload monitoring, and intelligent alerting systems.

Skills

Python · Go · Java · Kubernetes · OpenTelemetry · Prometheus · Kafka · Spark · Flink · PromQL · Docker · CI/CD · Git · Linux · AWS · GCP · Azure

What you'll do

  • Design and scale observability platforms to handle high-volume metrics, logs, and traces in distributed environments.
  • Build high-performance backend services for telemetry ingestion, processing, and routing.
  • Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries.
  • Optimize metrics pipelines using large-scale time-series storage systems.
  • Design real-time and batch telemetry pipelines using streaming and distributed data technologies.
  • Improve platform reliability, performance, and cost efficiency through tuning and system optimization.
  • Develop monitoring, alerting, and service reliability frameworks to ensure platform health and performance.

What we're looking for

  • 5+ years of experience building backend or distributed systems in production environments.
  • Strong programming skills in Python, Go, or Java, with experience developing production-quality software.
  • Hands-on experience with modern observability architectures, including metrics, logs, and traces.
  • Solid experience with PromQL and time-series data systems such as Prometheus.
  • Experience building or operating distributed data pipelines using Kafka, Spark, or Flink.
  • Strong understanding of distributed systems, concurrency, fault tolerance, debugging, performance tuning, and production operations.

Market check

Salary context

This $152,000–$241,500 range sits above 41% of similar postings on FindRole.

Peer median band

$170,300–$262,400

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$181,758–$246,150

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior HPC Performance Engineer - AI for Science at Scale

Nvidia

Santa Clara, CA, US · 101 days ago · $184,000–$287,500
CUDA Python C++ PyTorch JAX Warp HPC Distributed Learning Atomistic Modeling CI/CD Git Linux NVIDIA DGX Systems GPU Programming Parallel Computing Data Structures Algorithm Design Machine Learning Frameworks Scientific AI Codebases Computational Chemistry Digital Biology

Senior Software Architect - Deep Learning and HPC Communications

Nvidia

Remote (Santa Clara, CA, US) · 20 days ago · $184,000–$287,500
C/C++ MPI NCCL NVSHMEM UCX CUDA Linux InfiniBand RoCE NVLink PyTorch TensorFlow HPC Networking Simulation Quantitative_Modeling SHMEM Parallel_Programming Deep_Learning_Pods

Senior HPC and Quantum Systems Engineer

Nvidia

Remote (Westford, MA, US) · 140 days ago · $224,000–$356,500
Linux Slurm CUDA-Q cuQuantum NVQlink Python C++ NVIDIA HPC GPU Neutral atom Trapped ion Superconducting Photonic Qiskit Cirq PennyLane Braket Networking Storage Data center infrastructure Quantum computing

Senior HPC and Quantum Systems Engineer

Nvidia

Remote (Westford, MA, US) · 59 days ago · $184,000–$287,500
CUDA-Q cuQuantum NVQlink Linux Slurm Python C++ NVIDIA HPC Neutral atom Trapped ion Superconducting Photonic Qiskit Cirq PennyLane Braket APIs Middleware Orchestration frameworks Networking Storage Data center environments Quantum computing concepts Automation Workflow orchestration Real-time control considerations

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA, US) · 46 days ago · $148,000–$235,750
Linux Bash Python Ansible SLURM LSF UGE Kubernetes HPL NCCL MLPerf InfiniBand MPI Lustre GPFS BCM Terraform CI/CD