Senior AI and HPC Observability Engineer

Nvidia

Actively hiring

Santa Clara, US · Seattle, US · Posted 88 days ago · $152,000–$241,500 / year

At a glance

AI generated

TL;DR

As a Senior AI and HPC Observability Engineer on NVIDIA’s Managed AI Superclusters (MARS) team, you will design and scale observability platforms that handle high-volume metrics, logs, and traces across distributed environments. Day-to-day work includes building high-performance backend services for telemetry ingestion, processing, and routing, and developing OpenTelemetry collectors and instrumentation libraries. You will also optimize metrics pipelines built on large-scale time-series storage systems and collaborate with platform engineering teams on operational excellence. The role requires strong skills in Python, Go, or Java, hands-on experience with modern observability architectures (metrics, logs, and traces), solid experience with PromQL and time-series systems such as Prometheus, and experience with distributed data pipelines built on Kafka, Spark, or Flink, along with familiarity with Kubernetes and cloud-native infrastructure. Ideal candidates have a background in data engineering and real-time performance tuning for AI/ML pipelines, GPU workload monitoring, and intelligent alerting systems.

Skills

Python · Go · Java · Kubernetes · OpenTelemetry · Prometheus · Kafka · Spark · Flink · PromQL · Docker · CI/CD · Git · Linux · AWS · GCP · Azure

What you'll do

Design and scale observability platforms to handle high-volume metrics, logs, and traces in distributed environments.
Build high-performance backend services for telemetry ingestion, processing, and routing.
Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries.
Optimize metrics pipelines using large-scale time-series storage systems.
Design real-time and batch telemetry pipelines using streaming and distributed data technologies.
Improve platform reliability, performance, and cost efficiency through tuning and system optimization.
Develop monitoring, alerting, and service reliability frameworks to ensure platform health and performance.

What we're looking for

5+ years of experience building backend or distributed systems in production environments.
Strong programming skills in Python, Go, or Java, with a record of production-quality software development.
Hands-on experience with modern observability architectures, including metrics, logs, and traces.
Solid experience with PromQL and time-series data systems such as Prometheus.
Experience building or operating distributed data pipelines using Kafka, Spark, or Flink.
Strong understanding of distributed systems, concurrency, fault tolerance, debugging, performance tuning, and production operations.

Market check

Salary context

This $152,000–$241,500 range sits above 41% of similar postings on FindRole.

Peer median band

$170,300–$262,400

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$181,758–$246,150

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

Similar roles

Senior HPC Performance Engineer - AI for Science at Scale

Nvidia

Santa Clara, CA, US · 101 days ago · $184,000–$287,500

CUDA · Python · C++ · PyTorch · JAX · Warp · HPC · Distributed Learning · Atomistic Modeling · CI/CD · Git · Linux · NVIDIA DGX Systems · GPU Programming · Parallel Computing · Data Structures · Algorithm Design · Machine Learning Frameworks · Scientific AI Codebases · Computational Chemistry · Digital Biology

Senior HPC and AI Networking Performance Research and Analysis Engineer

Nvidia

Santa Clara, CA, US · 53 days ago · $152,000–$241,500

Python · C · Bash · CUDA · NCCL · RDMA · MPI · TensorFlow · PyTorch · RoCE · Linux · Intel CPUs · AMD CPUs · ARM CPUs · NVIDIA GPUs · HCA · PCI · Performance Analysis · CI/CD

Senior Software Architect - Deep Learning and HPC Communications

Nvidia

Remote (Santa Clara, CA, US) · 20 days ago · $184,000–$287,500

C/C++ · MPI · NCCL · NVSHMEM · UCX · CUDA · Linux · InfiniBand · RoCE · NVLink · PyTorch · TensorFlow · HPC · Networking · Simulation · Quantitative Modeling · SHMEM · Parallel Programming · Deep Learning Pods

Senior HPC and Quantum Systems Engineer

Nvidia

Remote (Westford, MA, US) · 140 days ago · $224,000–$356,500

Linux · Slurm · CUDA-Q · cuQuantum · NVQlink · Python · C++ · NVIDIA · HPC · GPU · Neutral atom · Trapped ion · Superconducting · Photonic · Qiskit · Cirq · PennyLane · Braket · Networking · Storage · Data center infrastructure · Quantum computing

Senior HPC and Quantum Systems Engineer

Nvidia

Remote (Westford, MA, US) · 59 days ago · $184,000–$287,500

CUDA-Q · cuQuantum · NVQlink · Linux · Slurm · Python · C++ · NVIDIA · HPC · Neutral atom · Trapped ion · Superconducting · Photonic · Qiskit · Cirq · PennyLane · Braket · APIs · Middleware · Orchestration frameworks · Networking · Storage · Data center environments · Quantum computing concepts · Automation · Workflow orchestration · Real-time control considerations

Senior AI Compute Engineer - NVIS

Nvidia

Remote (Santa Clara, CA, US) · 46 days ago · $148,000–$235,750

Linux · Bash · Python · Ansible · SLURM · LSF · UGE · Kubernetes · HPL · NCCL · MLPerf · InfiniBand · MPI · Lustre · GPFS · BCM · Terraform · CI/CD
