AI Inference Performance Engineer

Nvidia

Hybrid

Quick summary

Work type: Hybrid
Location: Santa Clara, CA
Salary: $152,000–$241,500 / yr
Posted: 89 days ago

Market check

Salary context

Competitive pay

How this pay compares to similar roles

Chart: midpoint pay of $197k for this role versus $211k for similar roles, on a scale from $139k to $271k, with a shaded band marking where most similar roles pay.

About 65% of similar roles pay more than this one. Most similar roles pay $175,740–$246,150, the shaded band in the chart above. At the midpoint, this role pays about $197k, versus about $211k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · AI Inference Performance Engineer

As a senior performance engineer on NVIDIA’s DL Architecture team, you will drive industry benchmark results by optimizing end-to-end inference pipelines for TensorRT-LLM, SGLang, and vLLM, focusing on quantization, scheduling, memory management, and distributed inference. You’ll define cutting-edge benchmarks, collaborate with framework teams to enhance performance on large-scale models, architect distributed systems from single-GPU to rack-scale clusters, and establish robust performance methodologies using roofline analysis and profiling tools. Additionally, you will contribute to open-source projects, influence GPU roadmaps, and lead a high-impact technical team under tight deadlines. This role requires expertise in Python or C++, deep learning frameworks like PyTorch, experience with large language models and vision-language workloads, and proficiency in CUDA programming and kernel development.

Skills

Python, C++, PyTorch, JAX, TensorRT-LLM, vLLM, SGLang, CUDA, MPI, NCCL, K8s, CUTLASS, cuteDSL, tilelang, OpenAI Triton, torch.compile, GPU, FPGA, roofline analysis, performance profiling

What you'll do

Drive the end-to-end optimization pipeline for GenAI inference on NVIDIA accelerators.
Define and optimize next-generation AI workloads and benchmarks across various models.
Architect distributed inference systems from single-GPU to rack-scale clusters.
Apply roofline analysis and profiling to identify performance bottlenecks in CUDA kernels.
Contribute to open-source projects like TensorRT-LLM, vLLM, and SGLang for GPU optimization.

What we're looking for

5+ years of software development experience with Python or C++
Expertise in deep learning frameworks like PyTorch or JAX
Proven ability to deliver measurable performance improvements in DL inference
Deep understanding of LLM/VLM architectures and inference mechanics
Experience with large-scale GPU clusters and scale-out inference orchestration
Expertise in kernel development for GPUs (CUDA, CUTLASS) and compiler/runtime paths
Track record of leading high-impact technical programs across teams under tight deadlines

Similar roles

AI Inference Performance Engineer - New College Grad 2026

Nvidia

Santa Clara, CA · 3 days ago · $124,000–$195,500

Python, C++, PyTorch, JAX, TensorRT-LLM, vLLM, SGLang, CUDA, CUTLASS, cuteDSL, tilelang, OpenAI Triton, torch.compile, MPI, NCCL, K8s, roofline analysis, performance profiling, GPU programming, deep learning inference

Applied AI Engineer

Booz Allen Hamilton

Fort Belvoir, VA · 22 days ago · $99,000–$225,000

Python, FastAPI, Flask, Streamlit, Gradio, React, TypeScript, Kubernetes, CI/CD, Prometheus, Grafana, MLOps, Docker, PostgreSQL, AWS, Azure, Google Cloud Platform

Applied AI Engineer

Apple Inc

Cupertino, CA · 24 days ago · $181,100–$272,100

Python, FastAPI, LangChain, LLMs, GenAI, RESTful APIs, Vector databases, Async programming, Pipeline orchestration, Prometheus, OpenTelemetry, Redis, RabbitMQ, Kafka, Docker, CI/CD

Senior AI Inference Compiler Engineer

Nvidia

Remote (Santa Clara, CA) · 102 days ago · $152,000–$241,500

MLIR, XLA, LLVM, PyTorch, GPU, CUDA, C++, Compiler Technologies, Deep Learning Models, LLM Inference Optimizations, High Performance Computing, Fast Build Time, Kernel Generation, Neural Networks, Software Engineering

AI/ML Engineer

Lam Research

Fremont, CA · 64 days ago · $119,000–$261,000

Python, C++, PostgreSQL, SQLite, MySQL, Git, Domain-Driven Design, Test-Driven Development, CI/CD

Hybrid

AI/ML Engineer

Booz Allen Hamilton

Norfolk, VA · 3 days ago

Spark, Hadoop, Databricks, Python, Java, Scala, R, TensorFlow, Keras, PyTorch, CI/CD, MLOps, Git, Jupyter Notebook, PostgreSQL, MongoDB, AWS, Azure, Google Cloud Platform, Kubernetes, Docker
