Senior Deep Learning Performance Architect

Nvidia

Actively hiring
Santa Clara, CA, US · Posted 139 days ago · $184,000–$287,500 / year

At a glance

AI generated

TL;DR

NVIDIA is seeking a Senior Deep Learning Performance Architect to join its Deep Learning Architecture team. The role involves developing hardware architectures that improve parallel computing performance, energy efficiency, and programmability for AI applications. The candidate will benchmark AI workloads, build simulation tools in C++ and Python, evaluate power, performance, and area (PPA) trade-offs, collaborate with product management, and keep up with deep learning trends. Ideal candidates hold an MS or PhD in a relevant field and have 4+ years of experience in parallel computing, GPU architecture evaluation, and deep learning applications. Strong Python and C++ skills, along with knowledge of computer architecture and transformer models, are essential. The role also requires communicating complex technical concepts clearly and solving intricate problems in the fast-moving AI industry.

Skills

Python, C++, GPU, Deep Learning, ASIC, Transformer Models, Computer Architecture, Interconnect Fabrics, Parallel Computing, AI Algorithms

What you'll do

  • Develop innovative hardware architectures that improve parallel computing performance.
  • Benchmark and analyze AI workloads across single and multi-node setups.
  • Build high-level simulation tools in C++ and Python to support performance analysis.
  • Evaluate power, performance, and area (PPA) for hardware features and system architecture choices.
  • Guide product development by collaborating with peer teams and management.

What we're looking for

  • MS or PhD in Computer Science, Electrical Engineering, or related field.
  • 4+ years of experience in parallel computing architectures and deep learning applications.
  • Experience with GPU or Deep Learning ASIC architecture evaluation for training and inference.
  • Strong programming skills in Python and C++.
  • Solid understanding of computer architecture and interconnect fabrics.

Market check

Salary context

This $184,000–$287,500 range sits above 75% of similar postings on FindRole.

Peer median band

$181,087–$262,400

Median floor and ceiling across peers.

Typical range (25th–75th percentile)

$185,162–$240,225

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and systems-on-chip (SoCs), powering gaming, professional visualization, data center, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 802 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 798 roles with salary data.



Similar roles

Senior Deep Learning Performance Architect

Nvidia

Santa Clara, CA, US · 139 days ago · $184,000–$287,500
Python, C, C++, PyTorch, JAX, TensorRT, cuDNN, cuBLAS, CUTLASS, MLIR, Triton, CUDA, OpenCL, GPU, Deep Learning ASIC, Performance Modeling, Architecture Simulation, Profiling, Analysis

Senior Deep Learning Performance Architect

Nvidia

Santa Clara, CA, US · 23 days ago · $184,000–$287,500
Python, C++, GPU, ASIC, Deep Learning, LLM, Batching, KV-cache, Latency/Tuning, Multi-node Scaling, Memory Hierarchy, Scalability, System Architecture, Performance Tuning, Profiling, Debugging

Senior Deep Learning Performance Architect - LPU

Nvidia

Remote (CA, US) · 10 days ago · $152,000–$241,500
Python, C, C++, CUDA, MPI, OpenMP, HPC, GPU, Deep Learning, Machine Learning, Performance Modeling, Systems Performance Analysis, AI Inference Workloads, CUDA Kernels, Custom ASIC Hardware

Senior Deep Learning Computer Architect

Nvidia

Santa Clara, CA, US · 139 days ago · $184,000–$287,500
C++, Python, CUDA, PyTorch, GPU, Computer Architecture, Deep Learning Kernels, LLM Workloads, Performance Analysis, Parallelization Strategies, Fusion Strategies

Senior Deep Learning Communication Architect

Nvidia

Santa Clara, CA, US · 8 days ago · $184,000–$287,500
PyTorch, TensorRT-LLM, vLLM, SGLang, C++, Python, CUDA, OpenCL, InfiniBand, RoCE, MPI, NCCL, UCX, UCC, NVSHMEM, Data Parallelism, Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, FSDP, Disaggregated Serving, Dynamo, Triton