Senior Deep Learning Communication Architect

Nvidia

Santa Clara, US · Austin, US · Posted 10 days ago · $184,000–$287,500 / year

At a glance

AI generated

TL;DR

NVIDIA’s software architecture group is hiring a Senior Deep Learning Communication Architect to scale deep learning models across large-scale systems with hundreds of thousands of nodes. The role involves identifying and eliminating communication bottlenecks in distributed training and inference, designing efficient protocols for high-speed interconnects such as NVLink and InfiniBand, and working closely with hardware teams to optimize performance. The ideal candidate has extensive experience with deep learning frameworks such as PyTorch and TensorRT-LLM, strong programming skills in C++ and Python, and a deep understanding of parallelism techniques, including Data Parallelism and Pipeline Parallelism. Familiarity with GPU computing technologies such as CUDA and OpenCL, and prior contributions to DNN training and inference frameworks, are also expected.

Skills

PyTorch, TensorRT-LLM, vLLM, SGLang, C++, Python, CUDA, OpenCL, InfiniBand, RoCE, MPI, NCCL, UCX, UCC, NVSHMEM, Data Parallelism, Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, FSDP, Disaggregated Serving, Dynamo, Triton

What you'll do

  • Identify and eliminate bottlenecks in data transfer during distributed deep learning training.
  • Design communication protocols tailored for deep learning workloads to minimize overhead.
  • Collaborate with hardware teams to apply high-speed interconnects effectively.
  • Research new communication technologies to enhance performance of deep learning systems.
  • Develop proofs-of-concept and conduct experiments to validate new communication strategies.

What we're looking for

  • Ph.D., Master's, or BS in CS, EE, CSEE, or a related field, with 6+ years of experience.
  • Expertise in evaluating and optimizing LLM training and inference performance on advanced hardware.
  • Deep understanding of parallelism techniques for DNN frameworks and large-scale systems.
  • Proficiency in developing code for PyTorch, TensorRT-LLM, vLLM, SGLang, and other DNN frameworks.
  • Strong programming skills in C++ and Python with experience in GPU computing (CUDA/OpenCL).
  • Familiarity with high-speed interconnects like NVLink, InfiniBand, and communication libraries.

Market check

Salary context

This $184,000–$287,500 range sits above 72% of similar postings on FindRole.

Peer median band

$183,300–$262,400

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$185,250–$246,150

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Principal Deep Learning Communication Architect

Nvidia

Remote (Santa Clara, CA, US) · 46 days ago · $272,000–$431,250
NCCL, UCX, UCC, NVSHMEM, MPI, RDMA, RoCE, InfiniBand, TensorRT-LLM, vLLM, SGLang, NVIDIA Dynamo, CUDA, Megatron-Core, DeepSpeed, JAX, XLA, PyTorch Distributed, KServe, Ray

Senior Deep Learning Framework Communications Engineer

Nvidia

Remote (Santa Clara, CA, US) · 12 days ago · $152,000–$241,500
PyTorch, C++, CUDA, Python, NCCL, NVSHMEM, JAX, TRT-LLM, vLLM, SGLang, HPC, AI, MPI, TensorRT, NVIDIA Nsight Systems, Performance Profiling, Parallel Programming, Compiler Technologies, Memory Hierarchy, Tensor Layout, Distributed Inference, Mixture of Experts, Reinforcement Learning

Senior Deep Learning Performance Architect

Nvidia

Santa Clara, CA, US · 25 days ago · $184,000–$287,500
Python, C++, GPU, ASIC, Deep Learning, LLM, Batching, KV-cache, Latency/Tuning, Multi-node Scaling, Memory Hierarchy, Scalability, System Architecture, Performance Tuning, Profiling, Debugging

Senior Deep Learning Performance Architect

Nvidia

Santa Clara, CA, US · 141 days ago · $184,000–$287,500
Python, C, C++, PyTorch, JAX, TensorRT, cuDNN, cuBLAS, CUTLASS, MLIR, Triton, CUDA, OpenCL, GPU, Deep Learning, ASIC, Performance Modeling, Architecture Simulation, Profiling, Analysis

Senior Deep Learning Performance Architect

Nvidia

Santa Clara, CA, US · 141 days ago · $184,000–$287,500
Python, C++, GPU, Deep Learning, ASIC, Transformer Models, Computer Architecture, Interconnect Fabrics, Parallel Computing, AI Algorithms

Senior Deep Learning Software Engineer

Nvidia

US · 86 days ago · $224,000–$356,500
Python, PyTorch, JAX, CUDA, TensorRT, NVIDIA TensorRT-LLM, GPU optimization, CUTLASS, Triton, Deep learning frameworks, Performance analysis, GPU architecture, High performance computing, Model inference, Inference optimization