Principal Developer, AI Networking

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $272,000–$431,250 / yr
Posted: 4 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Salary comparison chart: similar roles about $200k, this role about $352k, on a scale from $132k to $463k; the shaded band marks where most similar roles pay.]

This role pays more than 99% of similar roles. Most pay $163,987–$236,675 — the shaded band above. At the midpoint, this role pays about $352k versus about $200k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 980 open roles on FindRole.

Listed pay typically runs $168,000–$270,250 across 966 roles with salary data.

At a glance

TL;DR · Principal Developer, AI Networking

As a senior software engineer in the AI Networking Codesign and Benchmarking R&D group, you will profile, analyze, and optimize AI workloads on large-scale GPU and CPU clusters used for distributed deep learning, including LLM training and inference. Day-to-day responsibilities include characterizing AI workloads, benchmarking performance to identify bottlenecks, developing PyTorch trace-based profiling tools, and working with hardware and software teams to give them insight into system performance. The role requires extensive experience with high-performance networking technologies such as RDMA, MPI, NCCL, and SHARP, along with proficiency in Python, Bash, and C++. Expertise in NVIDIA GPUs, CUDA, PyTorch, and other deep learning frameworks is also essential.

What you'll do

  • Profile and analyze AI workloads on large-scale GPU clusters for LLM training.
  • Identify performance bottlenecks in distributed systems with a focus on networking.
  • Develop PyTorch-based profiling tools to optimize network system performance.
  • Define performance test plans and set expectations for new technologies.
  • Engage with hardware and software teams to provide performance analysis insights.

What we're looking for

  • Over 15 years of experience in high-performance networking technologies (RDMA, MPI, NCCL).
  • Expertise in performance evaluation techniques and deep learning frameworks (TensorFlow, PyTorch).
  • Proficiency with NVIDIA GPUs, the CUDA library, and collective communication libraries such as NCCL.
  • Strong analytical skills and hands-on experience with AI workloads and benchmarking for LLM training.
  • Experience developing tools using Python, Bash, and C++ in a container-based environment.
  • Comprehensive knowledge of system components including CPUs (Intel/AMD/ARM), GPUs, HCA, memory, PCI.
  • Ability to collaborate effectively across hardware and software teams.

More like this

Similar roles

Senior Software Engineer, AI Networking

Nvidia

Santa Clara, CA +1 · 33 days ago · $152,000–$241,500
Python, PyTorch, TensorFlow, JAX, CUDA, NCCL, Reinforcement Learning, Bayesian Optimization, GNNs, Docker, Kubernetes, CI/CD, Prometheus, Grafana, Bash, C++, PostgreSQL, Redis

Principal Software Engineer, AI Networking

Nvidia

Remote (Santa Clara, CA) · 103 days ago · $272,000–$431,250
C, C++, RoCE, InfiniBand, DPDK, DOCA, RDMA, NCCL, CUDA, Congestion Control, Packet Drops, QoS, Buffer Management, Distributed Systems, High Performance Networking, System Level Debugging, Automation, Telemetry, CI/CD

Principal Network Developer, AI Infrastructure

Oracle

Austin, TX +1 · 7 days ago · $109,200–$223,400
Oracle Cloud Infrastructure, Python, Networking Protocols, Automation Scripts, CI/CD, Monitoring Systems, Docker, Kubernetes, Terraform, PostgreSQL, MySQL, Cisco, RFP Development, Vendor Management, Technical Coaching

Principal Architect, AI Networking

Nvidia

Remote (Santa Clara, CA) +1 · 54 days ago · $272,000–$431,250
C, C++, Rust, Python, CUDA, InfiniBand, RoCE, RDMA, NVLink, NIXL, NCCL, UCX, MPI, NVSHMEM, vLLM, SGLang, TensorRT-LLM, ML systems concepts, High-performance networking

Senior Software Engineer, AI Networking

Nvidia

Austin, TX +1 · 86 days ago · $184,000–$287,500
C, C++, RDMA verbs, DPDK, DOCA, NCCL, CUDA, InfiniBand, RoCE, Docker, Kubernetes, AWS, CI/CD, Prometheus, Grafana, Python, PostgreSQL