Senior Software Engineer - NVLink Rack Scale Stability and Reliability

Nvidia

Remote Actively hiring Posted this week
Santa Clara, CA Posted 4 days ago $152,000–$241,500 / year

At a glance

AI generated

TL;DR

As a Senior Software Engineer in the Fabric Networking team, you will focus on enhancing stability and reliability for NVLink Rack-Scale Systems. Your daily tasks include driving platform bringup, feature enablement, software validation, and debugging complex system-level issues across various layers. You’ll develop tools and diagnostics to support large-scale AI infrastructure, lead MTBI validations through stress testing and telemetry analysis, and collaborate with cross-functional teams to improve overall system quality. Essential skills include strong programming in C/C++ and Python, networking fundamentals, experience with NVIDIA GPU systems, and proficiency in server management technologies and data center operations. This role demands expertise in large-scale AI system architecture and a passion for solving complex challenges at scale.

Skills

Python, C/C++, Bash, NVIDIA GPU, NVLink, NVSwitch, CUDA, TCP/IP, Ethernet, InfiniBand, RDMA/RoCE, PCIe, memory hierarchy, DMA, high-speed interconnects, distributed training/inference systems, server management technologies, data center operations, cluster provisioning, CI/CD, dashboards, SRE, telemetry analysis

What you'll do

  • Drive platform bringup and feature enablement for next-generation NVLink-based systems.
  • Develop tools and infrastructure for system validation and regression testing.
  • Lead reliability validation through stress testing and telemetry analysis.
  • Triage complex software and hardware issues across various environments.
  • Build SRE-style validation infrastructure, including provisioning and monitoring.
  • Create automation and dashboards to improve root-cause analysis efficiency.

What we're looking for

  • 5+ years of system software or firmware development experience.
  • Strong programming skills in C/C++ and Python; Bash/Shell scripting preferred.
  • Expertise in networking fundamentals including TCP/IP, Ethernet, InfiniBand, RDMA/RoCE.
  • Experience with large-scale AI systems, platform bringup, validation, and reliability engineering.
  • Ability to triage complex multi-domain issues using logs, telemetry, and structured debugging methods.
  • Strong communication skills for collaboration across engineering, customer, and operations teams.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 825 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 813 roles with salary data.
