Senior Software Engineer - NVLink Rack Scale Stability and Reliability
At a glance
TL;DR (AI generated)
As a Senior Software Engineer in the Fabric Networking team, you will focus on enhancing stability and reliability for NVLink Rack-Scale Systems. Your daily tasks include driving platform bringup, feature enablement, software validation, and debugging complex system-level issues across various layers. You’ll develop tools and diagnostics to support large-scale AI infrastructure, lead MTBI validations through stress testing and telemetry analysis, and collaborate with cross-functional teams to improve overall system quality. Essential skills include strong programming in C/C++ and Python, networking fundamentals, experience with NVIDIA GPU systems, and proficiency in server management technologies and data center operations. This role demands expertise in large-scale AI system architecture and a passion for solving complex challenges at scale.
What you'll do
- Drive platform bringup and feature enablement for next-generation NVLink-based systems.
- Develop tools and infrastructure for system validation and regression testing.
- Lead reliability validation through stress testing and telemetry analysis.
- Triage complex software and hardware issues across various environments.
- Build SRE-style validation infrastructure, including provisioning and monitoring.
- Create automation and dashboards to improve root-cause analysis efficiency.
What we're looking for
- 5+ years of system software or firmware development experience.
- Strong programming skills in C/C++ and Python; Bash/Shell scripting preferred.
- Expertise in networking fundamentals, including TCP/IP, Ethernet, InfiniBand, and RDMA/RoCE.
- Experience with large-scale AI systems, platform bringup, validation, and reliability engineering.
- Ability to triage complex multi-domain issues using logs, telemetry, and structured debugging methods.
- Strong communication skills for collaboration across engineering, customer, and operations teams.
Employer
About Nvidia
Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.
Industry: Semiconductors & AI Computing
Nvidia currently has 825 open roles on FindRole.
Listed pay typically runs $184,000–$287,500 across 813 roles with salary data.
Most-posted roles
- Senior Solutions Architect, AI Infrastructure (4)
- Senior System Software Engineer - AV Platform (4)
- Senior Circuit Design Engineer (3)
- Senior Circuit Methodology Engineer (3)
- Senior Deep Learning Performance Architect (3)