Senior Software Engineer, RL Post-Training Frameworks

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $184,000–$287,500 / yr
Posted: 46 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $120k to $305k, marking the midpoint for similar roles ($182k) and for this role ($236k)]

This role pays more than 88% of similar roles, most of which pay $142,450–$222,000. At the midpoint, this role pays about $236k versus about $182k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 985 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 971 roles with salary data.

At a glance

TL;DR · Senior Software Engineer, RL Post-Training Frameworks

Join NVIDIA’s RL Frameworks engineering team as a senior engineer to develop the open-source tools and infrastructure that enable AI researchers and post-training teams. You will architect and build scalable reinforcement learning infrastructure, from single-GPU experiments to large-scale production deployments across thousands of nodes, optimizing performance on GPUs, CPUs, and LPUs while contributing to frameworks like VeRL, Miles, and TorchTitan. Your role includes enhancing distributed runtimes such as Ray and Monarch for fault tolerance and elastic scaling, collaborating with hardware teams to leverage next-generation capabilities, and advocating for the needs of researchers and partners within NVIDIA’s ecosystem. Strong proficiency in Python and C/C++, experience with large-scale distributed systems, and depth in reinforcement learning algorithms or PyTorch internals are essential, along with contributions to open-source projects and hands-on experience with production failures at scale.

What you'll do

  • Design and implement scalable RL infrastructure for efficient experimentation and production.
  • Optimize the performance of RL training, inference, and rollout loops across GPUs, CPUs, and LPUs.
  • Contribute to and enhance open-source RL frameworks like VeRL and TorchTitan.
  • Ensure fault tolerance and elastic scaling in distributed training jobs.
  • Collaborate with teams to integrate CPU-driven rollout workloads efficiently.
  • Advocate for RL workload requirements with NVIDIA's networking and compiler teams.

What we're looking for

  • MS or PhD in Computer Science, Engineering, or a related field, with 5+ years of professional experience.
  • Strong proficiency in Python and C/C++ for building large-scale distributed systems.
  • Experience contributing to open-source RL frameworks such as VeRL, Miles, and TorchTitan.
  • Deep understanding of reinforcement learning algorithms and their distributed execution challenges.
  • Expertise in Kubernetes runtime internals and end-to-end distributed system design.

More like this

Similar roles

Senior Software Engineer, Platform

Anduril Industries

Costa Mesa, CA · 4 days ago · $191,000–$253,000
Go C++ Python Rust AWS Azure CI/CD Terraform NixOS Kubernetes Docker Prometheus Grafana PostgreSQL MongoDB Redis Git GitHub Jenkins

Senior Software Engineer, Platform

Anduril Industries

Seattle, WA · 4 days ago · $191,000–$253,000
Go C++ Python Rust Java JavaScript TypeScript AWS Azure CI/CD Terraform NixOS Kubernetes Prometheus Grafana PostgreSQL Docker

Senior Software Engineer, Platform

Anduril Industries

Boston, MA · 4 days ago · $191,000–$253,000
Go C++ Python Rust Java TypeScript AWS Azure CI/CD Terraform NixOS Kubernetes Prometheus Grafana

Senior Software Engineer, Infrastructure

Anduril Industries

Washington, District of Columbia · 4 days ago · $220,000–$292,000
Python Kubernetes Docker CI/CD Java C++ Rust Go JavaScript AWS PostgreSQL Terraform ML infrastructure Virtualization Containerization