Principal Platform Software Engineer - RAS

Nvidia

Remote

Quick summary

Work type: Remote
Location: Santa Clara, CA
Salary: $272,000–$431,250 / yr
Posted: 9 days ago

Market check

Salary context

Above market

How this pay compares to similar roles

Pay scale shown: $132k to $463k. Similar roles: $201k. This role: $352k.

This role pays more than 99% of similar roles, most of which pay $171,636–$230,400. At the midpoint, this role pays about $352k versus about $201k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads. Industry: Semiconductors & AI Computing

Nvidia currently has 563 open roles on FindRole.

Listed pay typically runs $168,000–$264,500 across 556 roles with salary data.

At a glance

TL;DR · Principal Platform Software Engineer - RAS

Join our expert team as a senior engineer responsible for designing rack-level solutions for next-generation AI supercomputing platforms using NVIDIA GH200 superchips, focusing on fleet management and health monitoring at scale. You will collaborate closely with customers, product managers, and architects to define requirements, conduct proof-of-concept validations, and write detailed architecture documents while ensuring robust testing and quality assurance processes are in place. The ideal candidate has over 15 years of hands-on coding experience, expertise in C/C++ and Python, and a deep understanding of time series databases like InfluxDB and Prometheus, as well as telemetry visualization solutions such as Grafana. Familiarity with Confidential Compute, ML optimization techniques, and Open Compute Project (OCP) standards is also beneficial. The role demands strong communication skills and a commitment to delivering high-quality work.

What you'll do

  • Drive the design of fleet management solutions for scaling AI infrastructure using NVIDIA GPUs and Grace.
  • Develop detailed architectures for health monitoring and fault-remediation at scale, conducting POCs to validate designs.
  • Educate customers on product architecture, gather feedback, and write comprehensive documentation for end-to-end delivery.
  • Ensure thorough testing by collaborating with the development team to enhance unit tests and create robust test plans.
  • As product owner, manage product life cycles in collaboration with QA teams to ensure code is properly productized.
  • Contribute to all phases of product development, from definition through early customer support.

What we're looking for

  • 15+ years of hands-on coding experience in C/C++ and Python
  • Strong knowledge of time series databases, telemetry visualization solutions, and REST APIs
  • Proven record of designing scalable AI infrastructure solutions using NVIDIA GPUs and Grace
  • Experience optimizing firmware architecture and analyzing system resource requirements
  • Expertise in SCM tools (Git, Perforce) and Jira for project management
  • Active contributor to Open Compute Project and DMTF with hands-on x86 or ARM system architecture experience

More like this

Similar roles

Principal Software Engineer, Data Platform

Salesforce

Remote (San Francisco, CA) · 40 days ago · $197,300–$313,700
Snowflake, dbt, Informatica, Airflow, Neo4j, TopQuadrant, Terraform, Helm, Python, Java, Go, Kafka, CI/CD, SRE, AWS, GCP, Kubernetes, SQL, Jinja, Cypher, Vector databases, LLMs, RAG architectures

Principal Software Engineer, Delivery Platform

Snap Inc.

Santa Monica, CA · 2 days ago · $276,000–$414,000
Python, Go, Kubernetes, Docker, AWS, CI/CD, PostgreSQL, Redis, GraphQL, React, JavaScript, Node.js, MongoDB, Cassandra, Hadoop, Spark, Kafka, Prometheus, Grafana

Lead Software Engineer - Platform Services

Salesforce

San Francisco, CA · 22 days ago · $172,500–$260,100
AWS, API Gateway, Lambda, SNS, SQS, ElastiCache, AppSync, DynamoDB, Neo4j, Kafka, Kinesis, JavaScript, TypeScript, Node.js, JWT, CORS, CSP, XSS, Jest, Playwright, Terraform, OAuth 2.0, OpenID Connect, GraphQL, Microservices, Domain Driven Design, RESTful APIs, CI/CD, DevOps, AI Coding Assistants, Web Security