Software Engineer - AI Research Clusters

Nvidia

Remote · Actively hiring
Remote · Santa Clara, CA · Westford, MA · Austin, TX · Hillsboro, OR · Remote, CO · Posted 17 days ago · $124,000–$195,500 / year

At a glance

AI generated

TL;DR

As a Software Engineer in NVIDIA’s AI Platform organization, you will join a team that advances machine learning innovation by designing and implementing robust GPU cluster solutions. Day to day, you will work with cross-functional teams to identify operational challenges in validating, monitoring, and operating large-scale GPU clusters, then build engineering solutions that improve reliability and performance. You will also explore technologies such as AIOps and Agentic AI to automate operations and reduce manual intervention. Requirements include a BS/MS in Computer Science or equivalent experience, proficiency in Python, C++, or Rust, and hands-on experience with Docker, Kubernetes, GitLab CI, and full-stack development. Familiarity with GPU computing, Linux systems internals, and performance tuning at scale is also expected, along with strong coding skills and a passion for building reliable, user-friendly platforms.

Skills

Python · Kubernetes · Docker · GitLab CI · C++ · Rust · RSQL · REST API · JavaScript · CSS · Slurm · Linux · GPU computing · AIOps · Agentic AI · Terraform · Prometheus · Grafana · CI/CD

What you'll do

  • Design and develop engineering solutions to address operational pain points in GPU clusters.
  • Implement AIOps and Agentic AI technologies to minimize operational overhead in large-scale systems.
  • Maintain the platforms and systems built by the team and provide on-call support for them.
  • Research and integrate emerging AI techniques to enhance system reliability and performance.
  • Collaborate with cross-functional teams to optimize GPU cluster validation, monitoring, and operation.

What we're looking for

  • BS/MS in Computer Science or equivalent experience in software engineering.
  • 2+ years of software or platform engineering experience, including 1 year in ML infrastructure.
  • Strong coding skills in Python, C++, or Rust on Linux-based platforms.
  • Experience with Docker, Kubernetes, GitLab CI, and automated deployments.
  • Proficiency in full-stack development including REST APIs and database optimization.
  • Familiarity with GPU computing, Linux systems internals, and performance tuning.
  • Experience running Slurm or custom scheduling frameworks in production environments.

Market check

Salary context

This $124,000–$195,500 range sits above 30% of similar postings on FindRole.

Peer median band

$152,000–$230,850

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$154,062–$235,750

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 801 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 797 roles with salary data.

More like this

Similar roles

Senior Software Engineer - AI Research Clusters

Nvidia

Remote (Santa Clara, CA, US) · 29 days ago · $152,000–$241,500
Python · Kubernetes · Docker · GitLab CI · C++ · Rust · RSQL · REST API · JavaScript · CSS · Slurm · Linux · GPU Computing · AIOps · Agentic AI · CI/CD · Prometheus · Grafana

Software Engineer, AI Software Core

Qualcomm

San Diego, CA, US · 11 days ago · $158,400–$237,600
Python · TensorFlow · PyTorch · C++ · Linux · Android · Machine Learning · NLP · Multimedia · Statistics · Probability · Embedded Systems · Bayesian Methods

AI Software Engineer

Broadcom

Atlanta, GA (Perimeter), US · 37 days ago · $108,000–$172,800
Java · Spring · GitHub · Git · GitHub Actions · CI/CD · Micrometer · OpenTelemetry · Large Language Models · LLMs · Vector Databases · Langchain4J · Embable · Anthropic · OpenAI · Amazon Bedrock · Google GenAI · Azure OpenAI · Tanzu Platform 10 · Bitnami · Spring AI

AI Software Engineer

Booz Allen Hamilton

Arlington, Virginia, US · 58 days ago · $86,800–$198,000
Python · Rust · Go · Scala · Java · RESTful APIs · CI/CD · GitLab CI · Jenkins · Agentic AI solutions · Linux · Docker · AWS · LocalStack · ESXi · Ansible · Kubernetes · SIEMs · Security+ · Linux+

Sr. Software Engineer - Applied AI

GEICO

Remote (Palo Alto, CA office, US) · 43 days ago · $80,000–$215,000
Python · LangChain · HuggingFace · OpenAI · Kubernetes · CI/CD · Docker · Prometheus · Grafana · PostgreSQL · Redis · Apache Kafka · Spring AI · LangGraph · LangSmith · LlamaIndex · Anthropic APIs · Vector databases · Knowledge graphs · Java Spring ecosystem

Software Engineer I, AI Specialist

Warner Bros. Discovery

Remote (1050 Techwood Drive NW, Atlanta, GA, US) · 14 days ago
Python · LLMs · Prompt Engineering · Evaluation Frameworks · Human-in-the-Loop Workflows · AI Evaluation Practices · Content Classification · Taxonomy Management · Information Retrieval Concepts · Basic Scripting Skills · Cross-Functional Collaboration · CI/CD