Senior HPC and LSF Operations Engineer

Nvidia

Hybrid · Actively hiring
Santa Clara, CA · Westford, MA · Austin, TX · Durham, NC · Posted 84 days ago · $152,000–$241,500 / year

At a glance

AI generated

TL;DR

As a Senior Infrastructure Engineer on the Hardware Infrastructure EDA Compute team, you will manage and optimize large-scale job scheduling systems like LSF and Slurm to enhance design velocity and infrastructure efficiency. Your daily tasks include analyzing performance data to identify bottlenecks, leading problem-solving efforts across multiple layers, and implementing automation to reduce manual effort. You will also define metrics for service reliability, contribute to operational standards, and partner with customer teams to clarify requirements and drive issue resolution. The ideal candidate has at least five years of experience in large-scale Linux-based compute infrastructure management, proficiency in systems administration, and expertise in job scheduling system tuning and troubleshooting. Additionally, familiarity with container technologies and observability systems is beneficial, as you will work within a high-performance computing environment to address evolving compute demands.

Skills

LSF · Slurm · Linux · CentOS · RHEL · Docker · Singularity · Podman · HPC · Reliability Engineering · Metrics Collection · Monitoring Pipelines · Alerting Strategies · Performance Dashboards · Container Technologies · Job Scheduling Systems

What you'll do

  • Manage and optimize job scheduling systems to enhance utilization and throughput.
  • Analyze performance data to identify and resolve systemic bottlenecks efficiently.
  • Implement automation and process improvements to reduce manual effort.
  • Define and track metrics for service reliability and performance.
  • Contribute to operational standards and best practices documentation.
  • Partner with customer teams to clarify requirements and drive issue resolution.

What we're looking for

  • 5+ years of experience operating large-scale Linux-based compute infrastructure.
  • Strong hands-on skills supporting and tuning job scheduling systems like LSF and Slurm.
  • Proficiency in Linux systems administration with CentOS/RHEL.
  • Expertise in advanced troubleshooting techniques for complex system issues.
  • Experience implementing reliability engineering practices in HPC environments.

Market check

Salary context

Competitive pay

How this pay compares to similar roles

Chart: this role $197k vs. similar roles $187k, on a $138k–$253k scale; the shaded band marks where most similar roles pay.

This role pays more than 61% of similar roles. Most pay $149,807–$223,700 — the shaded band above. At the midpoint, this role pays about $197k versus about $187k for comparable roles.

Based on 240 similar postings.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 824 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 812 roles with salary data.

More like this

Similar roles

Senior HPC Cluster Engineer

Nvidia

Santa Clara, CA · 84 days ago · $152,000–$241,500
Slurm · Kubernetes · Python · Bash · Docker · Enroot · Prometheus · Grafana · Linux · RHEL · Ubuntu · MPI · NCCL · CUDA · NVIDIA GPUs · InfiniBand · RDMA · RoCE · Lustre · GPFS · Ansible · MLPerf

HPC Engineer

Arm Holdings

Austin, TX · 22 days ago · $130,100–$176,000
Python · Bash · Kubernetes · Docker · AWS · GCP · Azure · Terraform · Ansible · IBM Spectrum LSF · Prometheus · Grafana · CI/CD · DevOps · SRE · Infrastructure as Code · Slurm · Jira · Confluence
Hybrid

Senior HPC Performance Engineer

Nvidia

Remote (OR) · 47 days ago · $184,000–$287,500
Fortran · C · C++ · OpenACC · OpenMP · MPI · CUDA · Performance analysis · Parallel programming · Linear algebra · Numerical methods · Assembly language · Debugging · Porting
Remote

Senior HPC Storage Engineer

Nvidia

Santa Clara, CA · 73 days ago · $184,000–$287,500
Python · Docker · Ceph · Weka.io · Vast · Lustre · GPFS · CUDA · NCCL · PyTorch · TensorFlow · Bash · CentOS · RHEL · Ubuntu · SDN · MLPerf · NVIDIA GPUs · HDDs · SSDs · NVMe

Senior HPC Storage Architect & Engineer

Lam Research

Fremont, CA · 141 days ago · $114,000–$253,000
Lustre · GPFS/Spectrum Scale · VAST Data · WEKA · NetApp ONTAP · FlexCache · AWS · Azure · GCP · InfiniBand · RoCE · NVMe-over-Fabrics · SLURM · xCAT · Warewulf · Ansible · Terraform · Python · YAML · Kubernetes · CSI · S3 · IaC · CI/CD
Hybrid

Senior HPC and Quantum Systems Engineer

Nvidia

Remote (Westford, MA) · 145 days ago · $224,000–$356,500
Linux · Slurm · CUDA-Q · cuQuantum · NVQlink · Python · C++ · NVIDIA · HPC · GPU · Neutral atom · Trapped ion · Superconducting · Photonic · Qiskit · Cirq · PennyLane · Braket · Networking · Storage · Data center infrastructure · Quantum computing
Remote