Manager, Site Reliability Engineering

US, CA, Santa Clara, United States of America Posted 4 hours, 40 minutes ago

$208,000 - $333,500/year

Job Description

NVIDIA is the leading artificial intelligence computing company and is paving the way with innovations in self-driving cars, machine learning, supercomputing, gaming and visualization. NVIDIA gives automakers, tier-1 suppliers, automotive research institutions, and start-ups the power and flexibility to develop and deploy breakthrough artificial intelligence systems for self-driving vehicles. We are developing the software and driving the processes for software development. We are looking for a seasoned SRE manager to lead the Infrastructure and Operations team.

What you’ll be doing:

Lead the team of site reliability engineers responsible for automating maintenance of 10,000+ hosts and supporting customers in debugging their workflows
Maintain service-level agreements (SLAs)
Bring a passion for continuous improvement, driving critical metrics for customer responsiveness and delivering on service-level agreements
Apply AI techniques and data analytics to extract useful signals about machines and jobs, ensuring high availability and resiliency of the systems in the data center
Take part in prototyping, designing and developing cloud infrastructure for NVIDIA

What we need to see:

Solid programming background in Python and/or relevant scripting languages
Experience maintaining large-scale cloud infrastructure applications
Excellent debugging and problem-solving skills
An extraordinary teammate who collaborates well across time zones
Proven track record of delivering solutions using Agile processes and methodologies
BS/MS in Computer Science, Computer Engineering or equivalent experience
8+ years of overall industry experience, including 2+ years of people management experience

Ways to stand out from the crowd:

Previous experience in managing and leading small engineering teams
Experience using and improving data centers
Experience with computer algorithms and the ability to choose the best algorithms to meet scaling challenges
Ability to divide complex problems into simple subproblems and reuse available solutions to implement most of them
Ability to design simple systems that work reliably without needing much support

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most brilliant and talented people in the world working for us. If you're creative and autonomous, we want to hear from you!

LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 208,000 USD - 333,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until March 26, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing
