Senior Product Manager, AI Factory Infra

Nvidia

Hybrid · Actively hiring · Posted this week
Santa Clara, CA · New York, NY · Seattle, WA · Posted 5 days ago · $208,000–$327,750 / year

At a glance

AI generated

TL;DR

As a Senior Product Manager on NVIDIA’s AI Factory team, you will lead the development of resilient automation systems that maintain AI infrastructure at scale. You will set the strategy and roadmap for break-fix automation across multiple vendors and cloud service providers, and integrate it with failure attribution and automated repair actions. You will define the automation confidence thresholds and intervention points that balance speed against safety, and improve operator UX through transparent workflows and audit trails. You will also work closely with NCP operators, SRE teams, and hardware vendor partners to optimize repair workflows and integrate RMA processes at scale. The role requires over 12 years of product management experience in infrastructure or MLOps, expertise in distributed systems and workflow orchestration, and a strong background in reliability engineering and chaos testing.

Skills

AWS · Kubernetes · Terraform · Docker · CI/CD · Prometheus · Grafana · Python · PostgreSQL · MLOps · GPU · Reliability Engineering · SLO · Chaos Engineering · Agentic AI · Workflow Orchestration · RMA · Vendor SLA Oversight

What you'll do

  • Define automation confidence thresholds that balance speed with operational safety.
  • Build operator UX for repair queues and workflow transparency so operators can act quickly.
  • Drive integration between failure detection and automated repair.
  • Define repair SLOs and own the metrics framework for fleet availability and recovery times.
  • Collaborate with NCP operators, SRE teams, and hardware vendor partners on RMA processes.

What we're looking for

  • Over 12 years of product management experience in infrastructure or MLOps.
  • BS or MS in Computer Science, Engineering, or a related technical field.
  • Expertise in distributed systems and workflow orchestration with safety considerations.
  • Proven track record owning products with significant operational impact.
  • Strong skills in designing operator UX for complex system states under pressure.
  • Experience collaborating across engineering, SRE, and external vendor teams.

Employer

About Nvidia

Nvidia is a leading designer of graphics processing units (GPUs) and system-on-chip units, powering gaming, professional visualization, data centers, and artificial intelligence workloads.

Industry: Semiconductors & AI Computing

Nvidia currently has 825 open roles on FindRole.

Listed pay typically runs $184,000–$287,500 across 813 roles with salary data.
