Principal AI Inference Systems Engineer in Santa Clara, California | Advanced Micro Devices, Inc

AMD

Actively hiring
Santa Clara, CA Posted 81 days ago $237,200 / year

At a glance

AI generated

TL;DR

As a Principal AI Infrastructure Solution Engineer at AMD, you will join the AI software team to design and validate Kubernetes architectures for large-scale LLM training and inference on AMD Instinct GPUs. Your daily tasks include architecting distributed training stacks, implementing gang scheduling, and optimizing GPU orchestration using tools like Kubeflow Training Operator and SLURM controllers. You will work closely with enterprise customers to deploy production-ready AMD GPU clusters, benchmark performance, and develop tuning guides for efficient communication and workload-specific optimizations. This role requires expertise in Kubernetes GPU orchestration, distributed training on Kubernetes, and hands-on experience with AI infrastructure at scale, making it ideal for someone with a strong background in deploying large-scale GPU clusters and enabling customers through complex platform deployments.

Skills

Kubernetes SLURM vLLM SGLang MPI Operator Volcano Kueue Kubeflow Training Operator GPU Operator NCCL RCCL RDMA CNI Prometheus Grafana Python CI/CD AMD Instinct GPUs

What you'll do

  • Design and deliver reference architectures for LLM training and inference on AMD GPUs.
  • Architect and validate Kubernetes-based distributed training stacks for large-scale LLM workloads.
  • Define and implement gang scheduling and topology-aware GPU placement for multi-node training.
  • Enable Kubernetes-native training controllers including Kubeflow Training Operator, MPI Operator, Volcano, and Kueue.
  • Implement and validate GPU orchestration using Kubernetes GPU Operator, device plugins, and metrics exporters.

What we're looking for

  • Extensive experience in deploying and operating large-scale GPU clusters for production AI training and inference.
  • Deep expertise in Kubernetes GPU orchestration including operators, device plugins, scheduling, multi-tenancy, and observability.
  • Hands-on experience with distributed training on Kubernetes using Kubeflow, MPI Operator, Volcano, Kueue, and Ray.
  • Strong knowledge of gang scheduling, elastic jobs, quotas, priority, and shared GPU environments in AI workloads.
  • Experience tuning Kubernetes networking and storage for high-performance AI workloads, including RDMA and scalable checkpointing.

Market check

Salary context

This $237,200 salary sits above 90% of similar postings on FindRole.

Peer median band

$120,500–$234,000

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$145,450–$214,500

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 69 open roles on FindRole.

Listed pay typically runs $187,760 across 69 roles with salary data.

More like this

Similar roles

Senior, Software Engineer

Walmart

Sunnyvale, CA 34 days ago $117,000–$234,000
Python Java Kafka Docker CI/CD Microservices APIs Video Streaming Real-time Analytics Kubernetes PostgreSQL AWS Azure Git Jenkins Prometheus Grafana Terraform Open-source Libraries SDLC Secure Coding

Senior, Software Engineer

Walmart

Sunnyvale, CA 49 days ago $117,000–$234,000
Java Python Azure Cosmos DB CI/CD Kubernetes Docker Terraform GenAI tools PostgreSQL AWS Git J2EE Swagger/OpenAPI JUnit Selenium SonarQube Maven Gradle Spring Boot Hibernate

Senior, Software Engineer

Walmart

Sunnyvale, CA 77 days ago $117,000–$234,000
Kotlin Android SDK Gradle Dagger REST GraphQL MockK Google Truth Robolectric Espresso MVVM MVP Clean Architecture Git CI/CD Multithreading Networking Offline Storage Performance Tuning
Hybrid

Senior, Software Engineer

Walmart

Sunnyvale, CA 1 day ago
Java Rust NodeJS GraphQL Apollo Federation Framework React TypeScript Kubernetes Docker CI/CD Prometheus Grafana AWS Azure Google Cloud Platform PostgreSQL MongoDB Redis Git Jenkins Terraform Kafka
Hybrid

Senior, Software Engineer

Walmart

Sunnyvale, CA 34 days ago $117,000–$234,000
Python Java Kafka Docker CI/CD Microservices APIs Video Streaming Real-time Analytics Kubernetes PostgreSQL AWS Azure Git Jenkins Prometheus Grafana Terraform Scalability Security Telemetry

Senior, Software Engineer

Walmart

Bentonville, AR 1 day ago
Java Spring Boot REST Services Cloud technologies Agile methodology Docker Kubernetes CI/CD PostgreSQL MySQL AWS Terraform