Principal AI Inference Systems Engineer

AMD

Quick summary

Work type: On-site
Location: Santa Clara, CA
Posted: 107 days ago
Closes: Mar 12, 2027
Nearby: 99+ roles within 25 mi

Market check

Salary context

How this pay compares to similar roles

[Chart: pay scale running from $163k to $259k, with similar roles marked at $210k]

This listing doesn't post a salary. Most similar roles pay $173,200–$246,150.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 56 open roles on FindRole.

At a glance

TL;DR · Principal AI Inference Systems Engineer

As a Principal AI Infrastructure Solution Engineer at AMD, you will join the AI software team to design and validate Kubernetes architectures for large-scale LLM training and inference on AMD Instinct GPUs. Your daily tasks include architecting distributed training stacks, implementing gang scheduling, and optimizing GPU orchestration using tools like Kubeflow Training Operator and SLURM controllers. You will work closely with enterprise customers to deploy production-ready AMD GPU clusters, benchmark performance, and develop tuning guides for efficient communication and workload-specific optimizations. This role requires expertise in Kubernetes GPU orchestration, distributed training on Kubernetes, and hands-on experience with AI infrastructure at scale, making it ideal for someone with a strong background in deploying large-scale GPU clusters and enabling customers through complex platform deployments.

Skills

Kubernetes, SLURM, vLLM, SGLang, MPI Operator, Volcano, Kueue, Kubeflow Training Operator, GPU Operator, NCCL, RCCL, RDMA, CNI, Prometheus, Grafana, Python, CI/CD, AMD Instinct GPUs

What you'll do

Design and deliver reference architectures for LLM training and inference on AMD GPUs.
Architect and validate Kubernetes-based distributed training stacks for large-scale LLM workloads.
Define and implement gang scheduling and topology-aware GPU placement for multi-node training.
Enable Kubernetes-native training controllers including Kubeflow Training Operator, MPI Operator, Volcano, and Kueue.
Implement and validate GPU orchestration using Kubernetes GPU Operator, device plugins, and metrics exporters.

What we're looking for

Extensive experience in deploying and operating large-scale GPU clusters for production AI training and inference.
Deep expertise in Kubernetes GPU orchestration including operators, device plugins, scheduling, multi-tenancy, and observability.
Hands-on experience with distributed training on Kubernetes using Kubeflow, MPI Operator, Volcano, Kueue, and Ray.
Strong knowledge of gang scheduling, elastic jobs, quotas, priority, and shared GPU environments in AI workloads.
Experience tuning Kubernetes networking and storage for high-performance AI workloads, including RDMA and scalable checkpointing.

Similar roles

Principal AI Inference Systems Engineer

AMD

Santa Clara, CA · 82 days ago

Python, C/C++, Kubernetes, Ray, Kubeflow, Megatron-LM, DeepSpeed, PyTorch Distributed, NCCL, RCCL, MPI, GPU, ROCm, HIP, Quantization, Mixed-Precision, TP/PP/DP/ZeRO, Profiling Tools, Performance-Analysis Tools

Hybrid

AI Inference Performance Engineer

Nvidia

Santa Clara, CA · 111 days ago · $152,000–$241,500

Python, C++, PyTorch, JAX, TensorRT-LLM, vLLM, SGLang, CUDA, MPI, NCCL, K8S, CUTLASS, cuteDSL, tilelang, OpenAI_Triton, torch.compile, GPU, FPGA, roofline_analysis, performance_profiling

Hybrid

Principal AI Engineer

Salesforce

New York +4 · 31 days ago · $218,400–$365,200

Salesforce, Distributed Systems, CI/CD, Infrastructure-as-Code, API Integration, AI Agents, LLM Workflows, Automated Testing, Observability, Event-Driven Design, Microservices, Security & Compliance, Prompt Engineering, System Context Design, Evaluation Frameworks, GitHub Copilot, Claude Code, Cursor, Salesforce Marketing Cloud, Agentforce, Google Workspace, Slack

Principal AI Engineer

Salesforce

Remote (San Francisco, CA) +4 · 30 days ago · $197,300–$313,700

AWS, Python, GitHub Actions, ArgoCD, Terraform, Docker, Kubernetes, Grafana, Braintrust, LangSmith, CI/CD, AgentOps, Salesforce Ecosystem, Vector Databases, Graph Databases, RAG Pipelines, Snowflake, Kafka, Flink

Remote

Principal AI Performance Engineer

AMD

San Jose, CA · 108 days ago

Python, C++, vLLM, SGLang, TensorRT-LLM, HIP, CUDA, Triton, CK, Linux, GPU, AI agents, CI/CD, PyTorch, Kubernetes

Hybrid
