Principal AI Inference Systems Engineer

AMD

Quick summary

Work type: On-site
Location: Santa Clara, CA
Posted: 107 days ago
Closes: Mar 12, 2027

Market check

Salary context

How this pay compares to similar roles

[Salary chart: scale from $163k to $259k; similar roles marked at $210k]

This listing doesn't post a salary. Most similar roles pay $173,200–$246,150.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets.

Industry: Semiconductors

AMD currently has 56 open roles on FindRole.


At a glance

TL;DR · Principal AI Inference Systems Engineer

As a Principal AI Infrastructure Solution Engineer at AMD, you will join the AI software team to design and validate Kubernetes architectures for large-scale LLM training and inference on AMD Instinct GPUs. Your daily tasks include architecting distributed training stacks, implementing gang scheduling, and optimizing GPU orchestration using tools like Kubeflow Training Operator and SLURM controllers. You will work closely with enterprise customers to deploy production-ready AMD GPU clusters, benchmark performance, and develop tuning guides for efficient communication and workload-specific optimizations. This role requires expertise in Kubernetes GPU orchestration, distributed training on Kubernetes, and hands-on experience with AI infrastructure at scale, making it ideal for someone with a strong background in deploying large-scale GPU clusters and enabling customers through complex platform deployments.

What you'll do

  • Design and deliver reference architectures for LLM training and inference on AMD GPUs.
  • Architect and validate Kubernetes-based distributed training stacks for large-scale LLM workloads.
  • Define and implement gang scheduling and topology-aware GPU placement for multi-node training.
  • Enable Kubernetes-native training controllers including Kubeflow Training Operator, MPI Operator, Volcano, and Kueue.
  • Implement and validate GPU orchestration using Kubernetes GPU Operator, device plugins, and metrics exporters.

What we're looking for

  • Extensive experience deploying and operating large-scale GPU clusters for production AI training and inference.
  • Deep expertise in Kubernetes GPU orchestration including operators, device plugins, scheduling, multi-tenancy, and observability.
  • Hands-on experience with distributed training on Kubernetes using Kubeflow, MPI Operator, Volcano, Kueue, and Ray.
  • Strong knowledge of gang scheduling, elastic jobs, quotas, priority, and shared GPU environments in AI workloads.
  • Experience tuning Kubernetes networking and storage for high-performance AI workloads, including RDMA and scalable checkpointing.

More like this

Similar roles

Principal AI Inference Systems Engineer

AMD

Santa Clara, CA · 82 days ago
Python C/C++ Kubernetes Ray Kubeflow Megatron-LM DeepSpeed PyTorch Distributed NCCL RCCL MPI GPU ROCm HIP Quantization Mixed-Precision TP/PP/DP/ZeRO Profiling Tools Performance-Analysis Tools
Hybrid

AI Inference Performance Engineer

Nvidia

Santa Clara, CA · 111 days ago · $152,000–$241,500
Python C++ PyTorch JAX TensorRT-LLM vLLM SGLang CUDA MPI NCCL K8S CUTLASS cuteDSL tilelang OpenAI_Triton torch.compile GPU FPGA roofline_analysis performance_profiling
Hybrid

Principal AI Engineer

Salesforce

New York +4 · 31 days ago · $218,400–$365,200
Salesforce Distributed Systems CI/CD Infrastructure-as-Code API Integration AI Agents LLM Workflows Automated Testing Observability Event-Driven Design Microservices Security & Compliance Prompt Engineering System Context Design Evaluation Frameworks GitHub Copilot Claude Code Cursor Salesforce Marketing Cloud Agentforce Google Workspace Slack

Principal AI Engineer

Salesforce

Remote (San Francisco, CA) +4 · 30 days ago · $197,300–$313,700
AWS Python GitHub Actions ArgoCD Terraform Docker Kubernetes Grafana Braintrust LangSmith CI/CD AgentOps Salesforce Ecosystem Vector Databases Graph Databases RAG Pipelines Snowflake Kafka Flink
Remote