Principal AI Inference Systems Engineer

AMD

Hybrid

Quick summary

Work type: Hybrid
Location: Santa Clara, CA
Posted: 82 days ago
Closes: Apr 6, 2027

Market check

Salary context

How this pay compares to similar roles

Salary chart: marker at $210k for similar roles, on a scale from $163k to $259k.

This listing doesn't post a salary. Most similar roles pay $173,200–$246,150.

Based on 240 similar postings.

Employer

About AMD

AMD (Advanced Micro Devices) is a semiconductor company that develops high-performance processors, graphics cards, and adaptive computing solutions for gaming, data centers, and embedded markets. Industry: Semiconductors

AMD currently has 56 open roles on FindRole.

At a glance

TL;DR · Principal AI Inference Systems Engineer

AMD's Llama team seeks a Senior Staff AI Infra Engineer to lead technical initiatives and provide architectural guidance for optimizing AI/ML workloads on AMD GPUs. This role involves enhancing the performance of Large Language Models (LLMs) and Agentic AI systems through kernel, communication, and system-level optimizations. The ideal candidate will have 5+ years of experience in AI infrastructure and distributed systems, with expertise in C/C++ and Python, as well as a deep understanding of transformer architectures and frameworks like Megatron-LM and PyTorch Distributed. Familiarity with GPU architecture and tools such as ROCm, NCCL, and Kubernetes is essential, alongside strong problem-solving and communication skills to drive technical excellence and foster collaboration across teams.

What you'll do

  • Lead technical initiatives and provide architectural guidance for AI/ML infrastructure.
  • Optimize LLM training and inference on AMD GPUs to enhance system efficiency.
  • Develop infrastructure supporting Large Language Models (LLMs) and Agentic AI systems.
  • Design and optimize AI workloads on GPU clusters, including large-scale orchestration.
  • Debug and resolve complex performance issues across GPU, network, and runtime layers.
  • Drive technical excellence and foster innovation within the organization.

What we're looking for

  • 5+ years of experience in AI/ML infrastructure and performance-critical software development.
  • Expert proficiency in C/C++ and Python for AI/ML projects.
  • Solid understanding of transformer architectures and distributed training frameworks such as Megatron-LM, DeepSpeed, and PyTorch Distributed.
  • Proven experience optimizing LLM training and inference pipelines with parallelism techniques.
  • Hands-on experience designing and scaling AI platforms using Kubernetes, Ray, or Kubeflow.
  • Familiarity with GPU architecture and communication libraries for multi-GPU training optimization.
  • Demonstrated technical ownership and strong problem-solving skills in delivering end-to-end AI/ML solutions.

More like this

Similar roles

Principal AI Inference Systems Engineer

AMD

Santa Clara, CA · 107 days ago
Kubernetes SLURM vLLM SGLang MPI Operator Volcano Kueue Kubeflow Training Operator GPU Operator NCCL RCCL RDMA CNI Prometheus Grafana Python CI/CD AMD Instinct GPUs

AI Inference Performance Engineer

Nvidia

Santa Clara, CA · 111 days ago · $152,000–$241,500
Python C++ PyTorch JAX TensorRT-LLM vLLM SGLang CUDA MPI NCCL K8S CUTLASS cuteDSL tilelang OpenAI_Triton torch.compile GPU FPGA roofline_analysis performance_profiling
Hybrid

Principal AI Engineer

Salesforce

New York +4 · 31 days ago · $218,400–$365,200
Salesforce Distributed Systems CI/CD Infrastructure-as-Code API Integration AI Agents LLM Workflows Automated Testing Observability Event-Driven Design Microservices Security & Compliance Prompt Engineering System Context Design Evaluation Frameworks GitHub Copilot Claude Code Cursor Salesforce Marketing Cloud Agentforce Google Workspace Slack

Principal AI Engineer

Salesforce

Remote (San Francisco, CA) +4 · 30 days ago · $197,300–$313,700
AWS Python GitHub Actions ArgoCD Terraform Docker Kubernetes Grafana Braintrust LangSmith CI/CD AgentOps Salesforce Ecosystem Vector Databases Graph Databases RAG Pipelines Snowflake Kafka Flink
Remote