Site Reliability Engineer (SRE) - AI Platform & Cloud

Morgan Stanley

Quick summary

Work type: On-site
Location: Alpharetta, GA
Posted: 47 days ago

Market check

Salary context

How this pay compares to similar roles

Similar roles: $181k (chart scale: $134k to $229k)

This listing doesn't post a salary. Most similar roles pay $150,075–$212,337.

Based on 240 similar postings.

Employer

About Morgan Stanley

Morgan Stanley is a global financial services firm providing investment banking, securities, wealth management, and investment management services to corporations, governments, institutions, and individuals.

Industry: Investment Banking & Financial Services

Morgan Stanley currently has 39 open roles on FindRole.

Listed pay typically runs $140,000–$165,000 across 37 roles with salary data.

At a glance

TL;DR · Site Reliability Engineer (SRE) - AI Platform & Cloud

Join Morgan Stanley's AI Platform team as a Director-level Site Reliability Engineer (SRE) to support, scale, and harden the infrastructure that powers the firm's AI/ML systems. You'll collaborate with multiple teams to keep those systems available, reliable, performant, and secure in a regulated financial environment. Key responsibilities include operating and maintaining GenAI application infrastructure, designing automation for core platform capabilities, developing infrastructure as code (IaC) to provision resources, establishing SLOs, SLIs, and SLAs, leading incident response, optimizing cost-performance tradeoffs, and integrating new tools that improve reliability. Ideal candidates have hands-on experience with Kubernetes, cloud platforms (AWS, Azure, Google Cloud), API development, REST frameworks, data engineering, and large-scale API gateway environments, along with a strong background in AI and generative AI solutions.

What you'll do

  • Operate, monitor, and maintain infrastructure supporting GenAI applications.
  • Design and build automation to reduce manual tasks in core platform capabilities.
  • Develop and enforce service level objectives and error budgets for AI systems.
  • Lead incident response, conduct root cause analysis, and implement remediation strategies.
  • Optimize cost-performance tradeoffs in large-scale compute environments while ensuring security.
  • Collaborate with cross-functional teams to ensure safe deployment and integration of new systems.
  • Define disaster recovery practices and maintain operational documentation and training materials.

What we're looking for

  • 5+ years of production experience in SRE or infrastructure operations for large-scale systems
  • Strong hands-on experience with Kubernetes, cloud platforms (AWS, Azure, Google Cloud), and API-based development
  • Deep expertise in containerization (Docker) and in provisioning and deployment tools such as Terraform and Helm
  • Solid understanding of monitoring, observability, logging, and alerting tools such as Prometheus and Grafana
  • Experience in regulated environments with a focus on security, compliance, auditability, and data governance
  • Excellent communication skills and ability to collaborate across multiple teams for system integration and deployment
