Site Reliability Engineer (SRE) - AI Platform & Cloud

Morgan Stanley

Quick summary

Work type: On-site
Location: Alpharetta, GA
Posted: 47 days ago
Nearby: 99+ roles within 25 mi

Market check

Salary context

How this pay compares to similar roles

[Chart: pay for similar roles, ranging from $134k to $229k, with a marker at $181k labeled "Similar"]

This listing doesn't post a salary. Most similar roles pay $150,075–$212,337.

Based on 240 similar postings.

Employer

About Morgan Stanley

Morgan Stanley is a global financial services firm providing investment banking, securities, wealth management, and investment management services to corporations, governments, institutions, and individuals.

Industry: Investment Banking & Financial Services

Morgan Stanley currently has 39 open roles on FindRole.

Listed pay typically runs $140,000–$165,000 across 37 roles with salary data.

At a glance

TL;DR · Site Reliability Engineer (SRE) - AI Platform & Cloud

Join Morgan Stanley's AI Platform team as a Site Reliability Engineer (SRE) at the Director level, where you will support, scale, and harden the infrastructure that powers the firm's AI/ML systems. You'll collaborate with multiple teams to ensure high availability, reliability, performance, and security in a regulated financial environment. Key responsibilities include operating and maintaining GenAI application infrastructure, designing automation for core platform capabilities, developing infrastructure as code (IaC) for provisioning resources, establishing SLOs/SLIs/SLAs, leading incident response, optimizing cost-performance tradeoffs, and integrating new tools to enhance reliability. Ideal candidates have hands-on experience with Kubernetes, cloud platforms (AWS, Azure, Google Cloud), API development, REST frameworks, data engineering, and large-scale API Gateway environments, along with a strong background in AI and generative AI solutions.

Skills

Kubernetes, AWS, Azure, Google Cloud, Python, Docker, Terraform, Prometheus, Grafana, REST framework, API Gateway, CI/CD, PostgreSQL, Redis, Snowflake, OpenTelemetry, Slurm, ModelOps, MLOps, LLMOps, Chaos Engineering

What you'll do

Operate, monitor, and maintain infrastructure supporting GenAI applications.
Design and build automation to reduce manual tasks in core platform capabilities.
Develop and enforce service level objectives and error budgets for AI systems.
Lead incident response, conduct root cause analysis, and implement remediation strategies.
Optimize cost-performance tradeoffs in large-scale compute environments while ensuring security.
Collaborate with cross-functional teams to ensure safe deployment and integration of new systems.
Define disaster recovery practices and maintain operational documentation and training materials.

What we're looking for

5+ years of production experience in SRE or infrastructure operations for large-scale systems
Strong hands-on experience with Kubernetes, cloud platforms (AWS, Azure, Google Cloud), and API-based development
Deep expertise in containerization (Docker), plus infrastructure-as-code and deployment tooling such as Terraform and Helm
Solid understanding of monitoring, observability, logging, and alerting tools such as Prometheus and Grafana
Experience in regulated environments with a focus on security, compliance, auditability, and data governance
Excellent communication skills and ability to collaborate across multiple teams for system integration and deployment

Similar roles

ASE Compute - Site Reliability Engineering (SRE) Manager

Apple Inc

Seattle, WA · 22 days ago · $216,600–$325,500

Kubernetes, Python, Go, Prometheus, Terraform, AWS, GCP, Azure, Puppet, Ansible, CI/CD, Docker, Java

Site Reliability Engineering (SRE) Manager

IBM

Research Triangle Park, NC · 37 days ago

Kubernetes, CI/CD, Python, Terraform, AWS, Azure, GCP, IBM Cloud, OpenShift, Jira, Scrum, Ansible, Prometheus, Grafana, Git, Docker, Linux, Networking, Security, Compliance, Risk Management

Site Reliability Engineering (SRE) Manager

IBM

Austin, TX · 37 days ago

Kubernetes, CI/CD, Python, Terraform, AWS, Azure, GCP, IBM Cloud, OpenShift, Jira, Scrum, Ansible, Prometheus, Grafana, PostgreSQL, Git, Docker, Linux, SSL/TLS, JSON/YAML
