About Goldman Sachs

Goldman Sachs is a leading global investment banking, securities, and investment management firm providing financial services to corporations, financial institutions, governments, and individuals.

Goldman Sachs currently has 187 open roles on FindRole.

Listed pay typically runs $130,000–$250,000 across 60 roles with salary data.

At a glance

TL;DR · Cloud SRE Engineer, Associate

As a Cloud Site Reliability Engineer (SRE) at Goldman Sachs, you will join the WM Data Engineering ecosystem to ensure our AWS-based services are resilient and cost-effective. Your daily tasks include defining and enforcing Service Level Objectives using OpenSLO, implementing AI-driven observability stacks like Datadog or Amazon CloudWatch Container Insights for predictive monitoring, and leading incident response efforts with blameless post-mortems. You will also support the migration of on-premises microservices to Amazon ECS (Fargate/EC2) and develop Infrastructure as Code using Terraform or AWS CDK. Proficiency in Python or Go is essential for automation tasks, while experience with AWS core services, container orchestration, and observability tools like Prometheus and Grafana is required. This role demands a strong problem-solving mindset and the ability to communicate technical concepts effectively within a team environment.

Skills

AWS, Terraform, Python, Docker, Amazon ECS, OpenSLO, Datadog, Prometheus, Grafana, AWS CloudWatch Container Insights, AWS CDK, CI/CD, SQL, AWS X-Ray, GitHub Actions, AWS CodePipeline, AWS Transit Gateway, AWS PrivateLink

What you'll do

Define and enforce Service Level Objectives (SLOs) using OpenSLO to manage error budgets.
Implement AI-driven observability stacks to predict and prevent system performance issues.
Lead high-severity incident response and conduct blameless post-mortems for continuous improvement.
Support the migration of on-premises microservices to Amazon ECS, ensuring reliable task definitions.
Develop Infrastructure as Code (IaC) using Terraform or AWS CDK to automate infrastructure deployments.
Identify and eliminate repetitive manual tasks by creating custom automation tools in Python or Go.

What we're looking for

4+ years of experience in SRE, DevOps, or Cloud Engineering roles, with a focus on production operations for distributed systems.
Deep proficiency in Amazon ECS (Fargate and EC2 launch types) and Docker containerization.
Strong programming skills in Python or Java for automation and tool development, and expert-level SQL for data analysis.
Advanced knowledge of AWS core services including VPC, IAM, S3, Lambda, and networking technologies like Transit Gateway and PrivateLink.
Hands-on experience with observability tools such as Prometheus, Grafana, AWS X-Ray, or Splunk.
Proven ability to build automated deployment pipelines for ECS using CI/CD practices.