Cloud SRE Engineer, Associate
Quick summary
- Work type: On-site
- Location: Dallas, TX
- Posted: 1 day ago
Employer
About Goldman Sachs
Goldman Sachs is a leading global investment banking, securities, and investment management firm providing financial services to corporations, financial institutions, governments, and individuals.
Goldman Sachs currently has 187 open roles on FindRole.
Listed pay typically runs $130,000–$250,000 across 60 roles with salary data.
Most-posted roles
- Asset & Wealth Management - Software Engineer - Vice President - Dallas (3)
- AMD Public-New York-Vice President-Software Engineering (2)
- Internal Audit, Technology Auditor-Investment Banking, Associate (2)
- Senior Software Engineer, Global Banking & Markets, Front Office Technology (2)
- AI Engineering, Vice President (New York, New Jersey, Toronto) (1)
At a glance
TL;DR · Cloud SRE Engineer, Associate
As a Cloud Site Reliability Engineer (SRE) at Goldman Sachs, you will join the WM Data Engineering ecosystem to ensure our AWS-based services are resilient and cost-effective. Your daily tasks include defining and enforcing Service Level Objectives using OpenSLO, implementing AI-driven observability stacks like Datadog or Amazon CloudWatch Container Insights for predictive monitoring, and leading incident response efforts with blameless post-mortems. You will also support the migration of on-premises microservices to Amazon ECS (Fargate/EC2) and develop Infrastructure as Code using Terraform or AWS CDK. Proficiency in Python or Go is essential for automation tasks, while experience with AWS core services, container orchestration, and observability tools like Prometheus and Grafana is required. This role demands a strong problem-solving mindset and the ability to communicate technical concepts effectively within a team environment.
What you'll do
- Define and enforce Service Level Objectives (SLOs) using OpenSLO to manage error budgets.
- Implement AI-driven observability stacks to predict and prevent system performance issues.
- Lead high-severity incident response and conduct blameless post-mortems for continuous improvement.
- Support the migration of on-premises microservices to Amazon ECS, ensuring reliable task definitions.
- Develop Infrastructure as Code (IaC) using Terraform or AWS CDK to automate infrastructure deployments.
- Identify and eliminate repetitive manual tasks by creating custom automation tools in Python or Go.
What we're looking for
- 4+ years of experience in SRE, DevOps, or Cloud Engineering roles, with a focus on production operations for distributed systems.
- Deep proficiency in Amazon ECS (Fargate and EC2 launch types) and Docker containerization.
- Strong programming skills in Python or Java for automation and tool development, and expert-level SQL for data analysis.
- Advanced knowledge of AWS core services including VPC, IAM, S3, Lambda, and networking technologies like Transit Gateway and PrivateLink.
- Hands-on experience with observability tools such as Prometheus, Grafana, AWS X-Ray, or Splunk.
- Proven ability to build automated deployment pipelines for ECS using CI/CD practices.