About Goldman Sachs

Goldman Sachs is a leading global investment banking, securities, and investment management firm providing financial services to corporations, financial institutions, governments, and individuals.

Goldman Sachs currently has 187 open roles on FindRole.

Listed pay typically runs $130,000–$250,000 across 60 roles with salary data.

At a glance

TL;DR · Site Reliability Engineer, Asset & Wealth Management

As a Vice President in Site Reliability Engineering at Goldman Sachs, you will lead the strategic direction for ensuring the availability, scalability, and performance of critical platform services. Your role involves architecting highly available systems, developing advanced automation tools, managing complex incidents, and conducting post-mortem analyses to enhance system resilience. You will collaborate with development teams on capacity planning and observability strategies, providing technical vision and mentorship while evaluating cutting-edge technologies for integration. The position requires extensive experience in SRE, proficiency in languages like Java, Python, or Go, expertise in cloud platforms (AWS, GCP), containerization tools (Docker, Kubernetes), and IaC solutions (Terraform). You will work on large-scale distributed systems, ensuring reliability across Goldman Sachs’ global operations.

Skills

Python, Java, Go, AWS, GCP, Docker, Kubernetes, Terraform, Puppet, Chef, Ansible, Prometheus, Grafana, ELK stack, Datadog, PagerDuty, Jenkins, GitLab, Maven, CI/CD, Elasticsearch, BigQuery, Kafka

What you'll do

Drive strategic reliability and performance for mission-critical applications and services.
Lead the design and implementation of resilient infrastructure and application architectures.
Develop advanced automation solutions to optimize operational workflows across the enterprise.
Conduct root cause analysis and implement preventative measures for system stability.
Embed reliability into application design from inception, leading comprehensive capacity planning.
Define and implement monitoring strategies to provide deep insights into system performance.

What we're looking for

At least 6 years of hands-on experience in Site Reliability Engineering at an enterprise level.
Expertise in cloud platforms (AWS, GCP) and in containerization and orchestration technologies (Docker, Kubernetes).
Mastery of Infrastructure as Code and configuration management tools (Terraform, Puppet, Ansible).
Advanced proficiency in monitoring, alerting, logging, and tracing solutions (Prometheus, Grafana, ELK stack).
Strong foundation in databases, distributed systems, and CI/CD practices.
Exceptional problem-solving abilities with a track record of resolving complex technical challenges.
Advanced degree in Computer Science or related technical field involving coding/systems engineering.