Observability Lead - Cloud SRE & Network Reliability

Lam Research

Actively hiring

Fremont, CA | Posted 45 days ago | $114,000–$253,000 / year

At a glance

AI generated

TL;DR

Join our GIS Infrastructure Platform Engineering team as an Observability Lead, a senior role requiring extensive experience in Site Reliability Engineering (SRE) and multi-cloud networking. You will lead a team responsible for delivering robust observability frameworks, SLA/SLO/SLI disciplines, disaster recovery plans, backup operations, and network reliability across Azure, AWS, and GCP. Daily tasks include designing multi-cloud networking architectures, implementing AI-driven workflows for AIOps, and building monitoring pipelines using tools like Prometheus, Grafana, and Datadog. The ideal candidate has a deep understanding of automation with Ansible, Terraform, Python, and Kubernetes, as well as strong programming skills in Python or Go. Experience with compliance-driven disaster recovery programs and multi-cloud cost observability is preferred.

Skills

Azure, AWS, GCP, Prometheus, Grafana, Datadog, PagerDuty, Terraform, Python, Kubernetes, CI/CD, Ansible, Go, IaC, SLA/SLO/SLI, DR/BCP, ExpressRoute, DirectConnect, CloudInterconnect, MPLS, LLM-based agents, RAG patterns

What you'll do

Lead the development of a world-class observability platform across Azure, AWS, and GCP.
Define and enforce SLA/SLO/SLI frameworks to drive continuous improvement in error budget management.
Design multi-cloud networking architectures for robust inter-region connectivity and reliability.
Implement agentic AI workflows using LLM-based agents for AIOps-driven fault detection.
Own disaster recovery and business continuity planning, including failover validation and DR drills.

What we're looking for

Over 12 years of experience in Infrastructure, SRE, DevOps, or Network Engineering
At least 6 years of leadership experience in high-performing SRE or Platform Engineering teams
Expertise in defining and enforcing SLA/SLO/SLI frameworks with error budget management
Hands-on experience in disaster recovery (DR) and business continuity planning (BCP)
Deep knowledge of multi-cloud networking across Azure, AWS, and GCP
Experience building and operating observability platforms using Prometheus, Grafana, Datadog, etc.
Strong skills in automation, including Ansible, Terraform, Python, and self-healing pipelines

Market check

Salary context

This $114,000–$253,000 range sits above 58% of similar postings on FindRole.

Peer median band

$130,150–$214,075

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,400–$214,012

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Lam Research

Lam Research Corporation is a leading American supplier of wafer-fabrication equipment and services to the global semiconductor industry.

Lam Research currently has 238 open roles on FindRole.

Listed pay typically runs $114,000–$231,000 across 116 roles with salary data.

Similar roles

Principal Architect - Cloud and Observability

CVS Health

Remote (Work At Home-Illinois, US) | 53 days ago | $144,200–$288,400

OpenTelemetry, Grafana, Mimir, Loki, Tempo, Kubernetes, AWS, Azure, GCP, Prometheus, Datadog, Splunk, Dynatrace, SLOs, SLIs, SPIFFE, SPIRE, Terraform, FinOps, ServiceNow, xMatters

Software Engineering Technical Leader - Observability Platforms (Remote or Hybrid)

Cisco

Remote (USA-Boulder, US) | 8 days ago | $174,700–$253,400

AWS, GCP, Azure, Kubernetes, Terraform, Python, Golang, Docker, Prometheus, Splunk, Linux, DevOps, SRE, CI/CD, PostgreSQL, Timeseries databases

Principal Firmware Engineer – Server Manageability and Observability

Nvidia

Santa Clara, CA | 16 days ago | $272,000–$431,250

Linux, NVIDIA GPUs, InfiniBand, TCP/IP, Ethernet, Redfish, IPMI, MCTP, PLDM, SPDM, RDE, OpenBMC, CUDA, cuDNN, DOCA, OCP, DMTF, HPC, CI/CD

Product Owner, Mainframe Networking & Observability

Broadcom

Pittsburgh, PA | 69 days ago | $104,100–$166,500

Agile, Scrum, Product Backlog Management, User Stories, Acceptance Criteria, CI/CD, Mainframe, z/OS, TSO/ISPF, JCL, TCP/IP, SNA, Machine Learning, Observability, Networking, Dashboarding Tools

Staff Observability Platform Engineer (SRE)

CVS Health

Remote (Scottsdale, Arizona) | 10 days ago | $118,450–$236,900

Prometheus, Grafana, Kubernetes, AWS, Python, Java, OpenTelemetry, PostgreSQL, Docker, CI/CD, Terraform, MySQL, Loki, Tempo

Lead Data Engineer (Cloud Operations Resilience Engineering)

Capital One Financial

McLean, Virginia | 24 days ago | $197,300–$225,100

AWS, Kubernetes, Terraform, Python, PostgreSQL, Docker, CI/CD, Prometheus, Grafana, Git, Jenkins