Observability Lead - Cloud SRE & Network Reliability

Lam Research

Actively hiring
Fremont, CA Posted 45 days ago $114,000–$253,000 / year

At a glance

AI generated

TL;DR

Join our GIS Infrastructure Platform Engineering team as an Observability Lead, a senior role requiring extensive experience in Site Reliability Engineering (SRE) and multi-cloud networking. You will lead a team responsible for observability frameworks, SLA/SLO/SLI disciplines, disaster recovery plans, backup operations, and network reliability across Azure, AWS, and GCP. Daily tasks include designing multi-cloud networking architectures, implementing AI-driven workflows for AIOps, and building monitoring pipelines with tools such as Prometheus, Grafana, and Datadog. The ideal candidate has a deep understanding of automation with Ansible, Terraform, and Kubernetes, and strong programming skills in Python or Go. Experience with compliance-driven disaster recovery programs and multi-cloud cost observability is preferred.

Skills

Azure, AWS, GCP, Prometheus, Grafana, Datadog, PagerDuty, Terraform, Python, Kubernetes, CI/CD, Ansible, Go, IaC, SLA/SLO/SLI, DR/BCP, ExpressRoute, DirectConnect, CloudInterconnect, MPLS, LLM-based agents, RAG patterns

What you'll do

  • Lead the development of a world-class observability platform across Azure, AWS, and GCP.
  • Define and enforce SLA/SLO/SLI frameworks to drive continuous improvement in error budget management.
  • Design multi-cloud networking architectures for robust inter-region connectivity and reliability.
  • Implement agentic AI workflows using LLM-based agents for AIOps-driven fault detection.
  • Own disaster recovery and business continuity planning, including failover validation and DR drills.

What we're looking for

  • Over 12 years of experience in Infrastructure, SRE, DevOps, or Network Engineering
  • At least 6 years of leadership experience in high-performing SRE or Platform Engineering teams
  • Expertise in defining and enforcing SLA/SLO/SLI frameworks with error budget management
  • Hands-on experience in disaster recovery (DR) and business continuity planning (BCP)
  • Deep knowledge of multi-cloud networking across Azure, AWS, and GCP
  • Experience building and operating observability platforms using Prometheus, Grafana, Datadog, etc.
  • Strong skills in automation, including Ansible, Terraform, Python, and self-healing pipelines

Market check

Salary context

This $114,000–$253,000 range sits above 58% of similar postings on FindRole.

Peer median band

$130,150–$214,075

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,400–$214,012

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Lam Research

Lam Research Corporation is a leading American supplier of wafer-fabrication equipment and services to the global semiconductor industry.

Lam Research currently has 238 open roles on FindRole.

Listed pay typically runs $114,000–$231,000 across 116 roles with salary data.

More like this

Similar roles

Principal Architect - Cloud and Observability

CVS Health

Remote (Work At Home-Illinois, US) 53 days ago $144,200–$288,400
OpenTelemetry, Grafana, Mimir, Loki, Tempo, Kubernetes, AWS, Azure, GCP, Prometheus, Datadog, Splunk, Dynatrace, SLOs, SLIs, SPIFFE, SPIRE, Terraform, FinOps, ServiceNow, xMatters

Product Owner, Mainframe Networking & Observability

Broadcom

Pittsburgh, PA 69 days ago $104,100–$166,500
Agile, Scrum, Product Backlog Management, User Stories, Acceptance Criteria, CI/CD, Mainframe, z/OS, TSO/ISPF, JCL, TCP/IP, SNA, Machine Learning, Observability, Networking, Dashboarding Tools

Staff Observability Platform Engineer (SRE)

CVS Health

Remote (Scottsdale, Arizona) 10 days ago $118,450–$236,900
Prometheus, Grafana, Kubernetes, AWS, Python, Java, OpenTelemetry, PostgreSQL, Docker, CI/CD, Terraform, MySQL, Loki, Tempo