Observability and IaC Engineer

Fremont, CA, USA (posted today)

$114,000 - $253,000/year

Role Details

The group you’ll be a part of

The Global Information Systems Group is dedicated to the success of Lam through providing best-in-class and innovative information system solutions and services. Together, we support users globally with data, information, and systems to achieve their business objectives.

The impact you’ll make

Our team at Lam is seeking a hands-on Observability Lead with a strong Site Reliability Engineering (SRE) and multi-cloud networking foundation to join our GIS Infrastructure Platform Engineering team. You will lead engineers in delivering robust observability frameworks, SLA/SLO/SLI disciplines, DR/BCP programs, backup and restore operations, and end-to-end network reliability across Azure, AWS, and GCP. You will own the full-stack delivery of observability, reliability, and resilience capabilities across a global multi-cloud enterprise.

What you’ll do

Lead and grow a team delivering a world-class observability platform across global, multi-cloud production environments, including Azure, AWS, and GCP.
Define and enforce SLA, SLO, and SLI frameworks across all infrastructure and network domains, driving continuous improvement through effective error budget management.
Own end-to-end multi-cloud network observability, including VNet and VPC traffic flows, Transit Gateway routing, BGP peering health, and inter-region connectivity.
Design and govern multi-cloud networking architectures, including Azure VNet, AWS VPC and Transit Gateway, GCP VPC, and hybrid connectivity solutions such as ExpressRoute, Direct Connect, and Cloud Interconnect.
Design and implement agentic AI workflows using LLM-based agents, RAG patterns, and orchestration frameworks to enable AIOps-driven fault detection and remediation.
Own disaster recovery (DR) and business continuity planning (BCP) strategy, including runbook authorship, multi-cloud failover validation, and periodic DR drills to ensure RTO and RPO commitments are met.
Lead backup and restore operations across multi-cloud and hybrid environments, incorporating automated validation and cross-cloud recovery workflows.
Build robust monitoring and alerting pipelines by integrating Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Azure Monitor, CloudWatch, and Google Cloud Operations into a unified observability stack.
Drive automation-first practices through self-healing pipelines, remediation playbooks, and infrastructure-as-code (IaC) patterns to reduce toil and improve MTTR.
Lead P1, P2, and P3 incident response efforts, including structured post-mortems and action tracking.
Define and drive the multi-quarter roadmap for observability, reliability, networking, DR/BCP, and AI-assisted operations.
Support hiring, performance management, and career development for the team.

Who we’re looking for

A BS, MS, or PhD in Computer Science, Engineering, or a related field (or equivalent experience), with 12+ years of overall experience in Infrastructure, SRE, DevOps, or Network Engineering and 6+ years of experience leading high-performing SRE, Observability, or Platform Engineering teams.
Proven expertise in defining, enforcing, and operating SLA, SLO, and SLI frameworks, including effective error budget management.
Hands-on experience with disaster recovery (DR) and business continuity planning (BCP), including RTO/RPO planning, failover testing, and continuity documentation.
Deep expertise in backup and restore operations across multi-cloud and hybrid environments.
Strong multi-cloud networking skills across Azure (VNet, ExpressRoute, Virtual WAN), AWS (VPC, Transit Gateway, Direct Connect), and GCP (VPC, Cloud Interconnect, VPC-SC).
Experience building and operating observability platforms, including tools such as Prometheus, Grafana, Datadog, PagerDuty, ThousandEyes, Splunk, or equivalent solutions, with a focus on network telemetry and flow analysis.
Deep expertise in automation, including Ansible, Terraform, Python, and self-healing infrastructure pipelines.
Hands-on experience with infrastructure as code (IaC), CI/CD pipelines, Kubernetes (AKS, EKS, GKE), and all three major cloud platforms.
Strong programming skills in Python or Go for tooling, automation, and system integrations.
Experience leading P1, P2, and P3 incident management, including ITSM integration (ServiceNow preferred).
Exceptional communication skills, with the ability to translate complex technical concepts into clear business value for engineering, product, and executive stakeholders.

Preferred qualifications

Experience with AIOps, including AI-assisted network fault detection, anomaly correlation, and auto-remediation.
Familiarity with agentic AI workflows, including LLM-based agents and RAG patterns, applied to observability and operational use cases.
Background in global WAN architectures, including MPLS and resilience strategies for multi-region enterprise environments.
Experience with compliance-driven disaster recovery and business continuity (DR/BCP) programs, including InfoSec audits, SOX, and ISO 22301 requirements.
Experience with FinOps and multi-cloud cost observability, including network egress visibility and cost optimization across Azure, AWS, and GCP.
Relevant cloud certifications, such as Azure AZ-700 or AZ-305, AWS ANS-C01 or SAP-C02, and GCP Professional Cloud Network Engineer or Architect.
Background in HPC, on-premises, or hybrid cloud environments.

Our commitment

We believe it is important for every person to feel valued, included, and empowered to achieve their full potential. By bringing unique individuals and viewpoints together, we achieve extraordinary results.

Lam Research ("Lam" or the "Company") is an equal opportunity employer. Lam is committed to and reaffirms support of equal opportunity in employment and non-discrimination in employment policies, practices and procedures on the basis of race, religious creed, color, national origin, ancestry, physical disability, mental disability, medical condition, genetic information, marital status, sex (including pregnancy, childbirth and related medical conditions), gender, gender identity, gender expression, age, sexual orientation, or military and veteran status or any other category protected by applicable federal, state, or local laws. It is the Company's intention to comply with all applicable laws and regulations. Company policy prohibits unlawful discrimination against applicants or employees.

Lam offers a variety of work location models based on the needs of each role. Our hybrid roles combine the benefits of on-site collaboration with colleagues and the flexibility to work remotely, and fall into two categories: On-site Flex and Virtual Flex. In an ‘On-site Flex’ role, you’ll work 3+ days per week on-site at a Lam or customer/supplier location, with the opportunity to work remotely for the balance of the week. In a ‘Virtual Flex’ role, you’ll work 1-2 days per week on-site at a Lam or customer/supplier location, and remotely the rest of the time.


Salary

CA San Francisco Bay Area Salary Range for this position: $114,000.00 - $253,000.00.

The above salary range for this position applies only to applicants who reside or work onsite in the San Francisco Bay Area, California. Salary offers will depend on factors that include the location you work from, your level, education, training, specific skills, years of experience, and comparison to other employees already in this role. Actual salary may vary from the salary offered due to numerous factors, including but not limited to unpaid time off, unpaid leave, company-mandated shutdown, and other relevant factors.

Our Perks and Benefits

At Lam, our people make amazing things possible. That’s why we invest in you throughout the phases of your life with a comprehensive set of outstanding benefits.


About Lam Research

Lam Research Corporation is a leading American supplier of wafer-fabrication equipment and services to the global semiconductor industry.
