Senior Site Reliability Engineer, Production Engineering

Anduril Industries

Quick summary

Work type: On-site
Location: Seattle, WA
Salary: $166,000–$220,000 / yr
Posted: today

Market check

Salary context

Above market

How this pay compares to similar roles

[Chart: pay scale from $122k to $230k, with the band where most similar roles pay shaded. Similar roles: $166k. This role: $193k.]

This role pays more than 80% of similar roles do. Most pay $139,100–$193,000, the shaded band above. At the midpoint, this role pays about $193k versus about $166k for comparable roles.

Based on 239 similar postings.

Employer

About Anduril Industries

Anduril Industries is a defense technology company that builds advanced hardware and software systems for national security, including autonomous drones, surveillance systems, and the Lattice AI command platform.

Anduril Industries currently has 1882 open roles on FindRole.

Listed pay typically runs $146,000–$194,000 across 1696 roles with salary data.

At a glance

TL;DR · Senior Site Reliability Engineer, Production Engineering

As a Senior Site Reliability Engineer on Anduril's Production Engineering team, you will help ensure the reliability and scalability of Lattice, the company's autonomous command and control platform. Your responsibilities include designing comprehensive monitoring systems, driving incident response, building automation with tools like Terraform and Kubernetes operators, establishing SLOs, and improving system architecture for reliability. You will also develop capacity planning models, create runbooks, and lead cross-functional efforts to improve deployment safety. The ideal candidate has deep expertise in Kubernetes, strong programming skills in a language such as Go or Python, and experience with observability stacks like Prometheus and Grafana. This role requires a U.S. Secret security clearance and involves work on mission-critical systems that directly affect national security at massive scale.

Skills

Kubernetes, Terraform, Go, Python, Rust, Java, Prometheus, Grafana, AWS, Azure, GCP, CI/CD, PostgreSQL, Istio, Linkerd, Vault, Sealed Secrets, SOPS, Jenkins, ArgoCD, FluxCD, Spinnaker

What you'll do

Design and implement comprehensive monitoring, observability, and alerting systems.
Drive incident response and conduct blameless postmortems to prevent recurrence of issues.
Build infrastructure automation using Terraform, Kubernetes operators, and custom tooling.
Establish Service Level Objectives (SLOs) and error budgets for system reliability.
Partner with software engineering teams to improve system architecture for reliability.
Develop capacity planning models and performance testing frameworks for peak demand.
Create runbooks and documentation to enable effective operation of production systems.

What we're looking for

7+ years of engineering experience, including at least 3 years in SRE or production operations.
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
Deep expertise with Kubernetes and cloud platforms (AWS, Azure, GCP).
Strong programming skills in Go, Python, Rust, or Java for building production tooling.
Proven ability to design observability stacks using Prometheus, Grafana, and ELK/EFK.
Demonstrated track record of improving system reliability through architectural changes.
Must hold a U.S. Secret security clearance.