Evaluation Reliability SRE

Apple Inc

Actively hiring Posted this week
Cupertino, CA Posted 2 days ago $212,000–$318,400 / year

At a glance

AI generated

TL;DR

As a senior Site Reliability Engineer (SRE) on the Evaluation Reliability Engineering (ERE) team at Siri, you will play a critical role in ensuring the reliability of the evaluation infrastructure stack, including orchestration, capacity management, and service health. Your day-to-day responsibilities include leading incident investigations, authoring high-quality runbooks for complex failure scenarios, and building deep expertise in device orchestration and provisioning layers to diagnose upstream issues independently. You will also instrument infrastructure components lacking observability, balance proactive reliability work with incident response, and partner on defining SLOs and burn-rate alerting. Fluency with agentic coding tools like Claude Code or Copilot is essential for automating runbooks and log analysis. Ideal candidates have extensive experience in site reliability engineering, hands-on orchestration skills, and a track record of improving system reliability through measurable outcomes.

Skills

Kubernetes, Python, Go, Docker, CI/CD, Prometheus, Grafana, Claude Code, Cursor, Copilot, AWS, Terraform, PostgreSQL, GitOps

What you'll do

  • Own reliability outcomes across evaluation infrastructure: orchestration, capacity, and service health.
  • Lead incident investigations end-to-end and set operational standards for the team.
  • Build expertise in device orchestration and provisioning layers to diagnose issues independently.
  • Instrument infrastructure components lacking observability to detect failures proactively.
  • Balance incident response with proactive reliability work, focusing on automation and eliminating recurring failures.

What we're looking for

  • 5+ years of site reliability or infrastructure engineering experience with direct production system ownership
  • Hands-on experience with Kubernetes or equivalent orchestration tools for cluster health and resource management
  • Expertise in device or VM provisioning pipelines and virtualization-layer failure modes
  • Proven track record of improving system reliability through measurable outcomes like uptime, MTTR, and incident frequency
  • Incident command discipline to lead multi-team incidents from declaration to resolution
  • Depth in distributed systems reliability, device management infrastructure, evaluation, or ML platform operations

Employer

About Apple Inc

Apple Inc. is a multinational technology company known for designing and manufacturing consumer electronics, software, and online services, including the iPhone, Mac, iPad, and App Store.

Industry: Consumer Electronics & Software

Apple Inc currently has 255 open roles on FindRole.

Listed pay typically runs $171,600–$272,100 across 182 roles with salary data.
