Site Reliability Engineer (Edge Services), Infrastructure Services

Austin, Texas, USA
Posted 42 days ago

Role Details

The Edge Services team is looking for a software engineer to champion the evolution of our production ecosystems. In this role, you will help drive our vision for production visibility, moving beyond simple uptime metrics to build a sophisticated, data-driven reliability framework. You will play a pivotal role in ensuring our services are resilient, scalable, and observable, bridging the gap between complex distributed systems and seamless user experiences.

We’re seeking an engineer who is passionate about building system software and solving seemingly insurmountable problems, and who is deeply committed to delivering an outstanding customer experience. You'll go beyond the industry standard, demonstrating creativity in problem-solving, the ability to think dynamically, and the agility to adapt quickly to new technical areas.

You will be responsible for designing and implementing observability and alerting strategies, building self-healing systems, reducing toil through automation, and partnering with development teams to ensure reliability. You will also proactively identify and mitigate performance bottlenecks before they impact customers.

Systems Expertise: Strong understanding of Linux internals and deep networking expertise, including HTTP/2, HTTP/3 (QUIC), and HTTPS/TLS. You should be comfortable debugging protocol-level issues and optimizing traffic flow.
Automation Mindset: Proven ability to automate repetitive tasks and complex workflows using Python or Go.
Observability Logic: Experience configuring and managing modern monitoring suites (e.g., Prometheus, Grafana, ClickHouse) with a focus on creating actionable, high-signal alerting.
CS Fundamentals: Solid grasp of Data Structures and Algorithms (DSA) to write efficient, performant code and troubleshoot complex system bottlenecks.
SRE Principles: Practical knowledge of SLIs, SLOs, Error Budgets, Release Management, and Incident Management to drive engineering priorities.
BS in Computer Science or a related field, or equivalent job-related experience.
Infrastructure as Code: Experience managing cloud environments (AWS, GCP, or Azure) using Terraform, Ansible, or Pulumi.
Orchestration: Hands-on experience scaling and securing containerized workloads via Kubernetes.
Incident Response: A track record of leading "blameless post-mortems" and using those insights to harden the system against future failures.
Architectural Influence: Ability to consult with product teams on service design to improve long-term maintainability.
Reliability Engineering: A proactive engineering mindset focused on shifting from "fixing things when they break" to "designing things so they don't break" (or so they fail gracefully).


About Apple Inc

Apple Inc. is a multinational technology company known for designing and manufacturing consumer electronics, software, and online services, including the iPhone, Mac, iPad, and App Store.

Industry: Consumer Electronics & Software
