Senior Reliability Engineer

Uber

Quick summary

Work type: On-site
Location: Sunnyvale, CA
Posted: 6 days ago
Nearby: 99+ roles within 25 mi

Market check

Salary context

How this pay compares to similar roles

Similar roles: $171k (scale shown: $132k to $213k)

This listing doesn't post a salary. Most similar roles pay $144,339–$198,200.

Based on 240 similar postings.

Employer

About Uber

Uber Technologies, Inc., based in San Francisco, operates the world’s largest mobile technology platform for on-demand ride-hailing, food delivery (Uber Eats), and freight transportation across approximately 70 countries.

Uber currently has 45 open roles on FindRole.

At a glance

TL;DR · Senior Reliability Engineer

As a Senior Reliability Engineer at AV Labs, you will join a dedicated team focused on ensuring the reliable operation of Uber’s in-vehicle sensor data collection systems. Your primary responsibilities include architecting observability platforms that ingest and analyze real-time health telemetry from thousands of distributed vehicle nodes, developing edge-constrained systems for diverse hardware environments, and defining criticality models to distinguish transient anomalies from systemic issues impacting sensor uptime and data yield. You will also design automated detection mechanisms to eliminate manual intervention as the fleet scales, collaborate with Operations and Engineering teams to build safe, automated responses to recurring failures, and drive reliability-focused technical strategy through design reviews and roadmaps. This role requires expertise in distributed systems, observability platforms like Prometheus and Grafana, proficiency in languages such as Go or Python, and deep knowledge of Linux internals and networking protocols.

Skills

Prometheus Grafana ELK Go Python C++ Linux Docker Shell scripting SLIs and SLOs TCP/IP gRPC MQTT CI/CD AWS Kubernetes

What you'll do

Architect observability platforms to ingest and analyze real-time health data from distributed vehicle nodes.
Develop systems that maintain performance across diverse hardware with intermittent connectivity challenges.
Define alerting strategies to distinguish transient anomalies from systemic issues affecting sensor uptime.
Design detection logic for silent failures like sensor degradation, compute saturation, or recording pipeline stalls.
Create automated mechanisms to detect, triage, and mitigate issues as the fleet scales without manual intervention.

What we're looking for

5+ years of experience in software engineering, site reliability, or systems engineering.
Expertise in modern observability platforms like Prometheus, Grafana, and ELK for edge/IoT environments.
Proficiency in Go, Python, or C++ with production system development experience.
Deep knowledge of Linux internals and shell scripting for debugging hardware-related issues.
Proven reliability ownership for large-scale production systems, including SLI/SLO implementation.
Leadership in driving complex technical projects across multiple teams from design to production.