Principal Software Engineer – AI Platform (Production Engineering / Reliability)

CVS Health

Remote (Work At Home-Texas, US) Posted 4 days ago $144,200–$288,400 / year

At a glance

TL;DR

As a principal individual contributor for production engineering and observability on the AI Platform, you will lead efforts to ensure the high availability, performance, and reliability of mission-critical systems. Your day-to-day responsibilities include defining and driving best-in-class production practices, building robust monitoring and alerting ecosystems, and establishing operational readiness standards for new AI capabilities. You will work with modern tools such as Prometheus, Grafana, OpenTelemetry, and Azure Monitor to design end-to-end observability systems and develop actionable alerts tied to business impact. You will also partner with ML engineers to improve model production readiness, mentor senior engineers, and set direction for operational excellence at organizational scale. This role requires 10+ years of experience in software engineering or SRE roles, deep expertise in large-scale distributed systems, and familiarity with cloud platforms such as Azure, AWS, and GCP.

Skills

Prometheus, Grafana, OpenTelemetry, Kubernetes, AWS, Azure, CI/CD, SLO-driven engineering, Infrastructure as Code (IaC), Docker, Terraform, Python, PostgreSQL, Datadog, Model observability, MLOps, Streaming systems, High-availability systems

What you'll do

  • Own and evolve production operations strategy for AI/ML platforms.
  • Define SLOs, SLIs, and error budgets for mission-critical AI systems.
  • Lead root cause analysis and drive systemic improvements post-incident.
  • Build end-to-end observability systems across AI workloads using modern tooling.
  • Develop actionable alerts tied to business impact and system performance.
  • Ensure reliable deployment and operation of real-time inference services and model pipelines.

What we're looking for

  • Over 10 years of experience in software engineering, production engineering, or SRE roles.
  • Deep expertise in operating large-scale distributed systems in production environments.
  • Proven success in building monitoring, observability, and alerting systems.
  • Strong background in incident management and production support models.
  • Experience with cloud platforms such as Azure, AWS, and GCP.
  • Familiarity with AI/ML platforms or data-intensive systems (preferred).
  • Knowledge of modern observability tools like OpenTelemetry, Prometheus, Grafana, and Datadog.

Employer

About CVS Health

CVS Health is a leading American healthcare company operating retail pharmacies, pharmacy benefit management services, and a health insurance segment through Aetna, one of the nation's largest health insurers.

Industry: Healthcare & Pharmacy

CVS Health currently has 104 open roles on FindRole.

Listed pay typically runs $118,450–$284,280 across 100 roles with salary data.
