Principal Software Engineer – AI Platform (Production Engineering / Reliability)
At a glance
AI-generated TL;DR
As a Principal Individual Contributor in production engineering and observability on our AI Platform, you will lead efforts to keep mission-critical systems highly available, performant, and reliable. Day to day, you will define and drive production practices, build monitoring and alerting systems, and establish operational readiness standards for new AI capabilities. You will use tools such as Prometheus, Grafana, OpenTelemetry, and Azure Monitor to design end-to-end observability systems and develop actionable alerts tied to business impact. You will also partner with ML engineers to improve model production readiness, mentor senior engineers, and set direction for operational excellence across the organization. The role requires 10+ years of experience in software engineering or SRE roles, deep expertise in large-scale distributed systems, and familiarity with cloud platforms such as Azure, AWS, and GCP.
What you'll do
- Own and evolve production operations strategy for AI/ML platforms.
- Define SLOs, SLIs, and error budgets for mission-critical AI systems.
- Lead root cause analysis and drive systemic improvements post-incident.
- Build end-to-end observability systems across AI workloads using modern tooling.
- Develop actionable alerts tied to business impact and system performance.
- Ensure reliable deployment and operation of real-time inference services and model pipelines.
What we're looking for
- 10+ years of experience in software engineering, production engineering, or SRE roles.
- Deep expertise in operating large-scale distributed systems in production environments.
- Proven success in building monitoring, observability, and alerting systems.
- Strong background in incident management and production support models.
- Experience with cloud platforms such as Azure, AWS, and GCP.
- Familiarity with AI/ML platforms or data-intensive systems (preferred).
- Knowledge of modern observability tools like OpenTelemetry, Prometheus, Grafana, and Datadog.
Employer
About CVS Health
CVS Health is a leading American healthcare company operating retail pharmacies, pharmacy benefit management services, and a health insurance segment through Aetna, one of the nation's largest health insurers.
Industry: Healthcare & Pharmacy
Listed pay for CVS Health roles typically runs $118,450–$284,280, based on 100 roles with salary data.