AI Platform Reliability Engineer

Oracle

Quick summary

Work type: On-site
Location: Austin, TX
Salary: $79,200–$209,500 / yr
Posted: 12 days ago

Market check

Salary context

Below market

How this pay compares to similar roles

[Chart: this role $144k vs. similar roles $203k, on a scale from $59k to $264k; a shaded band marks where most similar roles pay.]

This role pays less than 85% of similar roles. Most pay $162,000–$244,287 (the shaded band above). At the midpoint, this role pays about $144k versus about $203k for comparable roles.

Based on 240 similar postings.

Employer

About Oracle

Oracle Corporation is a leading multinational technology company specializing in database software, cloud computing, and enterprise software.

Oracle currently has 755 open roles on FindRole.

Listed pay typically runs $97,500–$209,500 across 568 roles with salary data.

At a glance

TL;DR · AI Platform Reliability Engineer

Oracle Health is hiring an AI Platform Reliability Engineer, a senior individual contributor role within the product development team, focusing on ensuring the reliability of its AI agent platform and analytics workflows. This engineer will build observability tools for monitoring and tracing AI systems, implement robust release management practices including rollback controls, and design evaluation strategies to detect performance issues in production. The ideal candidate has expertise in CI/CD, incident response, and operational tooling, with a strong background in reliability engineering and scripting or software development. This role is crucial for maintaining trust in AI outputs and ensuring that new capabilities can scale safely across Oracle Health’s enterprise platforms.

What you'll do

  • Build and maintain observability, logging, tracing, and monitoring for AI agents and workflows.
  • Implement release, rollout, rollback, and versioning controls for AI systems and configurations.
  • Design production evaluation practices to detect regressions, silent failures, and performance issues.
  • Support incident response, triage, root-cause analysis, and operational reporting for AI issues.
  • Contribute to data monitoring and reliability workflows, including detection of anomalies and schema drift.
  • Implement latency, throughput, and cost monitoring controls for AI-enabled systems.

What we're looking for

  • 6+ years of experience in AI/ML platform operations, including CI/CD and release management.
  • Strong background in observability, logging, tracing, monitoring, and alerting for production systems.
  • Expertise in incident response, root cause analysis, and operational reporting for complex issues.
  • Experience with evaluation practices to detect regressions, silent failures, quality drift, and performance issues.
  • Proficiency in scripting or software development, versioning, and configuration management.

More like this

Similar roles

Senior AI Site Reliability Engineer

Oracle

US · 19 days ago
AWS Azure OCI Kubernetes Terraform Python Java Go Docker Prometheus Grafana CI/CD Vertica Snowflake Tableau Power BI Oracle Analytics LangChain AutoGPT Jenkins

AI Enablement Engineer

Electronic Arts

Vancouver, British Columbia, Canada +2 · 13 days ago · $122,300–$170,700
Python AWS C# JavaScript HTML CSS Cursor GitHub Copilot Claude Code Prompt and context engineering AI agents MCP AI assistants RAG Vector databases Model tuning CI/CD Observability Infrastructure as code Unity Unreal
Hybrid

Applied AI Engineer

Ramp

Remote (New York City, New York, US) · 155 days ago · $155,000–$339,500
Python JavaScript Node.js Django Flask React PostgreSQL MongoDB AWS GCP Kubernetes Terraform CI/CD GitOps
Remote

Applied AI Engineer

Booz Allen Hamilton

Fort Belvoir, VA +1 · 32 days ago · $99,000–$225,000
Python FastAPI Flask Streamlit Gradio React TypeScript Kubernetes CI/CD Prometheus Grafana MLOps Docker PostgreSQL AWS Azure Google Cloud Platform

Applied AI Engineer

Apple Inc

Cupertino, CA · 35 days ago · $181,100–$272,100
Python FastAPI LangChain LLMs GenAI RESTful APIs Vector databases Async programming Pipeline orchestration Prometheus OpenTelemetry Redis RabbitMQ Kafka Docker CI/CD

AI Engineer

Fiserv

Columbus, OH +2 · 1 day ago · $109,000–$182,400
Python R SQL Hadoop Spark Databricks Machine Learning Classification Clustering Anomaly Detection Time Series CI/CD MLOps Endpoint Protection Identity and Access Data Network Telemetry Data Visualization AWS Azure