MTS - Site Reliability Engineer | Microsoft Careers

Microsoft

Hybrid · Actively hiring
San Francisco Bay area · New York City metropolitan area · Posted 84 days ago · $119,800–$234,700 / year

At a glance

AI generated

TL;DR

As a Site Reliability Engineer on our infrastructure team, you will maintain the reliability and efficiency of our large-scale distributed AI infrastructure. Day to day, you will ensure uptime and resiliency for AI model training and inference systems; design monitoring and alerting systems; optimize resource utilization across compute, GPU clusters, storage, and networking; build automation tools for deployments and incident response; lead on-call rotations and troubleshoot issues; conduct blameless postmortems; and collaborate with ML engineers to improve developer experience. The role calls for strong proficiency in Kubernetes, Docker, CI/CD pipelines, and public cloud platforms (Azure, AWS, GCP), experience with monitoring tools such as Grafana and Datadog, and programming skills in Python or Go. You will work on cutting-edge infrastructure that powers Generative AI, where reliable deployments affect millions of users.

Skills

Kubernetes, Docker, CI/CD, AWS, Azure, GCP, Terraform, Python, Go, Bash, Grafana, Datadog, OpenTelemetry, Networking, Storage, GPU, HPC, Capacity Planning, Cost Optimization

What you'll do

  • Ensure uptime and resiliency of AI model training and inference systems.
  • Design and maintain monitoring and alerting systems for real-time visibility.
  • Analyze system performance to optimize resource utilization in GPU clusters.
  • Build automation tools for deployments and incident response in hybrid clouds.
  • Lead on-call rotations, troubleshoot issues, and conduct blameless postmortems.
  • Ensure data privacy and secure operations across model training environments.

What we're looking for

  • 4+ years of experience in Site Reliability Engineering or a related field.
  • Proficiency in Kubernetes, Docker, and container orchestration tools.
  • Hands-on experience with public cloud platforms (Azure/AWS/GCP) and infrastructure-as-code practices.
  • Expertise in monitoring and observability tools like Grafana, Datadog, OpenTelemetry.
  • Strong programming skills in Python, Go, or Bash for automation and tooling.
  • Experience managing large-scale GPU clusters for ML/AI workloads.

Market check

Salary context

This $119,800–$234,700 range sits above 64% of similar postings on FindRole.

Peer median band

$120,750–$202,200

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$142,450–$195,000

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Microsoft

Microsoft Corporation is a global technology leader producing software, hardware, and cloud services, including Windows, Office 365, the Azure cloud platform, Xbox gaming, and Surface devices.

Industry: Software & Cloud Computing

Microsoft currently has 451 open roles on FindRole.

Listed pay typically runs $119,800–$234,700 across 417 roles with salary data.

More like this

Similar roles

MTS - Backend Engineer | Microsoft Careers

Microsoft

US · 36 days ago · $119,800–$234,700
GraphQL, REST, Protobuf, Thrift, WebSocket, SSE, WebRTC, Azure, AWS, GCP, Docker, Kubernetes, CI/CD, PostgreSQL, Redis, MongoDB, Python, Go, Java, C#

Site Reliability Engineer II | Microsoft Careers

Microsoft

US · 161 days ago · $100,600–$199,000
Python, Docker, Kubernetes, Terraform, AWS, CI/CD, Git, Linux, Azure, PostgreSQL, Ansible, Jenkins, Prometheus, Grafana, JSON, YAML, REST, OAuth, PCI DSS

Careers - Senior Site Reliability Engineer

Block

New York, New York, US · 48 days ago · $189,000–$283,600
AWS, Terraform, Kubernetes, Istio, Event driven architectures, CI/CD, DataDog, LaunchDarkly, Java, Kotlin, gRPC, Protocol Buffers, MySQL, Vitess, DynamoDB, HTTP, JSON