Role Details
Own end‑to‑end reliability metrics, including signal definition, instrumentation, monitoring, alerting, and ongoing metric quality.
Act as a Designated Responsible Individual (DRI) for live‑site reliability, including on‑call participation, incident mitigation, post‑incident reviews, and driving long‑term corrective actions.
Partner with feature teams to influence design‑for‑reliability and resiliency decisions, preventing regressions before release.
Analyze telemetry and customer feedback to identify reliability gaps and trends, integrating learnings into the engineering lifecycle.
Collaborate with and mentor engineers across product, research, and engineering teams by sharing best practices in telemetry, feedback loops, and reliability, and by providing technical guidance and code reviews that raise the overall engineering bar.

Bachelor's Degree in Computer Science or related technical field AND 4+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
4+ years of experience building and operating full‑stack, production‑grade software systems at scale.
2+ years of experience working with large‑scale telemetry systems and data analysis using SQL‑based query languages.
Experience using modern AI‑assisted development tools such as GitHub Copilot or Claude Code to improve engineering productivity.

These requirements include but are not limited to the following specialized security screenings:

Master's Degree in Computer Science or related technical field AND 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Hands‑on experience using modern AI‑assisted development tools such as GitHub Copilot or Claude Code to improve engineering productivity.
Experience improving core fundamentals such as reliability, availability, and performance in customer‑facing systems.
Proven ability to solve complex technical problems through cross‑team and cross‑organization collaboration.
Experience operating high‑availability, globally distributed services.
Knowledge of Azure infrastructure and multi‑cloud environments.
About Microsoft
Microsoft Corporation is a global technology leader producing software, hardware, and cloud services, including Windows, Office 365, the Azure cloud platform, Xbox gaming, and Surface devices.
Industry: Software & Cloud Computing