ML Research Engineer, AI Evaluation Platform

Seattle, Washington, USA Posted 14 days ago

$171,600 - $258,100/year

Role Details

This is a combined research and engineering role, sitting with and between research/applied scientists and platform engineers. New evaluation research can be challenging to use at scale, and that's where your skills in both machine learning and engineering come into play. On the research side, you will partner with scientists to rapidly prototype their ideas, implement methods from recent papers, run large-scale experiments, and provide critical feedback grounded in your engineering experience. On the engineering side, you will work with platform engineers to bring those research prototypes into production, moving from Python packages on local machines to robust services deployed in the cloud. While past experience in research is not required, a desire to advance the state of the art in AI evaluation is. You should be ready to jump in across the full lifecycle of bringing new research into production at scale, speaking both the language of research and the language of engineering.

Rapid Prototyping & Experimentation: Collaborate with research and applied scientists to translate evaluation research ideas into working prototypes: implementing methods from recent papers, building experimental pipelines, and iterating quickly to validate hypotheses in areas such as preference learning, LLM-as-judge calibration, and automated failure discovery.
Research-to-Production Bridge: Own the lifecycle of moving evaluation methods from research prototypes to production-ready systems. Refactor research code into robust, well-tested Python packages and partner with platform engineers to deploy them as scalable services, APIs, and SDK components.
Experiment Infrastructure: Design and maintain the infrastructure for running large-scale evaluation experiments: orchestrating LLM judge calls, managing datasets, tracking experiment results, and ensuring reproducibility across the team's research portfolio.
Technical Feedback & Collaboration: Serve as a critical technical partner to researchers, providing engineering perspective on feasibility, scalability, and system design. Identify opportunities where engineering improvements (parallelization, caching, smarter batching) can unlock new research directions or dramatically accelerate experimentation.
Scaling Evaluation Methods: Identify bottlenecks in evaluation workflows and engineer solutions to operate at Apple scale, optimizing for throughput, cost, and reliability when running evaluation methods across large model populations and diverse use cases.
Code Quality & Engineering Standards: Champion engineering best practices within the research workflow, including version control, automated testing, documentation, and CI/CD, raising the bar for code quality across the research-engineering boundary.
Cross-Functional Integration: Work across the research and platform engineering teams to ensure that evaluation methods integrate seamlessly with Apple's broader ML infrastructure, developer workflows, and internal tooling ecosystem.

Bachelor's degree in Computer Science, Machine Learning, Software Engineering, or a closely related field (Master's preferred)
2+ years of hands-on experience in a role combining machine learning and software engineering (e.g., ML engineer, research engineer, or applied scientist with strong engineering output), or a Master's degree in Computer Science, Machine Learning, or a closely related field with relevant project experience
Strong proficiency in Python and the modern ML ecosystem (PyTorch, JAX, or TensorFlow), with demonstrated ability to implement complex methods from recent ML papers
Solid software engineering fundamentals: clean code design, version control, testing, debugging, and performance optimization
Experience working with large language models, whether fine-tuning, inference, prompting pipelines, or building LLM-powered applications
Demonstrated ability to work across the research-to-production spectrum: you have taken experimental or prototype code and made it robust, scalable, and usable by others
Practical experience with cloud-native development and deployment: containerization (Docker/Kubernetes), CI/CD pipelines, and distributed computing frameworks (e.g., Ray, Spark)
Strong communication skills and comfort working in interdisciplinary teams, with the ability to engage productively with both researchers and platform engineers
Comfort with ambiguity and new problem spaces: you thrive when building something that doesn't yet have a playbook

Master's or Ph.D. in Computer Science, Machine Learning, or a related field
Experience with evaluation-specific methods or frameworks: LLM-as-judge approaches, reward modeling, RLHF, calibration techniques, benchmark design, or human evaluation methodology
Familiarity with modern evaluation tools and frameworks (e.g., DeepEval, Ragas, TruLens, LangSmith) and an understanding of how to implement and scale model-based evaluation workflows
Track record of contributing to research outputs (co-authored publications, open-source contributions, or internal research reports), even if research is not your primary role
Experience with the engineering challenges specific to generative AI and agentic systems: managing token economics, handling non-deterministic outputs, evaluating multi-turn agent trajectories and tool usage
Familiarity with statistical concepts relevant to evaluation: calibration, inter-rater reliability, scoring rules, or measurement validity
Experience in fast-moving, early-stage teams where you helped define technical direction and engineering culture from the ground up


About Apple Inc

Apple Inc. is a multinational technology company known for designing and manufacturing consumer electronics, software, and online services, including the iPhone, Mac, iPad, and App Store.

Industry: Consumer Electronics & Software
