Data Engineer - Training Pipelines & Inference

Howard Hughes Medical Institute (HHMI)

Verified listing
Janelia Research Campus, US · Posted 100 days ago

At a glance

AI generated

TL;DR

Join HHMI’s Foundational Microscopy Image Analysis (MIA) project as a Data Engineer and help build the data backbone for AI-powered spatial biology. You will design, develop, and optimize scalable data pipelines, including multi-node GPU training and inference systems, to support large, heterogeneous microscopy image datasets. Your daily tasks include writing production-quality Python code to parse and validate microscopy data from various sources, ensuring scientific integrity and reproducibility. This role requires expertise in data engineering, high-performance computing environments, and ML frameworks like PyTorch or JAX. You will collaborate with interdisciplinary teams of computational and experimental scientists, maintaining detailed documentation and potentially mentoring junior engineers. The project aims to create one of the world’s most comprehensive multimodal 3D/4D microscopy datasets, driving impactful scientific research in life sciences.

Skills

Python, PyTorch, JAX, Kubernetes, AWS SageMaker, Google Vertex AI, HDF5, Zarr, Parquet, webdataset, Docker, CI/CD, Git, Slurm, LSF, Matplotlib, R, Jupyter notebooks, Cloud-based computing, Multi-node GPU processing

What you'll do

  • Design and implement scalable data pipelines for foundational microscopy datasets.
  • Develop multi-node GPU training and inference pipelines for vision foundation models.
  • Maintain comprehensive documentation of data provenance and transformation steps.
  • Establish data standards, formats, and workflows to ensure quality and reproducibility.
  • Analyze large datasets using statistical tools and programming languages like Python and R.
  • Collaborate with interdisciplinary teams to define best practices in data engineering.

What we're looking for

  • Bachelor’s degree in Computer Science or related field with 3+ years of data engineering experience.
  • Experience with volumetric 3D/4D microscopy data analysis tools and high-performance computing environments.
  • Proficiency in distributed data processing, multi-node GPU processing, and ML development frameworks like PyTorch and JAX.
  • Expertise in building scalable data solutions and ensuring data quality and accessibility.
  • Strong technical documentation and communication skills for multidisciplinary team collaboration.
  • Experience with data formats such as Zarr, Parquet, HDF5, and efficient IO methods.
  • Proficiency in Python and experience with data visualization libraries such as Matplotlib.

Market check

Salary context

This listing doesn't show a salary. Similar roles on FindRole typically pay $106,560–$187,790.

Peer median band

$106,560–$187,790

Median floor and ceiling across peers.

Typical midpoint (25–75%)

$126,800–$199,850

Middle half of comparable postings.

Based on 240 comparable postings.

* 240 is the maximum number of comparable postings sampled.

Employer

About Howard Hughes Medical Institute (HHMI)

Howard Hughes Medical Institute (HHMI) is one of the largest private biomedical research organizations in the world, funding basic research and science education to advance human health and knowledge. Industry: Biomedical Research & Science Education

Howard Hughes Medical Institute (HHMI) currently has 4 open roles on FindRole.

More like this

Similar roles

Data Science Engineer

Booz Allen Hamilton

McLean, Virginia, US · 10 days ago · $77,600–$176,000
Python, SQL, Kubernetes, ETL, Data Pipelines, Spark, AWS, Terraform, ELK, OpenSearch

Data Engineer - AI and Analytics

CVS Health

Remote (Buffalo Grove-2100 E Lake Cook, US) · 20 days ago · $79,310–$158,620
Python, SQL, NoSQL, ETL, ELT, Data warehouses, Big data, Cloud architecture, GCP, AWS, Azure, Reporting tools, Query optimization, Metadata management, Workload management, Git, CI/CD, Bash shell scripts, UNIX utilities, Agile methodologies, API development, Microservices, SOA

Data Engineer II

The Walt Disney Company

Remote (USA - NY - 7 Hudson Square, US) · 10 days ago · $112,000–$150,100
Scala, Python, Spark, Airflow, Databricks, Delta Lake, Snowflake, AWS S3, SQL, CI/CD, Agile, Scrum, GraphQL, Redshift, BigQuery, Terraform

Data Engineer

Sutter Health

Remote (2121 North California Boulevard Suite 310, US) · 23 days ago · $145,204–$217,796
Cloudera, Spark, Python, Databricks, SQL, Oracle, T-SQL, ANSI SQL, Kafka, Hadoop, AWS, Azure, GCP, CI/CD, Git, JIRA, PostgreSQL, Redshift, Snowflake, Tableau

Data Engineer

Q2

Cary, North Carolina, US · 36 days ago
Python, SQL, Snowflake, Apache Airflow, dbt, Kafka, Terraform, Kubernetes, Docker, Git, CI/CD, PostgreSQL, AWS Glue, PySpark, Databricks, SageMaker

Data Engineer

Booz Allen Hamilton

Fayetteville, North Carolina, US · 55 days ago · $77,500–$176,000
ETL, ELT, data pipelines, batch processing, streaming workflows, data catalog, API integration, OAuth2, Python, SQL, CI/CD