MLGenX Workshop @ ICLR 2026
MLGenX LLM Perturbation Challenge
The goal of this challenge is to test whether LLMs and agentic systems can do more than talk about biology: whether they can serve as useful computational engines for predicting cellular behavior.
The next wave of AI for science may emerge from the intersection of two frontier ideas: LLMs/agents as general reasoning systems, and virtual cells as predictive models of biological response. The MLGenX LLM Perturbation Challenge is designed to explore exactly that boundary.
In this challenge, participants tackle a focused but scientifically meaningful task: predicting the outcome of a Perturb-seq experiment. Given a CRISPR knockout of gene X, predict whether a target gene Y will be up-regulated, down-regulated, or unchanged in activated macrophages. This framing turns perturbation biology into a benchmark for machine reasoning: can an AI system combine biological knowledge, causal intuition, and structured inference to anticipate the downstream effects of an intervention?
The challenge is a controlled comparison of three emerging paradigms for scientific LLMs and agentic reasoning: prompt-guided reasoning, agentic tool use, and task-specific fine-tuning. In that sense, this is not just a competition in predictive performance. It is a benchmark for a broader question shaping the future of computational biology: can frontier LLMs and agents become practical building blocks for virtual cell models?
The core task is a supervised predictive challenge where participants must infer the effect of a specific gene perturbation on the expression of a target gene. Structurally, the task is defined as: Perturbation gene X → gene Y up/down/no-change. Participants will be scored based on their prediction of these post-processed labels, which indicate whether gene Y is up-regulated, down-regulated, or unchanged following the perturbation of gene X.
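As a concrete sketch of the task's input/output structure, one example can be represented as a (perturbation, target, label) triple. The encoding and the gene names below are illustrative, not the official submission format:

```python
# Hypothetical encoding of one challenge example: perturb gene X, predict
# the ternary effect on target gene Y. Gene names here are illustrative.
LABELS = {"down": -1, "no-change": 0, "up": 1}

def make_example(perturbed_gene, target_gene, label):
    """Bundle one (perturbation, target, label) triple."""
    assert label in LABELS, f"unknown label: {label}"
    return {"perturbation": perturbed_gene,
            "target": target_gene,
            "label": LABELS[label]}

example = make_example("Myd88", "Tnf", "down")
```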
The challenge consists of three tracks:
Track 1 (Prompting): Given a fixed LLM with no fine-tuning, design a prompt that encourages the model to give the best result.
Track 2 (Tool use): Given a fixed LLM with no fine-tuning, provide a prompt and a set of tools that maximize performance on the task.
Track 3 (Fine-tuning): Optimize the provided LLM to be an expert on this task.
The data consists of a set of ternary questions about genetic perturbation effects in macrophage cells.
This dataset contains gene expression profiles from primary mouse bone marrow-derived macrophages (BMDMs), which are innate immune cells differentiated from bone marrow cells isolated from Cas9-transgenic mice. The cells were transduced with a pooled CRISPR knockout library targeting genes involved in inflammatory signaling and other macrophage-relevant processes, then stimulated with bacterial lipopolysaccharide (LPS) to induce an immune response. Nine hours after stimulation, single-cell RNA profiles were collected together with perturbation identities, enabling analysis of how specific gene knockouts (perturbations) alter macrophage transcriptional states during inflammation.
Differential expression analysis was performed on the top 10,000 highly variable features selected by scanpy. For each perturbation, we used glmGamPoi to fit a negative binomial regression comparing the perturbed cells against 2,000 random cells carrying control perturbations (olfactory receptor genes), while controlling for batch as a covariate. The 4 guides per gene were pooled and counted as a single perturbation for this analysis. We shrank the log-fold-changes by applying ebnm to the flattened log-fold-change matrix, assuming a point-normal prior. To assess false positive rates of the differential expression results, we generated 500 artificial perturbations by sampling random quartets of olfactory receptor genes. Defining differentially expressed genes (DEGs) using cutoffs of 5% FDR and |shrunken log2FC| ≥ log2(1.5), 471 of the 500 artificial perturbations had 0 DEGs, 25 had 1 DEG, and 4 had 2 DEGs. By contrast, the true perturbations had a median of 193 DEGs (IQR 56–516.5).
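The DEG-calling rule above (5% FDR, |shrunken log2FC| ≥ log2(1.5)) is simple to restate in code. This sketch covers only the thresholding step, not the glmGamPoi/ebnm model fitting:

```python
import numpy as np

def call_degs(shrunken_log2fc, fdr, fc_cutoff=np.log2(1.5), fdr_cutoff=0.05):
    """Boolean DEG mask: FDR < 5% and |shrunken log2 fold-change| >= log2(1.5)."""
    shrunken_log2fc = np.asarray(shrunken_log2fc, dtype=float)
    fdr = np.asarray(fdr, dtype=float)
    return (fdr < fdr_cutoff) & (np.abs(shrunken_log2fc) >= fc_cutoff)

mask = call_degs([1.2, -0.1, -0.9, 0.6], [0.001, 0.2, 0.01, 0.04])
```

Here the second gene fails the fold-change cutoff, so only three of the four example genes are called as DEGs.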
Following the differential expression analysis, we processed the data similarly to the PerturbQA format, except with ternary labels. Differentially expressed genes (DEGs) were defined as genes with FDR < 0.05 and |shrunken log2FC| ≥ log2(1.5). Perturbations with fewer than 9 DEGs were excluded, and the remaining perturbations were separated into 80% train, 10% val, and 10% test splits. The full set of 10,000 possible target genes is also partitioned across splits (60% train / 20% val / 20% test). For each perturbation, up to 9 of the top DEGs are added to the dataset, and a similar number of non-DEGs are then added using a negative sampling strategy so that the dataset is balanced. In total there are 482 perturbations and 2,206 target genes, with 386/48/48 perturbations and 1,570/331/305 target genes in the final train/val/test splits, respectively.
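The per-perturbation example construction described above can be sketched as follows. Function and variable names are our own, and the real pipeline works on ranked DEG tables; this only illustrates the balanced positive/negative sampling:

```python
import random

def build_examples(perturbation, ranked_degs, all_targets, max_degs=9, seed=0):
    """Keep up to `max_degs` top DEGs as positive targets, then negative-sample
    an equal number of non-DEG genes so each perturbation contributes a
    balanced set of DEG / non-DEG questions."""
    rng = random.Random(seed)
    positives = ranked_degs[:max_degs]
    non_degs = [g for g in all_targets if g not in set(ranked_degs)]
    negatives = rng.sample(non_degs, k=len(positives))
    return ([(perturbation, g, "deg") for g in positives]
            + [(perturbation, g, "non-deg") for g in negatives])
```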
Micro AUROC across two evaluation tasks
The competition will be held on Kaggle. A submission will consist of:
Please see the GitHub repo for example submissions.
Submissions will be evaluated on:
It is permissible to submit continuous probability values as predictions instead of hard 1/0 labels.
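A micro-averaged AUROC over the ternary labels can be computed by pooling every (example, class) pair into one binary ranking problem. The sketch below is one plausible implementation of that scheme using the Mann-Whitney rank-sum formula; the official Kaggle scorer may differ in details:

```python
import numpy as np

def micro_auroc(y_true_onehot, y_prob):
    """Micro-averaged AUROC: flatten the one-hot label matrix and the score
    matrix, then compute binary AUROC via the rank-sum formula (tied scores
    receive averaged ranks)."""
    y = np.asarray(y_true_onehot).ravel()
    s = np.asarray(y_prob, dtype=float).ravel()
    order = s.argsort()
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    for v in np.unique(s):          # average the ranks of tied scores
        tied = s == v
        ranks[tied] = ranks[tied].mean()
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Perfectly calibrated predictions on two examples score 1.0.
score = micro_auroc([[1, 0, 0], [0, 1, 0]],
                    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
```

Because the metric is rank-based, submitting continuous probabilities (rather than hard labels) lets partially confident predictions earn partial credit.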
The top five teams’ solutions will undergo manual assessment by domain experts and LLM agents. This serves as a validity check to identify and disqualify models that engage in metric hacking.
Recognition for top-performing teams
Additionally, the top teams will be invited as co-authors to contribute their insights, experience, and findings to a manuscript prepared after the challenge and submitted to a high-profile venue.
General policies and track-specific constraints
Within the rules of each track, participants are encouraged to be creative in how they approach the problem. We expect strong solutions to explore the full design space of LLMs and agentic systems, including structured reasoning, prompt decomposition strategies, retrieval and knowledge integration, tool use, multi-agent workflows, and efficient fine-tuning. The challenge is intended not only to measure performance, but also to surface new ideas.
This track is designed to evaluate prompt optimization strategies (e.g., few-shot learning, context optimization). The LLM itself must not be fine-tuned.
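A minimal few-shot prompt for this track might look like the sketch below. The wording, exemplar lines, and placeholder gene names (GENE_A through GENE_D) are purely illustrative, not the official baseline prompt:

```python
# Illustrative few-shot prompt template for the prompting track. A team
# would replace the placeholder exemplars with real training examples.
FEW_SHOT_TEMPLATE = """You are predicting Perturb-seq outcomes in LPS-stimulated mouse macrophages.
Answer with exactly one of: up, down, no-change.

Example: knockout of GENE_A -> target GENE_B: down
Example: knockout of GENE_C -> target GENE_D: no-change

Question: knockout of {perturbed} -> target {target}:"""

def build_prompt(perturbed, target):
    """Fill the template for one (perturbation, target) question."""
    return FEW_SHOT_TEMPLATE.format(perturbed=perturbed, target=target)
```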
The LLM is run with fixed sampling parameters: temperature=1.0, top_p=1.0.

This track is designed to evaluate multi-agent tool use / tool calling. Teams should design the best set of tools and multi-agent architecture to maximize performance on the task.
The LLM is run with fixed sampling parameters: temperature=1.0, top_p=1.0.

This track is designed to evaluate LLM fine-tuning (SFT, RL, etc.).
The team behind the challenge