MLGenX Workshop @ ICLR 2026
MLGenX LLM Perturbation Challenge
The goal of this challenge is to test whether LLMs and agentic systems can do more than talk about biology: whether they can serve as useful computational engines for predicting cellular behavior.
The next wave of AI for science may emerge from the intersection of two frontier ideas: LLMs/agents as general reasoning systems, and virtual cells as predictive models of biological response. The MLGenX LLM Perturbation Challenge is designed to explore exactly that boundary.
In this challenge, participants tackle a focused but scientifically meaningful task: predicting the outcome of a Perturb-seq experiment. Given a CRISPR knockout of gene X, predict whether a target gene Y will be up-regulated, down-regulated, or unchanged in activated macrophages. This framing turns perturbation biology into a benchmark for machine reasoning: can an AI system combine biological knowledge, causal intuition, and structured inference to anticipate the downstream effects of an intervention?
The challenge is a controlled comparison of three emerging paradigms for scientific LLMs and agentic reasoning: prompt-guided reasoning, agentic tool use, and task-specific fine-tuning. In that sense, this is not just a competition in predictive performance. It is a benchmark for a broader question shaping the future of computational biology: can frontier LLMs and agents become practical building blocks for virtual cell models?
The core task is a supervised predictive challenge where participants must infer the effect of a specific gene perturbation on the expression of a target gene. Structurally, the task is defined as: Perturbation gene X → gene Y up/down/no-change. Participants will be scored based on their prediction of these post-processed labels, which indicate whether gene Y is up-regulated, down-regulated, or unchanged following the perturbation of gene X.
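As a concrete sketch of the task's input/output structure, one example can be represented as a (perturbation, target, label) triple. The encoding and the gene names below are illustrative, not the official submission format:

```python
# Hypothetical encoding of one challenge example: perturb gene X, predict
# the ternary effect on target gene Y. Gene names here are illustrative.
LABELS = {"down": -1, "no-change": 0, "up": 1}

def make_example(perturbed_gene, target_gene, label):
    """Bundle one (perturbation, target, label) triple."""
    assert label in LABELS, f"unknown label: {label}"
    return {"perturbation": perturbed_gene,
            "target": target_gene,
            "label": LABELS[label]}

example = make_example("Myd88", "Tnf", "down")
```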
The challenge consists of three tracks:
Track 1 (Prompting): Given a fixed LLM with no fine-tuning, design a prompt that encourages the model to give the best result.
Track 2 (Tool use): Given a fixed LLM with no fine-tuning, provide a prompt and a set of tools that maximize performance on the task.
Track 3 (Fine-tuning): Optimize the provided LLM to be an expert on this task.
The data consists of a set of ternary questions about genetic perturbation effects in macrophage cells.
This dataset contains gene expression profiles from primary mouse bone marrow-derived macrophages (BMDMs), which are innate immune cells differentiated from bone marrow cells isolated from Cas9-transgenic mice. The cells were transduced with a pooled CRISPR knockout library targeting genes involved in inflammatory signaling and other macrophage-relevant processes, then stimulated with bacterial lipopolysaccharide (LPS) to induce an immune response. Nine hours after stimulation, single-cell RNA profiles were collected together with perturbation identities, enabling analysis of how specific gene knockouts (perturbations) alter macrophage transcriptional states during inflammation.
Differential expression analysis was performed on the top 10,000 highly variable features selected by scanpy. For each perturbation, we used glmGamPoi to fit a negative binomial regression comparing the perturbed cells against 2,000 random cells carrying control perturbations (olfactory receptor genes), while controlling for batch as a covariate. The 4 guides per gene were pooled and counted as a single perturbation for this analysis. We shrank the log-fold-changes by applying ebnm to the flattened log-fold-change matrix, assuming a point-normal prior. To assess false positive rates of the differential expression results, we generated 500 artificial perturbations by sampling random quartets of olfactory receptor genes. Defining differentially expressed genes (DEGs) using cutoffs of 5% FDR and |shrunken log2FC| ≥ log2(1.5), 471 of the 500 artificial perturbations had 0 DEGs, 25 had 1 DEG, and 4 had 2 DEGs. By contrast, the true perturbations had a median of 193 DEGs (IQR 56–516.5).
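The DEG-calling rule above (5% FDR, |shrunken log2FC| ≥ log2(1.5)) is simple to restate in code. This sketch covers only the thresholding step, not the glmGamPoi/ebnm model fitting:

```python
import numpy as np

def call_degs(shrunken_log2fc, fdr, fc_cutoff=np.log2(1.5), fdr_cutoff=0.05):
    """Boolean DEG mask: FDR < 5% and |shrunken log2 fold-change| >= log2(1.5)."""
    shrunken_log2fc = np.asarray(shrunken_log2fc, dtype=float)
    fdr = np.asarray(fdr, dtype=float)
    return (fdr < fdr_cutoff) & (np.abs(shrunken_log2fc) >= fc_cutoff)

mask = call_degs([1.2, -0.1, -0.9, 0.6], [0.001, 0.2, 0.01, 0.04])
```

Here the second gene fails the fold-change cutoff, so only three of the four example genes are called as DEGs.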
Following the differential expression analysis, we processed the data similarly to the PerturbQA format, except with ternary labels. Differentially expressed genes (DEGs) were defined as genes with FDR < 0.05 and |shrunken log2FC| ≥ log2(1.5). Perturbations with fewer than 9 DEGs were excluded, and the remaining perturbations were separated into 80% train, 10% val, and 10% test splits. The full set of 10,000 possible target genes is also partitioned across splits (60% train / 20% val / 20% test). For each perturbation, up to 9 of the top DEGs are added to the dataset, and a similar number of non-DEGs are then added using a negative sampling strategy so that the dataset is balanced. In total there are 482 perturbations and 2,206 target genes, with 386/48/48 perturbations and 1,570/331/305 target genes in the final train/val/test splits, respectively.
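The per-perturbation example construction described above can be sketched as follows. Function and variable names are our own, and the real pipeline works on ranked DEG tables; this only illustrates the balanced positive/negative sampling:

```python
import random

def build_examples(perturbation, ranked_degs, all_targets, max_degs=9, seed=0):
    """Keep up to `max_degs` top DEGs as positive targets, then negative-sample
    an equal number of non-DEG genes so each perturbation contributes a
    balanced set of DEG / non-DEG questions."""
    rng = random.Random(seed)
    positives = ranked_degs[:max_degs]
    non_degs = [g for g in all_targets if g not in set(ranked_degs)]
    negatives = rng.sample(non_degs, k=len(positives))
    return ([(perturbation, g, "deg") for g in positives]
            + [(perturbation, g, "non-deg") for g in negatives])
```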
Micro AUROC across two evaluation tasks
The competition will be held on Kaggle. A submission will consist of:
Please see the GitHub repo for example submissions.
Submissions will be evaluated on:
It is permissible to submit continuous probability values as predictions instead of hard 1/0 labels.
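A micro-averaged AUROC over the ternary labels can be computed by pooling every (example, class) pair into one binary ranking problem. The sketch below is one plausible implementation of that scheme using the Mann-Whitney rank-sum formula; the official Kaggle scorer may differ in details:

```python
import numpy as np

def micro_auroc(y_true_onehot, y_prob):
    """Micro-averaged AUROC: flatten the one-hot label matrix and the score
    matrix, then compute binary AUROC via the rank-sum formula (tied scores
    receive averaged ranks)."""
    y = np.asarray(y_true_onehot).ravel()
    s = np.asarray(y_prob, dtype=float).ravel()
    order = s.argsort()
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    for v in np.unique(s):          # average the ranks of tied scores
        tied = s == v
        ranks[tied] = ranks[tied].mean()
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Perfectly calibrated predictions on two examples score 1.0.
score = micro_auroc([[1, 0, 0], [0, 1, 0]],
                    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
```

Because the metric is rank-based, submitting continuous probabilities (rather than hard labels) lets partially confident predictions earn partial credit.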
The top five teams’ solutions will undergo manual assessment by domain experts and LLM agents. This serves as a validity check to identify and disqualify models that engage in metric hacking.
Recognition for top-performing teams
Additionally, the top teams will be invited as co-authors to contribute their insights, experience, and findings to a manuscript prepared after the challenge and submitted to a high-profile venue.
General policies and track-specific constraints
Within the rules of each track, participants are encouraged to be creative in how they approach the problem. We expect strong solutions to explore the full design space of LLMs and agentic systems, including structured reasoning, prompt decomposition strategies, retrieval and knowledge integration, tool use, multi-agent workflows, and efficient fine-tuning. The challenge is intended not only to measure performance, but also to surface new ideas.
This track is designed to evaluate prompt optimization strategies (e.g., few-shot learning, context optimization). The LLM itself must not be fine-tuned.
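A minimal few-shot prompt for this track might look like the sketch below. The wording, exemplar lines, and placeholder gene names (GENE_A through GENE_D) are purely illustrative, not the official baseline prompt:

```python
# Illustrative few-shot prompt template for the prompting track. A team
# would replace the placeholder exemplars with real training examples.
FEW_SHOT_TEMPLATE = """You are predicting Perturb-seq outcomes in LPS-stimulated mouse macrophages.
Answer with exactly one of: up, down, no-change.

Example: knockout of GENE_A -> target GENE_B: down
Example: knockout of GENE_C -> target GENE_D: no-change

Question: knockout of {perturbed} -> target {target}:"""

def build_prompt(perturbed, target):
    """Fill the template for one (perturbation, target) question."""
    return FEW_SHOT_TEMPLATE.format(perturbed=perturbed, target=target)
```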
The LLM is run with fixed sampling parameters: temperature=1.0, top_p=1.0.

This track is designed to evaluate multi-agent tool use / tool calling. Teams should design the best set of tools and multi-agent architecture to maximize performance on the task.
The LLM is run with fixed sampling parameters: temperature=1.0, top_p=1.0.

This track is designed to evaluate LLM fine-tuning (SFT, RL, etc.).
The team behind the challenge