{ "cells": [ { "cell_type": "markdown", "id": "10fdb752-2248-4e3a-9678-2e0bf2288790", "metadata": {}, "source": [ "# Fine-tuning Borzoi to create a Decima model" ] }, { "cell_type": "code", "execution_count": 1, "id": "c6dbf5fc-85ca-42a8-b076-ba0313604e91", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:18.270014Z", "iopub.status.busy": "2025-11-21T22:29:18.269887Z", "iopub.status.idle": "2025-11-21T22:29:26.533418Z", "shell.execute_reply": "2025-11-21T22:29:26.532666Z" } }, "outputs": [], "source": [ "import glob\n", "import anndata\n", "import scanpy as sc\n", "import pandas as pd\n", "import bioframe as bf\n", "import os" ] }, { "cell_type": "code", "execution_count": 2, "id": "a8563bf1-0305-437b-81fa-0584753c5793", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.535200Z", "iopub.status.busy": "2025-11-21T22:29:26.534846Z", "iopub.status.idle": "2025-11-21T22:29:26.537436Z", "shell.execute_reply": "2025-11-21T22:29:26.537029Z" } }, "outputs": [], "source": [ "inputdir = \"./data\"\n", "outdir = \"./example\"\n", "ad_file_path = os.path.join(inputdir, \"data.h5ad\")\n", "h5_file_path = os.path.join(outdir, \"data.h5\")" ] }, { "cell_type": "markdown", "id": "4215ffc0-6a14-44b4-b522-7d4322a7cafe", "metadata": {}, "source": [ "## 1. Load input anndata file" ] }, { "cell_type": "markdown", "id": "f36c3e31-e447-42b8-b785-5d75b1a1007f", "metadata": {}, "source": [ "The input anndata file needs to be in the format (pseudobulks x genes)." 
] }, { "cell_type": "code", "execution_count": 3, "id": "83273dca-0622-42e2-a606-b645e0a31f19", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.538717Z", "iopub.status.busy": "2025-11-21T22:29:26.538588Z", "iopub.status.idle": "2025-11-21T22:29:26.580706Z", "shell.execute_reply": "2025-11-21T22:29:26.580293Z" } }, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 50 × 921\n", " obs: 'cell_type', 'tissue', 'disease', 'study'\n", " var: 'chrom', 'start', 'end', 'strand', 'gene_start', 'gene_end', 'gene_length', 'gene_mask_start', 'gene_mask_end', 'dataset'\n", " uns: 'log1p'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad = sc.read(ad_file_path)\n", "ad" ] }, { "cell_type": "markdown", "id": "dcb6c9a7-5e97-46fc-a2d3-fe029821c375", "metadata": {}, "source": [ "`.obs` should be a dataframe with a unique index per pseudobulk. You can also include other columns with metadata about the pseudobulks, e.g. cell type, tissue, disease, study, number of cells, total counts. \n", "\n", "Note that the original Decima model does NOT separate pseudobulks by sample, i.e. different samples from the same cell type, tissue, disease and study were merged. We also recommend filtering out pseudobulks with few cells or low read count. " ] }, { "cell_type": "code", "execution_count": 4, "id": "e29ca0c0-5f61-4146-b187-d11cc57373d0", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.581807Z", "iopub.status.busy": "2025-11-21T22:29:26.581680Z", "iopub.status.idle": "2025-11-21T22:29:26.602279Z", "shell.execute_reply": "2025-11-21T22:29:26.601917Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>cell_type</th><th>tissue</th><th>disease</th><th>study</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>pseudobulk_0</th><td>ct_0</td><td>t_0</td><td>d_0</td><td>st_0</td></tr>\n",
"    <tr><th>pseudobulk_1</th><td>ct_0</td><td>t_0</td><td>d_1</td><td>st_0</td></tr>\n",
"    <tr><th>pseudobulk_2</th><td>ct_0</td><td>t_0</td><td>d_2</td><td>st_1</td></tr>\n",
"    <tr><th>pseudobulk_3</th><td>ct_0</td><td>t_0</td><td>d_0</td><td>st_1</td></tr>\n",
"    <tr><th>pseudobulk_4</th><td>ct_0</td><td>t_0</td><td>d_1</td><td>st_2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " cell_type tissue disease study\n", "pseudobulk_0 ct_0 t_0 d_0 st_0\n", "pseudobulk_1 ct_0 t_0 d_1 st_0\n", "pseudobulk_2 ct_0 t_0 d_2 st_1\n", "pseudobulk_3 ct_0 t_0 d_0 st_1\n", "pseudobulk_4 ct_0 t_0 d_1 st_2" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.obs.head()" ] }, { "cell_type": "markdown", "id": "ab69e185-0d58-41e1-8d04-c60a4ed24ef5", "metadata": {}, "source": [ "`.var` should be a dataframe with a unique index per gene. The index can be the gene name or Ensembl ID, as long as it is unique. Other essential columns are: chrom, start, end and strand (the gene coordinates).\n", "\n", "You can also include other columns with metadata about the genes, e.g. Ensembl ID, type of gene." ] }, { "cell_type": "code", "execution_count": 5, "id": "a79a70c0-5a33-46dc-b363-4e9df6ab2b8a", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.603573Z", "iopub.status.busy": "2025-11-21T22:29:26.603444Z", "iopub.status.idle": "2025-11-21T22:29:26.609192Z", "shell.execute_reply": "2025-11-21T22:29:26.608798Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chrom</th><th>start</th><th>end</th><th>strand</th><th>gene_start</th><th>gene_end</th><th>gene_length</th><th>gene_mask_start</th><th>gene_mask_end</th><th>dataset</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>gene_0</th><td>chr1</td><td>26354840</td><td>26879128</td><td>+</td><td>26518680</td><td>27042968</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_1</th><td>chr19</td><td>41111417</td><td>41635705</td><td>-</td><td>40947577</td><td>41471865</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_2</th><td>chr1</td><td>79774026</td><td>80298314</td><td>-</td><td>79610186</td><td>80134474</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_4</th><td>chr16</td><td>3741368</td><td>4265656</td><td>-</td><td>3577528</td><td>4101816</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_5</th><td>chr10</td><td>22659481</td><td>23183769</td><td>+</td><td>22823321</td><td>23347609</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " chrom start end strand gene_start gene_end gene_length \\\n", "gene_0 chr1 26354840 26879128 + 26518680 27042968 524288 \n", "gene_1 chr19 41111417 41635705 - 40947577 41471865 524288 \n", "gene_2 chr1 79774026 80298314 - 79610186 80134474 524288 \n", "gene_4 chr16 3741368 4265656 - 3577528 4101816 524288 \n", "gene_5 chr10 22659481 23183769 + 22823321 23347609 524288 \n", "\n", " gene_mask_start gene_mask_end dataset \n", "gene_0 163840 524288 train \n", "gene_1 163840 524288 train \n", "gene_2 163840 524288 train \n", "gene_4 163840 524288 train \n", "gene_5 163840 524288 train " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.var.head()" ] }, { "cell_type": "markdown", "id": "55cd3f29-8bf7-47f8-8942-bd906a856ab7", "metadata": {}, "source": [ "`.X` should contain the total counts per gene and pseudobulk. These should be non-negative integers." ] }, { "cell_type": "code", "execution_count": 6, "id": "3fd9bd2a-e728-4dc5-9c90-a9558fab0e27", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.610372Z", "iopub.status.busy": "2025-11-21T22:29:26.610248Z", "iopub.status.idle": "2025-11-21T22:29:26.613043Z", "shell.execute_reply": "2025-11-21T22:29:26.612649Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[0. , 7.2926292, 7.2926292, 7.2926292, 7.2926292],\n", " [7.3133874, 7.3133874, 0. , 7.3133874, 7.3133874],\n", " [7.299993 , 7.299993 , 7.299993 , 7.299993 , 0. ],\n", " [7.299993 , 0. , 7.299993 , 7.299993 , 0. ],\n", " [7.3376517, 7.3376517, 0. , 7.3376517, 7.3376517]],\n", " dtype=float32)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.X[:5, :5]" ] }, { "cell_type": "markdown", "id": "9514b0b9-9cc3-48f0-9e70-897c1cb55962", "metadata": {}, "source": [ "## 2. 
Normalize and log transform data" ] }, { "cell_type": "markdown", "id": "cbca4273-1752-47dd-9b3a-9b29266787e3", "metadata": {}, "source": [ "We first transform the counts to log(CPM+1) values. CPM = Counts Per Million." ] }, { "cell_type": "code", "execution_count": 7, "id": "34115f7a-aaf8-4ca3-abbb-a4fc552bf5a7", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.614156Z", "iopub.status.busy": "2025-11-21T22:29:26.614039Z", "iopub.status.idle": "2025-11-21T22:29:26.616992Z", "shell.execute_reply": "2025-11-21T22:29:26.616589Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: adata.X seems to be already log-transformed.\n" ] } ], "source": [ "sc.pp.normalize_total(ad, target_sum=1e6)\n", "sc.pp.log1p(ad)" ] }, { "cell_type": "code", "execution_count": 8, "id": "e42a91c7-ac01-45b3-8d3b-6c99baf7adff", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.618117Z", "iopub.status.busy": "2025-11-21T22:29:26.617994Z", "iopub.status.idle": "2025-11-21T22:29:26.620685Z", "shell.execute_reply": "2025-11-21T22:29:26.620286Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[0. , 7.295568 , 7.295568 , 7.295568 , 7.295568 ],\n", " [7.316388 , 7.316388 , 0. , 7.316388 , 7.316388 ],\n", " [7.3014727, 7.3014727, 7.3014727, 7.3014727, 0. ],\n", " [7.3014727, 0. , 7.3014727, 7.3014727, 0. ],\n", " [7.3407264, 7.3407264, 0. , 7.3407264, 7.3407264]],\n", " dtype=float32)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.X[:5, :5]" ] }, { "cell_type": "markdown", "id": "1556f595-d4c7-4944-905e-060e4ae1c4f6", "metadata": {}, "source": [ "## 3. Create intervals surrounding genes" ] }, { "cell_type": "markdown", "id": "ff51db62-9c1d-4af7-b188-fed7b038e3fa", "metadata": {}, "source": [ "Decima is trained on 524,288 bp sequence surrounding the genes. Therefore, we have to take the given gene coordinates and extend them to create intervals of this length." 
] }, { "cell_type": "code", "execution_count": 9, "id": "86905140-4b30-424b-91ce-090a0a56ebab", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:26.621914Z", "iopub.status.busy": "2025-11-21T22:29:26.621802Z", "iopub.status.idle": "2025-11-21T22:29:40.486339Z", "shell.execute_reply": "2025-11-21T22:29:40.485817Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/celikm5/miniforge3/envs/decima2/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'repr' attribute with value False was provided to the `Field()` function, which has no effect in the context it was used. 'repr' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\n", " warnings.warn(\n", "/home/celikm5/miniforge3/envs/decima2/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'frozen' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'frozen' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. 
This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\n", " warnings.warn(\n" ] } ], "source": [ "from decima.data.preprocess import var_to_intervals" ] }, { "cell_type": "code", "execution_count": 10, "id": "d027eb23-48d6-40d9-9e30-b5c7691a7c53", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.487901Z", "iopub.status.busy": "2025-11-21T22:29:40.487652Z", "iopub.status.idle": "2025-11-21T22:29:40.494114Z", "shell.execute_reply": "2025-11-21T22:29:40.493712Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chrom</th><th>start</th><th>end</th><th>strand</th><th>gene_start</th><th>gene_end</th><th>gene_length</th><th>gene_mask_start</th><th>gene_mask_end</th><th>dataset</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>gene_0</th><td>chr1</td><td>26354840</td><td>26879128</td><td>+</td><td>26518680</td><td>27042968</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_1</th><td>chr19</td><td>41111417</td><td>41635705</td><td>-</td><td>40947577</td><td>41471865</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_2</th><td>chr1</td><td>79774026</td><td>80298314</td><td>-</td><td>79610186</td><td>80134474</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_4</th><td>chr16</td><td>3741368</td><td>4265656</td><td>-</td><td>3577528</td><td>4101816</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_5</th><td>chr10</td><td>22659481</td><td>23183769</td><td>+</td><td>22823321</td><td>23347609</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " chrom start end strand gene_start gene_end gene_length \\\n", "gene_0 chr1 26354840 26879128 + 26518680 27042968 524288 \n", "gene_1 chr19 41111417 41635705 - 40947577 41471865 524288 \n", "gene_2 chr1 79774026 80298314 - 79610186 80134474 524288 \n", "gene_4 chr16 3741368 4265656 - 3577528 4101816 524288 \n", "gene_5 chr10 22659481 23183769 + 22823321 23347609 524288 \n", "\n", " gene_mask_start gene_mask_end dataset \n", "gene_0 163840 524288 train \n", "gene_1 163840 524288 train \n", "gene_2 163840 524288 train \n", "gene_4 163840 524288 train \n", "gene_5 163840 524288 train " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.var.head()" ] }, { "cell_type": "markdown", "id": "0f067d6e-f0d7-48f9-b973-032fac069159", "metadata": {}, "source": [ "First, we copy the start and end columns to `gene_start` and `gene_end`. We also create a new column `gene_length`. " ] }, { "cell_type": "code", "execution_count": 11, "id": "566977ab-041f-4a3d-b10e-9b6fa717c98e", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.495318Z", "iopub.status.busy": "2025-11-21T22:29:40.495198Z", "iopub.status.idle": "2025-11-21T22:29:40.498415Z", "shell.execute_reply": "2025-11-21T22:29:40.498008Z" } }, "outputs": [], "source": [ "ad.var[\"gene_start\"] = ad.var.start.tolist()\n", "ad.var[\"gene_end\"] = ad.var.end.tolist()\n", "ad.var[\"gene_length\"] = ad.var[\"gene_end\"] - ad.var[\"gene_start\"]" ] }, { "cell_type": "code", "execution_count": 12, "id": "e23e95dd-6616-4f79-8d1b-3a0fe22816d0", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.499458Z", "iopub.status.busy": "2025-11-21T22:29:40.499339Z", "iopub.status.idle": "2025-11-21T22:29:40.504838Z", "shell.execute_reply": "2025-11-21T22:29:40.504440Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chrom</th><th>start</th><th>end</th><th>strand</th><th>gene_start</th><th>gene_end</th><th>gene_length</th><th>gene_mask_start</th><th>gene_mask_end</th><th>dataset</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>gene_0</th><td>chr1</td><td>26354840</td><td>26879128</td><td>+</td><td>26354840</td><td>26879128</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_1</th><td>chr19</td><td>41111417</td><td>41635705</td><td>-</td><td>41111417</td><td>41635705</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_2</th><td>chr1</td><td>79774026</td><td>80298314</td><td>-</td><td>79774026</td><td>80298314</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_4</th><td>chr16</td><td>3741368</td><td>4265656</td><td>-</td><td>3741368</td><td>4265656</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_5</th><td>chr10</td><td>22659481</td><td>23183769</td><td>+</td><td>22659481</td><td>23183769</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " chrom start end strand gene_start gene_end gene_length \\\n", "gene_0 chr1 26354840 26879128 + 26354840 26879128 524288 \n", "gene_1 chr19 41111417 41635705 - 41111417 41635705 524288 \n", "gene_2 chr1 79774026 80298314 - 79774026 80298314 524288 \n", "gene_4 chr16 3741368 4265656 - 3741368 4265656 524288 \n", "gene_5 chr10 22659481 23183769 + 22659481 23183769 524288 \n", "\n", " gene_mask_start gene_mask_end dataset \n", "gene_0 163840 524288 train \n", "gene_1 163840 524288 train \n", "gene_2 163840 524288 train \n", "gene_4 163840 524288 train \n", "gene_5 163840 524288 train " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.var.head()" ] }, { "cell_type": "markdown", "id": "33edbd2f-66f2-48db-ae30-f9aef55c78c3", "metadata": {}, "source": [ "Now, we extend the gene coordinates to create enclosing intervals:" ] }, { "cell_type": "code", "execution_count": 13, "id": "c40ebdb5-2d8c-4cce-9685-61d5db0123f3", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.505916Z", "iopub.status.busy": "2025-11-21T22:29:40.505798Z", "iopub.status.idle": "2025-11-21T22:29:40.664023Z", "shell.execute_reply": "2025-11-21T22:29:40.663629Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The interval size is 524288 bases. 
Of these, 163840 will be upstream of the gene start and 360448 will be downstream of the gene start.\n", "0 intervals extended beyond the chromosome start and have been shifted\n", "1 intervals extended beyond the chromosome end and have been shifted\n", "1 intervals did not extend far enough upstream of the TSS and have been dropped\n" ] } ], "source": [ "ad = var_to_intervals(ad, chr_end_pad=10000, genome=\"hg38\")\n", "# Replace genome name if necessary" ] }, { "cell_type": "code", "execution_count": 14, "id": "191ed1e0-d34f-4aa4-a8e7-bc5919642528", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.665110Z", "iopub.status.busy": "2025-11-21T22:29:40.664988Z", "iopub.status.idle": "2025-11-21T22:29:40.670440Z", "shell.execute_reply": "2025-11-21T22:29:40.670046Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chrom</th><th>start</th><th>end</th><th>strand</th><th>gene_start</th><th>gene_end</th><th>gene_length</th><th>gene_mask_start</th><th>gene_mask_end</th><th>dataset</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>gene_0</th><td>chr1</td><td>26191000</td><td>26715288</td><td>+</td><td>26354840</td><td>26879128</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_1</th><td>chr19</td><td>41275257</td><td>41799545</td><td>-</td><td>41111417</td><td>41635705</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_2</th><td>chr1</td><td>79937866</td><td>80462154</td><td>-</td><td>79774026</td><td>80298314</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_4</th><td>chr16</td><td>3905208</td><td>4429496</td><td>-</td><td>3741368</td><td>4265656</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_5</th><td>chr10</td><td>22495641</td><td>23019929</td><td>+</td><td>22659481</td><td>23183769</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " chrom start end strand gene_start gene_end gene_length \\\n", "gene_0 chr1 26191000 26715288 + 26354840 26879128 524288 \n", "gene_1 chr19 41275257 41799545 - 41111417 41635705 524288 \n", "gene_2 chr1 79937866 80462154 - 79774026 80298314 524288 \n", "gene_4 chr16 3905208 4429496 - 3741368 4265656 524288 \n", "gene_5 chr10 22495641 23019929 + 22659481 23183769 524288 \n", "\n", " gene_mask_start gene_mask_end dataset \n", "gene_0 163840 524288 train \n", "gene_1 163840 524288 train \n", "gene_2 163840 524288 train \n", "gene_4 163840 524288 train \n", "gene_5 163840 524288 train " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.var.head()" ] }, { "cell_type": "markdown", "id": "5733a394-28f5-487e-b967-18aad52cf423", "metadata": {}, "source": [ "You see that the columns `start` and `end` now contain the start and end coordinates for the 524,288 bp intervals." ] }, { "cell_type": "markdown", "id": "1a8df107-f38e-428b-8bda-47101708ebc7", "metadata": {}, "source": [ "## 3. Split genes into training, validation and test sets" ] }, { "cell_type": "markdown", "id": "747d3b4b-784f-4735-af81-ab5db4002a9d", "metadata": {}, "source": [ "We load the coordinates of the genomic regions used to train Borzoi:" ] }, { "cell_type": "code", "execution_count": 15, "id": "8f4db77e-71f1-4457-b5cf-8c5f8f98ad06", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.671608Z", "iopub.status.busy": "2025-11-21T22:29:40.671488Z", "iopub.status.idle": "2025-11-21T22:29:40.876755Z", "shell.execute_reply": "2025-11-21T22:29:40.876312Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chrom</th><th>start</th><th>end</th><th>fold</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>chr4</td><td>82524421</td><td>82721029</td><td>fold0</td></tr>\n",
"    <tr><th>1</th><td>chr13</td><td>18604798</td><td>18801406</td><td>fold0</td></tr>\n",
"    <tr><th>2</th><td>chr2</td><td>189923408</td><td>190120016</td><td>fold0</td></tr>\n",
"    <tr><th>3</th><td>chr10</td><td>59875743</td><td>60072351</td><td>fold0</td></tr>\n",
"    <tr><th>4</th><td>chr1</td><td>117109467</td><td>117306075</td><td>fold0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " chrom start end fold\n", "0 chr4 82524421 82721029 fold0\n", "1 chr13 18604798 18801406 fold0\n", "2 chr2 189923408 190120016 fold0\n", "3 chr10 59875743 60072351 fold0\n", "4 chr1 117109467 117306075 fold0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "splits_file = \"https://raw.githubusercontent.com/calico/borzoi/main/data/sequences_human.bed.gz\"\n", "# replace human with mouse for mm10 splits\n", "splits = pd.read_table(splits_file, header=None, names=[\"chrom\", \"start\", \"end\", \"fold\"])\n", "splits.head()" ] }, { "cell_type": "markdown", "id": "e7229a5f-d27e-48b3-8590-a23474635542", "metadata": {}, "source": [ "Now, we overlap our gene intervals with these regions:" ] }, { "cell_type": "code", "execution_count": 16, "id": "99d1a382-1384-41ea-b59a-bb4aaa26caba", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.877970Z", "iopub.status.busy": "2025-11-21T22:29:40.877840Z", "iopub.status.idle": "2025-11-21T22:29:40.907310Z", "shell.execute_reply": "2025-11-21T22:29:40.906870Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>gene</th><th>fold_</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>gene_0</td><td>fold5</td></tr>\n",
"    <tr><th>15</th><td>gene_1</td><td>fold0</td></tr>\n",
"    <tr><th>30</th><td>gene_2</td><td>fold0</td></tr>\n",
"    <tr><th>44</th><td>gene_4</td><td>fold2</td></tr>\n",
"    <tr><th>59</th><td>gene_5</td><td>fold2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " gene fold_\n", "0 gene_0 fold5\n", "15 gene_1 fold0\n", "30 gene_2 fold0\n", "44 gene_4 fold2\n", "59 gene_5 fold2" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "overlaps = bf.overlap(ad.var.reset_index(names=\"gene\"), splits, how=\"left\")\n", "overlaps = overlaps[[\"gene\", \"fold_\"]].drop_duplicates().astype(str)\n", "overlaps.head()" ] }, { "cell_type": "markdown", "id": "26a1a415-5299-48d6-94e0-7ee008d2bcc3", "metadata": {}, "source": [ "Based on the overlap, we divide our gene intervals into training, validation and test sets." ] }, { "cell_type": "code", "execution_count": 17, "id": "14180e40-c296-4f6f-b353-a07f383a7aae", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.908528Z", "iopub.status.busy": "2025-11-21T22:29:40.908396Z", "iopub.status.idle": "2025-11-21T22:29:40.911444Z", "shell.execute_reply": "2025-11-21T22:29:40.910959Z" } }, "outputs": [], "source": [ "test_genes = overlaps.gene[overlaps.fold_ == \"fold3\"].tolist()\n", "val_genes = overlaps.gene[overlaps.fold_ == \"fold4\"].tolist()\n", "train_genes = set(overlaps.gene).difference(set(test_genes).union(val_genes))" ] }, { "cell_type": "markdown", "id": "d6dc73a9-42d2-46c9-8176-34edbd70125d", "metadata": {}, "source": [ "And add this information back to `ad.var`." 
] }, { "cell_type": "code", "execution_count": 18, "id": "ed8411d8-aecf-4196-bcc3-7635c1b4f34a", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.912530Z", "iopub.status.busy": "2025-11-21T22:29:40.912405Z", "iopub.status.idle": "2025-11-21T22:29:40.918001Z", "shell.execute_reply": "2025-11-21T22:29:40.917575Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/slurmjob.14477843/ipykernel_3516462/3109841685.py:1: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.\n" ] } ], "source": [ "ad.var[\"dataset\"] = \"test\"\n", "ad.var.loc[ad.var.index.isin(val_genes), \"dataset\"] = \"val\"\n", "ad.var.loc[ad.var.index.isin(train_genes), \"dataset\"] = \"train\"" ] }, { "cell_type": "code", "execution_count": 19, "id": "6e71f3f9-61d5-496d-8b04-661d5e97c2b8", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.919025Z", "iopub.status.busy": "2025-11-21T22:29:40.918905Z", "iopub.status.idle": "2025-11-21T22:29:40.924425Z", "shell.execute_reply": "2025-11-21T22:29:40.924026Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chrom</th><th>start</th><th>end</th><th>strand</th><th>gene_start</th><th>gene_end</th><th>gene_length</th><th>gene_mask_start</th><th>gene_mask_end</th><th>dataset</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>gene_0</th><td>chr1</td><td>26191000</td><td>26715288</td><td>+</td><td>26354840</td><td>26879128</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_1</th><td>chr19</td><td>41275257</td><td>41799545</td><td>-</td><td>41111417</td><td>41635705</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_2</th><td>chr1</td><td>79937866</td><td>80462154</td><td>-</td><td>79774026</td><td>80298314</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_4</th><td>chr16</td><td>3905208</td><td>4429496</td><td>-</td><td>3741368</td><td>4265656</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"    <tr><th>gene_5</th><td>chr10</td><td>22495641</td><td>23019929</td><td>+</td><td>22659481</td><td>23183769</td><td>524288</td><td>163840</td><td>524288</td><td>train</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " chrom start end strand gene_start gene_end gene_length \\\n", "gene_0 chr1 26191000 26715288 + 26354840 26879128 524288 \n", "gene_1 chr19 41275257 41799545 - 41111417 41635705 524288 \n", "gene_2 chr1 79937866 80462154 - 79774026 80298314 524288 \n", "gene_4 chr16 3905208 4429496 - 3741368 4265656 524288 \n", "gene_5 chr10 22495641 23019929 + 22659481 23183769 524288 \n", "\n", " gene_mask_start gene_mask_end dataset \n", "gene_0 163840 524288 train \n", "gene_1 163840 524288 train \n", "gene_2 163840 524288 train \n", "gene_4 163840 524288 train \n", "gene_5 163840 524288 train " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.var.head()" ] }, { "cell_type": "code", "execution_count": 20, "id": "894e5f6f-1d0b-4c71-8f11-fcea38bea97d", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.925651Z", "iopub.status.busy": "2025-11-21T22:29:40.925524Z", "iopub.status.idle": "2025-11-21T22:29:40.928822Z", "shell.execute_reply": "2025-11-21T22:29:40.928340Z" } }, "outputs": [ { "data": { "text/plain": [ "dataset\n", "train 766\n", "test 83\n", "val 71\n", "Name: count, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad.var.dataset.value_counts()" ] }, { "cell_type": "markdown", "id": "0a99d7e9-6b69-42ed-ae2b-fda6c0229035", "metadata": {}, "source": [ "We have now divided the 920 genes in our dataset into separate sets to be used for training, validation and testing." ] }, { "cell_type": "markdown", "id": "c3ce727d-ae1a-4e9b-bf57-c6856e4e21e7", "metadata": {}, "source": [ "## 4. Save processed anndata" ] }, { "cell_type": "markdown", "id": "f950a054-a749-4a6b-a0c5-11b162e1babe", "metadata": {}, "source": [ "We will save the processed anndata file containing these intervals and data splits."
] }, { "cell_type": "code", "execution_count": 21, "id": "e55d7251-5372-4744-a1f3-411223d4eb35", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.929937Z", "iopub.status.busy": "2025-11-21T22:29:40.929820Z", "iopub.status.idle": "2025-11-21T22:29:40.998651Z", "shell.execute_reply": "2025-11-21T22:29:40.998240Z" } }, "outputs": [], "source": [ "ad.write_h5ad(ad_file_path)" ] }, { "cell_type": "markdown", "id": "a348fd0c-746f-4f3f-9c22-1cef3220460c", "metadata": {}, "source": [ "## 5. Create an hdf5 file" ] }, { "cell_type": "markdown", "id": "f5b813a4-ad04-421c-9e13-a5cbb4c4885b", "metadata": {}, "source": [ "To train Decima, we need to extract the genomic sequences for all the intervals and convert them to one-hot encoded format. We save these one-hot encoded inputs to an hdf5 file." ] }, { "cell_type": "code", "execution_count": 22, "id": "3b1473fd-fd72-41e7-a673-92fb474440ec", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:40.999863Z", "iopub.status.busy": "2025-11-21T22:29:40.999743Z", "iopub.status.idle": "2025-11-21T22:29:41.024192Z", "shell.execute_reply": "2025-11-21T22:29:41.023764Z" } }, "outputs": [], "source": [ "from decima.data.write_hdf5 import write_hdf5" ] }, { "cell_type": "code", "execution_count": 23, "id": "925184a3", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:41.025478Z", "iopub.status.busy": "2025-11-21T22:29:41.025360Z", "iopub.status.idle": "2025-11-21T22:29:41.191312Z", "shell.execute_reply": "2025-11-21T22:29:41.190672Z" } }, "outputs": [], "source": [ "! 
mkdir -p example" ] }, { "cell_type": "code", "execution_count": 24, "id": "81e3d301-db22-4a00-9cdc-f6ec7c641030", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:29:41.192756Z", "iopub.status.busy": "2025-11-21T22:29:41.192597Z", "iopub.status.idle": "2025-11-21T22:30:41.442605Z", "shell.execute_reply": "2025-11-21T22:30:41.442010Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing metadata\n", "Writing task indices\n", "Writing genes array of shape: (920, 2)\n", "Writing labels array of shape: (920, 50, 1)\n", "Making gene masks\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Writing mask array of shape: (920, 534288)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Encoding sequences\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Writing sequence array of shape: (920, 534288)\n", "Done!\n" ] } ], "source": [ "write_hdf5(file=h5_file_path, ad=ad, pad=5000, genome=\"hg38\")\n", "# Change genome name if necessary" ] }, { "cell_type": "markdown", "id": "2e7ebe9c-a523-4aff-8bcd-3faf46917caf", "metadata": {}, "source": [ "## 6. Set training parameters" ] }, { "cell_type": "code", "execution_count": 25, "id": "d9ca9ee3-c90a-4848-b9ed-899f21fbf39e", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:30:41.444116Z", "iopub.status.busy": "2025-11-21T22:30:41.443952Z", "iopub.status.idle": "2025-11-21T22:30:41.446732Z", "shell.execute_reply": "2025-11-21T22:30:41.446334Z" } }, "outputs": [], "source": [ "# Learning rate default=0.001\n", "lr = 5e-5\n", "# Total weight parameter for the loss function\n", "total_weight = 1e-4\n", "# Gradient accumulation steps\n", "grad = 5\n", "# batch-size. default=4\n", "bs = 4\n", "# max-seq-shift. default=5000\n", "shift = 5000\n", "# Number of epochs. 
Default 1\n", "epochs = 15\n", "\n", "# logger\n", "logger = \"wandb\" # Change to csv to save logs locally\n", "\n", "# Number of workers default=16\n", "workers = 16" ] }, { "cell_type": "markdown", "id": "f74beb10-0045-4c3c-9bc3-b5037053241b", "metadata": {}, "source": [ "## 7. Generate training commands" ] }, { "cell_type": "code", "execution_count": 26, "id": "676bba10-feaf-4bcf-ae76-91cf163ac26a", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:30:41.447823Z", "iopub.status.busy": "2025-11-21T22:30:41.447700Z", "iopub.status.idle": "2025-11-21T22:30:41.450388Z", "shell.execute_reply": "2025-11-21T22:30:41.450000Z" } }, "outputs": [], "source": [ "cmds = []\n", "\n", "for model in range(4):\n", " name = f\"finetune_test_{model}\"\n", " device = model\n", "\n", " cmd = (\n", " f\"decima finetune --name {name} \"\n", " + f\"--model {model} --device {device} \"\n", " + f\"--matrix-file {ad_file_path} --h5-file {h5_file_path} \"\n", " + f\"--outdir {outdir} --learning-rate {lr} \"\n", " + f\"--loss-total-weight {total_weight} --gradient-accumulation {grad} \"\n", " + f\"--batch-size {bs} --max-seq-shift {shift} \"\n", " + f\"--epochs {epochs} --logger {logger} --num-workers {workers}\"\n", " )\n", " cmds.append(cmd)" ] }, { "cell_type": "code", "execution_count": 27, "id": "6aee97bd-7b06-4af5-834e-c0f08127e75c", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:30:41.451403Z", "iopub.status.busy": "2025-11-21T22:30:41.451285Z", "iopub.status.idle": "2025-11-21T22:30:41.453392Z", "shell.execute_reply": "2025-11-21T22:30:41.452977Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "decima finetune --name finetune_test_0 --model 0 --device 0 --matrix-file ./data/data.h5ad --h5-file ./example/data.h5 --outdir ./example --learning-rate 5e-05 --loss-total-weight 0.0001 --gradient-accumulation 5 --batch-size 4 --max-seq-shift 5000 --epochs 15 --logger wandb --num-workers 16\n", "decima finetune --name 
finetune_test_1 --model 1 --device 1 --matrix-file ./data/data.h5ad --h5-file ./example/data.h5 --outdir ./example --learning-rate 5e-05 --loss-total-weight 0.0001 --gradient-accumulation 5 --batch-size 4 --max-seq-shift 5000 --epochs 15 --logger wandb --num-workers 16\n", "decima finetune --name finetune_test_2 --model 2 --device 2 --matrix-file ./data/data.h5ad --h5-file ./example/data.h5 --outdir ./example --learning-rate 5e-05 --loss-total-weight 0.0001 --gradient-accumulation 5 --batch-size 4 --max-seq-shift 5000 --epochs 15 --logger wandb --num-workers 16\n", "decima finetune --name finetune_test_3 --model 3 --device 3 --matrix-file ./data/data.h5ad --h5-file ./example/data.h5 --outdir ./example --learning-rate 5e-05 --loss-total-weight 0.0001 --gradient-accumulation 5 --batch-size 4 --max-seq-shift 5000 --epochs 15 --logger wandb --num-workers 16\n" ] } ], "source": [ "for cmd in cmds:\n", " print(cmd)" ] }, { "cell_type": "markdown", "id": "4133e741", "metadata": {}, "source": [ "Here, we train the model for only 1 epoch so that the tutorial runs quickly. Train for more epochs when fine-tuning on your own data." ] }, { "cell_type": "code", "execution_count": 28, "id": "d0fdaa9d", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:30:41.454479Z", "iopub.status.busy": "2025-11-21T22:30:41.454355Z", "iopub.status.idle": "2025-11-21T22:35:16.456721Z", "shell.execute_reply": "2025-11-21T22:35:16.455981Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/celikm5/miniforge3/envs/decima2/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'repr' attribute with value False was provided to the `Field()` function, which has no effect in the context it was used. 'repr' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. 
This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\r\n", " warnings.warn(\r\n", "/home/celikm5/miniforge3/envs/decima2/lib/python3.11/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'frozen' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'frozen' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.\r\n", " warnings.warn(\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "decima - INFO - Data paths: matrix_file=./data/data.h5ad, h5_file=./example/data.h5\r\n", "decima - INFO - Reading anndata\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "decima - INFO - Making dataset objects\r\n", "decima - INFO - train_params: {'batch_size': 1, 'num_workers': 16, 'devices': 0, 'logger': 'wandb', 'save_dir': './example', 'max_epochs': 1, 'lr': 5e-05, 'total_weight': 0.0001, 'accumulate_grad_batches': 5, 'loss': 'poisson_multinomial', 'clip': 0.0, 'save_top_k': 1, 'pin_memory': True}\r\n", "decima - INFO - model_params: {'n_tasks': 50, 'init_borzoi': True, 'replicate': '0'}\r\n", "decima - INFO - Initializing model\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "decima - INFO - Initializing weights from Borzoi model using wandb for replicate: 0\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mmhcelik\u001b[0m (\u001b[33mmhcw\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. 
Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact 'human_state_dict_fold0:latest', 709.30MB. 1 files...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \r\n", "Done. 00:00:01.7 (406.1MB/s)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "decima - INFO - Connecting to wandb.\r\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mmhcelik\u001b[0m (\u001b[33mmhcw\u001b[0m) to \u001b[32mhttps://genentech.wandb.io\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[38;5;178m⢿\u001b[0m Waiting for wandb.init()...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[Am\u001b[2K\r", "\u001b[34m\u001b[1mwandb\u001b[0m: \u001b[38;5;178m⣻\u001b[0m setting up run g20ya0al (0.2s)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[Am\u001b[2K\r", "\u001b[34m\u001b[1mwandb\u001b[0m: Tracking run with wandb version 0.22.2\r\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Run data is saved locally in \u001b[35m\u001b[1mfinetune_test_0/wandb/run-20251121_143055-g20ya0al\u001b[0m\r\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Run \u001b[1m`wandb offline`\u001b[0m to turn off syncing.\r\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Syncing run \u001b[33mfinetune_test_0\u001b[0m\r\n", "\u001b[34m\u001b[1mwandb\u001b[0m: ⭐️ View project at \u001b[34m\u001b[4mhttps://genentech.wandb.io/grelu/decima\u001b[0m\r\n", "\u001b[34m\u001b[1mwandb\u001b[0m: 🚀 View run at \u001b[34m\u001b[4mhttps://genentech.wandb.io/grelu/decima/runs/g20ya0al\u001b[0m\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "decima - INFO - Training\r\n" ] }, { "name": "stdout", 
"output_type": "stream", "text": [ "/home/celikm5/miniforge3/envs/decima2/lib/python3.11/site-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Using 16bit Automatic Mixed Precision (AMP)\r\n", "GPU available: True (cuda), used: True\r\n", "TPU available: False, using: 0 TPU cores\r\n", "HPU available: False, using: 0 HPUs\r\n", "/home/celikm5/miniforge3/envs/decima2/lib/python3.11/site-packages/torch/utils/data/dataloader.py:627: UserWarning: This DataLoader will create 16 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/home/celikm5/miniforge3/envs/decima2/lib/python3.11/site-packages/pytorch_lightning/loggers/wandb.py:397: UserWarning: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.\r\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "SLURM auto-requeueing enabled. 
Setting signal handlers.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "Validation: | | 0/? [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
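<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>cell_type</th><th>tissue</th><th>disease</th><th>study</th><th>size_factor</th><th>train_pearson</th><th>val_pearson</th><th>test_pearson</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>pseudobulk_0</th><td>ct_0</td><td>t_0</td><td>d_0</td><td>st_0</td><td>4946.397461</td><td>0.010020</td><td>0.171944</td><td>0.122095</td></tr>\n",
"<tr><th>pseudobulk_1</th><td>ct_0</td><td>t_0</td><td>d_1</td><td>st_0</td><td>4858.091797</td><td>-0.024151</td><td>0.061900</td><td>-0.169406</td></tr>\n",
"<tr><th>pseudobulk_2</th><td>ct_0</td><td>t_0</td><td>d_2</td><td>st_1</td><td>4921.185547</td><td>0.007005</td><td>-0.079252</td><td>-0.094602</td></tr>\n",
"<tr><th>pseudobulk_3</th><td>ct_0</td><td>t_0</td><td>d_0</td><td>st_1</td><td>4928.486816</td><td>0.016869</td><td>-0.023038</td><td>0.007967</td></tr>\n",
"<tr><th>pseudobulk_4</th><td>ct_0</td><td>t_0</td><td>d_1</td><td>st_2</td><td>4756.819336</td><td>0.050297</td><td>0.160398</td><td>-0.101163</td></tr>\n",
"</tbody>\n",
"</table>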
\n", "" ], "text/plain": [ " cell_type tissue disease study size_factor train_pearson \\\n", "pseudobulk_0 ct_0 t_0 d_0 st_0 4946.397461 0.010020 \n", "pseudobulk_1 ct_0 t_0 d_1 st_0 4858.091797 -0.024151 \n", "pseudobulk_2 ct_0 t_0 d_2 st_1 4921.185547 0.007005 \n", "pseudobulk_3 ct_0 t_0 d_0 st_1 4928.486816 0.016869 \n", "pseudobulk_4 ct_0 t_0 d_1 st_2 4756.819336 0.050297 \n", "\n", " val_pearson test_pearson \n", "pseudobulk_0 0.171944 0.122095 \n", "pseudobulk_1 0.061900 -0.169406 \n", "pseudobulk_2 -0.079252 -0.094602 \n", "pseudobulk_3 -0.023038 0.007967 \n", "pseudobulk_4 0.160398 -0.101163 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad_out.obs.head()" ] }, { "cell_type": "code", "execution_count": 36, "id": "121a7787-4c74-465f-ae93-b529564cc2fa", "metadata": { "execution": { "iopub.execute_input": "2025-11-21T22:39:05.796225Z", "iopub.status.busy": "2025-11-21T22:39:05.796089Z", "iopub.status.idle": "2025-11-21T22:39:05.802450Z", "shell.execute_reply": "2025-11-21T22:39:05.801934Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
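<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>chrom</th><th>start</th><th>end</th><th>strand</th><th>gene_start</th><th>gene_end</th><th>gene_length</th><th>gene_mask_start</th><th>gene_mask_end</th><th>dataset</th><th>pearson</th><th>size_factor_pearson</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>gene_0</th><td>chr1</td><td>26191000</td><td>26715288</td><td>+</td><td>26354840</td><td>26879128</td><td>524288</td><td>163840</td><td>524288</td><td>train</td><td>0.177304</td><td>-0.062494</td></tr>\n",
"<tr><th>gene_1</th><td>chr19</td><td>41275257</td><td>41799545</td><td>-</td><td>41111417</td><td>41635705</td><td>524288</td><td>163840</td><td>524288</td><td>train</td><td>0.049450</td><td>-0.037428</td></tr>\n",
"<tr><th>gene_2</th><td>chr1</td><td>79937866</td><td>80462154</td><td>-</td><td>79774026</td><td>80298314</td><td>524288</td><td>163840</td><td>524288</td><td>train</td><td>-0.095439</td><td>0.240203</td></tr>\n",
"<tr><th>gene_4</th><td>chr16</td><td>3905208</td><td>4429496</td><td>-</td><td>3741368</td><td>4265656</td><td>524288</td><td>163840</td><td>524288</td><td>train</td><td>-0.092946</td><td>-0.042283</td></tr>\n",
"<tr><th>gene_5</th><td>chr10</td><td>22495641</td><td>23019929</td><td>+</td><td>22659481</td><td>23183769</td><td>524288</td><td>163840</td><td>524288</td><td>train</td><td>-0.310151</td><td>-0.069181</td></tr>\n",
"</tbody>\n",
"</table>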
\n", "
" ], "text/plain": [ " chrom start end strand gene_start gene_end gene_length \\\n", "gene_0 chr1 26191000 26715288 + 26354840 26879128 524288 \n", "gene_1 chr19 41275257 41799545 - 41111417 41635705 524288 \n", "gene_2 chr1 79937866 80462154 - 79774026 80298314 524288 \n", "gene_4 chr16 3905208 4429496 - 3741368 4265656 524288 \n", "gene_5 chr10 22495641 23019929 + 22659481 23183769 524288 \n", "\n", " gene_mask_start gene_mask_end dataset pearson size_factor_pearson \n", "gene_0 163840 524288 train 0.177304 -0.062494 \n", "gene_1 163840 524288 train 0.049450 -0.037428 \n", "gene_2 163840 524288 train -0.095439 0.240203 \n", "gene_4 163840 524288 train -0.092946 -0.042283 \n", "gene_5 163840 524288 train -0.310151 -0.069181 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ad_out.var.head()" ] } ], "metadata": { "kernelspec": { "display_name": "decima2", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 5 }