{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Variant Effect Prediction with Decima" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decima's Variant Effect Prediction (VEP) module allows you to predict the effects of genetic variants on gene expression. This tutorial demonstrates how to use the VEP functionality through both command-line interface (CLI) and Python API. The VEP module takes variant file as input (in TSV or VCF format) and predicts their effects on gene expression across different cell types and tissues if provided." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## CLI API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CLI API for variant effect prediction on gene expression." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: decima vep [OPTIONS]\n", "\n", " Predict variant effect and save to parquet\n", "\n", " Examples:\n", "\n", " >>> decima vep -v \"data/sample.vcf\" -o \"vep_results.parquet\"\n", "\n", " >>> decima vep -v \"data/sample.vcf\" -o \"vep_results.parquet\" --tasks\n", " \"cell_type == 'classical monocyte'\" # only predict for classical\n", " monocytes\n", "\n", " >>> decima vep -v \"data/sample.vcf\" -o \"vep_results.parquet\" --device 0\n", " # use device gpu device 0\n", "\n", " >>> decima vep -v \"data/sample.vcf\" -o \"vep_results.parquet\" --include-\n", " cols \"gene_name,gene_id\" # include gene_name and gene_id columns in the\n", " output\n", "\n", " >>> decima vep -v \"data/sample.vcf\" -o \"vep_results.parquet\" --gene-col\n", " \"gene_name\" # use gene_name column as gene names if these option passed\n", " genes and variants mapped based on these column not based on the genomic\n", " locus based on the annotaiton.\n", "\n", "Options:\n", " -v, --variants PATH Path to the variant file .vcf file\n", " -o, --output_pq PATH Path to the output parquet file.\n", " --tasks TEXT Tasks to predict. If not provided, all tasks will\n", " be predicted.\n", " --chunksize INTEGER Number of variants to process in each chunk.\n", " Loading variants in chunks is more memory\n", " efficient.This chuck of variants will be process\n", " and saved to output parquet file before contineus\n", " to next chunk. Default: 10_000.\n", " --model INTEGER Model to use for variant effect prediction either\n", " replicate number or path to the model.\n", " --device TEXT Device to use. Default: None which automatically\n", " selects the best device.\n", " --batch-size INTEGER Batch size for the model. Default: 1.\n", " --num-workers INTEGER Number of workers for the loader. Default: 1.\n", " --max-distance FLOAT Maximum distance from the TSS. Default: 524288.\n", " --max-distance-type TEXT Type of maximum distance. Default: tss.\n", " --include-cols TEXT Columns to include in the output in the original\n", " tsv file to include in the output parquet file.\n", " Default: None.\n", " --gene-col TEXT Column name for gene names. Default: None.\n", " --genome TEXT Genome build. Default: hg38.\n", " --help Show this message and exit.\n", "\u001b[0m" ] } ], "source": [ "! 
decima vep --help" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The VEP module takes a VCF file as input, identifies variants near genes, and predicts their effects on gene expression in a cell type-specific manner. The results are saved as a parquet file containing the following columns:\n", "\n", "- chrom: Chromosome where the variant is located\n", "- pos: Genomic position of the variant\n", "- ref: Reference allele\n", "- alt: Alternative allele\n", "- gene: Gene name\n", "- start: Gene start position\n", "- end: Gene end position\n", "- strand: Gene strand\n", "- gene_mask_start: Start position of gene mask\n", "- gene_mask_end: End position of gene mask\n", "- rel_pos: Relative position within gene\n", "- ref_tx: Reference transcript\n", "- alt_tx: Alternative transcript\n", "- tss_dist: Distance to transcription start site\n", "- cell_0, cell_1, etc.: Predicted gene expression changes for each cell type" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! decima vep -v \"data/sample.vcf\" -o \"vep_vcf_results.parquet\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "unknown: 0 / 48 \n", "allele_mismatch_with_reference_genome: 26 / 48 \n" ] } ], "source": [ "! cat vep_vcf_results.parquet.warnings.log" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " chrom pos ref alt gene start end strand \\\n", "0 chr1 1002308 T C FAM41C 516455 1040743 - \n", "1 chr1 1002308 T C NOC2L 598861 1123149 - \n", "2 chr1 1002308 T C PERM1 621645 1145933 - \n", "3 chr1 1002308 T C HES4 639724 1164012 - \n", "4 chr1 1002308 T C FAM87B 653531 1177819 + \n", "5 chr1 1002308 T C RNF223 713858 1238146 - \n", "6 chr1 1002308 T C C1orf159 755913 1280201 - \n", "7 chr1 1002308 T C SAMD11 760088 1284376 + \n", "8 chr1 1002308 T C KLHL17 796744 1321032 + \n", "9 chr1 1002308 T C PLEKHN1 802642 1326930 + \n", "10 chr1 1002308 T C TTLL10-AS1 819107 1343395 - \n", "11 chr1 1002308 T C ISG15 837298 1361586 + \n", "12 chr1 1002308 T C TNFRSF18 846144 1370432 - \n", "13 chr1 1002308 T C TNFRSF4 853705 1377993 - \n", "14 chr1 1002308 T C AGRN 856280 1380568 + \n", "15 chr1 1002308 T C SDF4 871619 1395907 - \n", "16 chr1 1002308 T C C1QTNF12 886274 1410562 - \n", "17 chr1 1002308 T C UBE2J2 913437 1437725 - \n", "18 chr1 1002308 T C ACAP3 949161 1473449 - \n", "19 chr1 1002308 T C INTS11 964243 1488531 - \n", "20 chr1 1002308 T C DVL1 988970 1513258 - \n", "21 chr1 1002308 T C MXRA8 1001329 1525617 - \n", "22 chr1 109727471 A C GNAT2 109259481 109783769 - \n", "23 chr1 109728807 TTT G GNAT2 109259481 109783769 - \n", "24 chr1 109727471 A C SYPL2 109302706 109826994 + \n", "25 chr1 109728807 TTT G SYPL2 109302706 109826994 + \n", "26 chr1 109727471 A C ATXN7L2 109319639 109843927 + \n", "27 chr1 109728807 TTT G ATXN7L2 109319639 109843927 + \n", "28 chr1 109727471 A C CYB561D1 109330212 109854500 + \n", "29 chr1 109728807 TTT G CYB561D1 109330212 109854500 + \n", "30 chr1 109727471 A C GPR61 109376032 109900320 + \n", "31 chr1 109728807 TTT G GPR61 109376032 109900320 + \n", "32 chr1 109727471 A C GSTM3 109380590 109904878 - \n", "33 chr1 109728807 TTT G GSTM3 109380590 109904878 - \n", "34 chr1 109727471 A C GNAI3 109384775 109909063 + \n", "35 chr1 109728807 TTT G GNAI3 109384775 109909063 + \n", "36 chr1 109727471 A C AMPD2 109452264 109976552 + \n", "37 chr1 109728807 TTT G AMPD2 109452264 109976552 + \n", "38 chr1 109727471 A C GSTM4 109492259 110016547 + \n", "39 chr1 109728807 TTT G GSTM4 109492259 110016547 + \n", "40 chr1 109727471 A C GSTM2 109504182 110028470 + \n", "41 chr1 109728807 TTT G GSTM2 109504182 110028470 + \n", "42 chr1 109727471 A C GSTM1 109523974 110048262 + \n", "43 chr1 109728807 TTT G GSTM1 109523974 110048262 + \n", "44 chr1 109727471 A C GSTM5 109547940 110072228 + \n", "45 chr1 109728807 TTT G GSTM5 109547940 110072228 + \n", "46 chr1 109727471 A C ALX3 109710224 110234512 - \n", "47 chr1 109728807 TTT G ALX3 109710224 110234512 - \n", "\n", " gene_mask_start gene_mask_end ... agg_9528 agg_9529 agg_9530 \\\n", "0 163840 172672 ... -0.000053 -0.000153 -0.000089 \n", "1 163840 178946 ... -0.000586 -0.000891 -0.000565 \n", "2 163840 170729 ... -0.000565 -0.000787 -0.000515 \n", "3 163840 165050 ... -0.001453 -0.001775 -0.001403 \n", "4 163840 166306 ... 0.000045 0.000105 0.000060 \n", "5 163840 167179 ... -0.000561 -0.000826 -0.000536 \n", "6 163840 198383 ... -0.000399 -0.000585 -0.000374 \n", "7 163840 184493 ... 0.001317 0.001001 0.000884 \n", "8 163840 168975 ... -0.000244 -0.000214 -0.000310 \n", "9 163840 173223 ... 0.000012 -0.000113 0.000002 \n", "10 163840 170339 ... -0.000286 -0.000486 -0.000312 \n", "11 163840 177242 ... 0.007558 0.004934 0.007966 \n", "12 163840 166924 ... -0.000190 -0.000248 -0.000215 \n", "13 163840 166653 ... -0.000217 -0.000332 -0.000225 \n", "14 163840 199838 ... 
-0.000643 -0.000886 -0.000513 \n", "15 163840 178999 ... -0.000031 -0.000034 -0.000048 \n", "16 163840 168116 ... -0.000396 -0.000607 -0.000421 \n", "17 163840 183816 ... -0.000809 -0.001129 -0.000884 \n", "18 163840 181059 ... -0.000826 -0.001292 -0.000932 \n", "19 163840 176946 ... -0.000033 -0.000067 -0.000013 \n", "20 163840 177982 ... -0.000297 -0.000456 -0.000343 \n", "21 163840 172928 ... -0.000118 -0.000153 -0.000130 \n", "22 163840 180515 ... 0.000099 0.000077 0.000057 \n", "23 163840 180515 ... 0.003284 0.005105 0.002662 \n", "24 163840 179428 ... 0.002200 0.001744 0.001206 \n", "25 163840 179428 ... -0.004052 -0.003465 -0.003092 \n", "26 163840 173165 ... 0.000001 0.000010 0.000091 \n", "27 163840 173165 ... 0.000536 -0.000179 0.001585 \n", "28 163840 172720 ... 0.000584 0.000577 0.000674 \n", "29 163840 172720 ... 0.003762 0.002849 0.001487 \n", "30 163840 172374 ... 0.000155 0.000251 0.000114 \n", "31 163840 172374 ... 0.000258 0.000810 0.000308 \n", "32 163840 170946 ... -0.000651 -0.000626 -0.000446 \n", "33 163840 170946 ... 0.014822 0.029157 0.015316 \n", "34 163840 233549 ... 0.002052 0.002637 0.003581 \n", "35 163840 233549 ... -0.002514 -0.002176 0.001273 \n", "36 163840 179789 ... 0.000870 0.001771 0.002003 \n", "37 163840 179789 ... 0.015270 0.004848 0.011452 \n", "38 163840 182577 ... -0.000336 0.001399 0.000511 \n", "39 163840 182577 ... -0.032543 -0.033684 -0.029430 \n", "40 163840 205369 ... -0.001145 -0.001371 0.000107 \n", "41 163840 205369 ... 0.006410 0.010421 0.008430 \n", "42 163840 185065 ... 0.000849 0.001993 0.001033 \n", "43 163840 185065 ... 0.009973 0.012482 0.003073 \n", "44 163840 227488 ... 0.002434 0.001946 0.001824 \n", "45 163840 227488 ... -0.005434 0.003420 -0.019319 \n", "46 163840 174642 ... 0.000210 0.000521 0.000256 \n", "47 163840 174642 ... 
-0.000382 -0.000466 -0.000408 \n", "\n", " agg_9531 agg_9532 agg_9533 agg_9535 agg_9536 agg_9537 agg_9538 \n", "0 -0.000016 -0.000013 -0.000040 -0.000070 7.629395e-05 -0.000026 -0.000069 \n", "1 -0.000311 -0.000461 -0.000252 -0.000376 -6.060898e-04 -0.000454 -0.000725 \n", "2 -0.000279 -0.000354 -0.000202 -0.000342 -5.399883e-04 -0.000423 -0.000597 \n", "3 -0.000820 -0.000896 -0.000575 -0.001113 -1.119316e-03 -0.001036 -0.001202 \n", "4 -0.000030 -0.000049 0.000029 0.000135 1.490116e-07 0.000057 0.000001 \n", "5 -0.000299 -0.000324 -0.000196 -0.000374 -3.589988e-04 -0.000342 -0.000426 \n", "6 -0.000208 -0.000212 -0.000126 -0.000265 -2.926290e-04 -0.000266 -0.000376 \n", "7 0.001013 0.000691 0.001310 0.001097 5.598068e-04 0.000501 0.000827 \n", "8 -0.000153 -0.000324 -0.000168 -0.000147 -4.823208e-04 -0.000211 -0.000401 \n", "9 -0.000028 -0.000248 -0.000283 -0.000314 -1.890063e-04 -0.000022 -0.000029 \n", "10 -0.000150 -0.000256 -0.000124 -0.000179 -3.057718e-04 -0.000183 -0.000252 \n", "11 0.007595 -0.001943 0.001598 0.007838 2.528191e-03 0.007137 0.009534 \n", "12 -0.000113 -0.000070 -0.000076 -0.000159 -5.897880e-05 -0.000117 -0.000182 \n", "13 -0.000132 -0.000127 -0.000102 -0.000146 -2.127588e-04 -0.000168 -0.000276 \n", "14 -0.001688 -0.000346 -0.000528 -0.000894 -4.293919e-04 -0.001309 -0.000301 \n", "15 -0.000017 0.000019 -0.000002 -0.000036 -1.719594e-05 -0.000026 -0.000037 \n", "16 -0.000306 -0.000294 -0.000207 -0.000237 -3.867447e-04 -0.000289 -0.000498 \n", "17 -0.000664 -0.000741 -0.000528 -0.000512 -7.908940e-04 -0.000607 -0.000990 \n", "18 -0.000710 -0.000789 -0.000506 -0.000457 -9.316206e-04 -0.000666 -0.001122 \n", "19 -0.000010 0.000030 0.000017 -0.000020 1.147389e-04 -0.000005 -0.000017 \n", "20 -0.000253 -0.000310 -0.000204 -0.000172 -3.470182e-04 -0.000220 -0.000371 \n", "21 -0.000090 -0.000128 -0.000076 -0.000053 -2.347827e-04 -0.000090 -0.000214 \n", "22 0.000040 0.000018 0.000047 0.000075 2.750456e-04 0.000065 0.000258 \n", "23 0.001423 0.001525 0.001257 0.001919 5.420089e-03 0.002166 0.005263 \n", "24 0.001263 0.001008 -0.000117 0.001218 2.799034e-04 0.001503 0.001372 \n", "25 -0.003744 -0.003384 -0.003265 -0.003253 1.600981e-03 -0.001915 -0.002728 \n", "26 -0.000079 -0.000044 -0.000061 -0.000042 -4.082918e-05 -0.000022 -0.000060 \n", "27 0.001431 0.003103 0.001631 0.000257 2.081454e-03 0.000913 0.001677 \n", "28 0.000536 0.000167 0.000397 0.000507 1.705885e-04 0.000503 0.000481 \n", "29 0.002733 -0.000099 0.000631 0.001469 4.820824e-04 0.002098 -0.000244 \n", "30 0.000134 -0.000137 -0.000068 0.000125 4.252791e-05 0.000117 0.000137 \n", "31 0.000489 0.000618 0.000236 0.000523 6.991625e-04 0.000155 0.000536 \n", "32 -0.000290 -0.000174 -0.000026 -0.000154 -3.945976e-04 -0.000396 -0.000499 \n", "33 0.007980 0.015211 0.007929 0.010251 2.181430e-02 0.009214 0.024373 \n", "34 0.001084 -0.001867 -0.003804 0.003472 -1.005411e-03 0.003400 0.001397 \n", "35 0.007376 0.011271 0.019305 -0.002886 2.035785e-02 -0.001660 0.006990 \n", "36 0.000641 -0.002239 -0.000491 0.001610 -6.132126e-04 0.002603 0.000714 \n", "37 0.028524 0.009958 0.013799 0.007597 2.641273e-02 0.009691 0.026180 \n", "38 0.000450 -0.000501 -0.000243 -0.001328 4.479885e-04 0.000544 0.002848 \n", "39 -0.050647 -0.030112 -0.021851 -0.012187 -2.968431e-02 -0.037312 -0.067244 \n", "40 -0.000453 -0.001597 -0.000798 -0.001882 1.063347e-04 0.000106 0.000086 \n", "41 0.010849 -0.004994 0.003090 -0.004843 -9.270430e-03 0.008755 0.011827 \n", "42 -0.000050 -0.000534 0.000095 0.000409 -3.244877e-04 0.000842 0.000388 \n", 
"43 0.019083 0.010606 0.000387 0.003512 7.757962e-03 0.011860 0.019928 \n", "44 0.000374 -0.000416 -0.000871 0.003142 -1.690984e-03 0.001557 0.002596 \n", "45 -0.000747 0.041849 0.022034 -0.024061 4.308212e-02 0.005673 -0.007195 \n", "46 0.000118 0.000335 0.000174 0.000136 8.502305e-04 0.000165 0.000544 \n", "47 -0.000091 -0.000512 -0.000240 -0.000283 -1.361221e-04 -0.000274 -0.000745 \n", "\n", "[48 rows x 8870 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_parquet(\"vep_vcf_results.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can pass tsv file with following format where first 4 columns are `chrom`, `pos`, `ref`, `alt`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chrom pos ref alt\n", "chr1 1000018 G A\n", "chr1 1002308 T C\n", "chr1 109727471 A C\n", "chr1 109728286 TTT G\n", "chr1 109728807 T GG\n" ] } ], "source": [ "! cat data/variants.tsv | column -t -s $'\\t' " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can only run predictions for the variants closer to tss than 100kbp anyway these are the ones likely to be most impactful on the gene expression." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "decima.vep - INFO - Using device: cuda and genome: hg38\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mcelikm5\u001b[0m (\u001b[33mcelikm5-genentech\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1276.1MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.1 (1916.9MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1417.4MB/s)\n", "Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:76: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: PossibleUserWarning: The 'predict_dataloader' does not have many workers which may be a bottleneck. 
Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.\n", "Predicting DataLoader 0: 100%|██████████████████| 66/66 [00:13<00:00, 4.88it/s]\n", "decima.vep - INFO - Warnings:\n", "decima.vep - INFO - allele_mismatch_with_reference_genome: 10 alleles out of 33 predictions mismatched with the genome file /home/celikm5/.local/share/genomes/hg38/hg38.fa.If this is not expected, please check if you are using the correct genome version.\n", "\u001b[0m" ] } ], "source": [ "! decima vep -v \"data/variants.tsv\" -o \"vep_results.parquet\" --max-distance 100_000 --max-distance-type \"tss\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have already have mapping genes and variant, you can use this mapping so predictions only will be conducted between this pairs." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "decima.vep - INFO - Using device: cuda and genome: hg38\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mcelikm5\u001b[0m (\u001b[33mcelikm5-genentech\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1311.1MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.1 (1900.4MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1475.9MB/s)\n", "Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:76: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: PossibleUserWarning: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.\n", "Predicting DataLoader 0: 100%|████████████████████| 4/4 [00:01<00:00, 3.79it/s]\n", "\u001b[0m" ] } ], "source": [ "! decima vep -v \"data/variants_gene.tsv\" -o \"vep_gene_results.parquet\" --gene-col \"gene\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " chrom pos ref alt gene start end strand gene_mask_start \\\n", "0 chr1 1000018 G A ISG15 837298 1361586 + 163840 \n", "1 chr1 1002308 T C ISG15 837298 1361586 + 163840 \n", "\n", " gene_mask_end ... agg_9528 agg_9529 agg_9530 agg_9531 agg_9532 \\\n", "0 177242 ... -0.000746 0.002301 0.005067 0.000135 -0.000559 \n", "1 177242 ... 0.007558 0.004934 0.007966 0.007595 -0.001943 \n", "\n", " agg_9533 agg_9535 agg_9536 agg_9537 agg_9538 \n", "0 -0.003155 0.006503 -0.001059 -0.000566 0.000964 \n", "1 0.001598 0.007838 0.002528 0.007137 0.009534 \n", "\n", "[2 rows x 8870 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_parquet(\"vep_gene_results.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vep api reads n (default=10_000) number of variants from vcf file performs predictions on these variants, saves them to parquet file then performs predictios for next next chuck. You can change chucksize:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "decima.vep - INFO - Using device: cuda and genome: hg38\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mcelikm5\u001b[0m (\u001b[33mcelikm5-genentech\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1365.3MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.2 (1851.4MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1323.7MB/s)\n", "Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:76: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: PossibleUserWarning: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.\n", "Predicting DataLoader 0: 100%|██████████████████| 44/44 [00:09<00:00, 4.86it/s]\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... 
\n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1340.8MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.1 (1944.0MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.4 (1576.5MB/s)\n", "Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: PossibleUserWarning: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.\n", "Predicting DataLoader 0: 100%|██████████████████| 26/26 [00:05<00:00, 4.98it/s]\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1359.5MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.1 (1939.7MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1425.5MB/s)\n", "Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: PossibleUserWarning: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.\n", "Predicting DataLoader 0: 100%|██████████████████| 26/26 [00:05<00:00, 4.97it/s]\n", "decima.vep - INFO - Warnings:\n", "decima.vep - INFO - allele_mismatch_with_reference_genome: 26 alleles out of 48 predictions mismatched with the genome file /home/celikm5/.local/share/genomes/hg38/hg38.fa.If this is not expected, please check if you are using the correct genome version.\n", "\u001b[0m" ] } ], "source": [ "! decima vep -v \"data/sample.vcf\" -o \"vep_vcf_results.parquet\" --chunksize 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, variant effect prediction can be performed using the Python API as well." 
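 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick orientation before the detailed walk-through below, the next cell is a minimal sketch of the basic pattern: build a dataframe with `chrom`, `pos`, `ref` and `alt` columns (the variants here are the same ones as in `data/variants.tsv`) and pass it to `predict_variant_effect`, which returns a dataframe of per-gene, per-cell-type effect scores. Optional arguments such as `tasks`, `model` and `include_cols` are covered in the following cells." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from decima.vep import predict_variant_effect\n", "\n", "# Minimal sketch: any dataframe with chrom/pos/ref/alt columns can be used as input.\n", "df = pd.DataFrame(\n", "    {\n", "        \"chrom\": [\"chr1\", \"chr1\"],\n", "        \"pos\": [1000018, 1002308],\n", "        \"ref\": [\"G\", \"T\"],\n", "        \"alt\": [\"A\", \"C\"],\n", "    }\n", ")\n", "\n", "# One row per (variant, nearby gene) pair, with per-cell-type effect columns.\n", "scores = predict_variant_effect(df)\n", "scores.head()"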
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import torch\n", "from decima.vep import predict_variant_effect\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "   chrom        pos  ref alt\n", "0   chr1    1000018    G   A\n", "1   chr1    1002308    T   C\n", "2   chr1  109727471    A   C\n", "3   chr1  109728286  TTT   G\n", "4   chr1  109728807    T  GG" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_variant = pd.read_table(\"data/variants.tsv\")\n", "df_variant" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simply pass your dataframe to the `predict_variant_effect` function, which returns a dataframe of predictions. You can pass a `tasks` query to subset predictions to specific cell types. By default, the Decima model for replicate 0 is used; pass `model=1`, `2` or `3` to use another replicate, or pass your own custom model. If you pass the `include_cols` argument, those input columns are kept in the output. To further filter variants based on their distance to the TSS, use the `max_dist_tss` argument." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mcelikm5\u001b[0m (\u001b[33mcelikm5-genentech\u001b[0m) to \u001b[32mhttps://api.wandb.ai\u001b[0m. Use \u001b[1m`wandb login --relogin`\u001b[0m to force relogin\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1339.8MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.2 (1791.0MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1495.3MB/s)\n", "Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:76: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "a73a894cc53a4b2f98c395e3b8c6bf72", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Predicting: |          | 0/? 
[00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "" ], "text/plain": [ " chrom pos ref alt gene start end strand \\\n", "0 chr1 1000018 G A FAM41C 516455 1040743 - \n", "1 chr1 1002308 T C FAM41C 516455 1040743 - \n", "2 chr1 1000018 G A NOC2L 598861 1123149 - \n", "3 chr1 1002308 T C NOC2L 598861 1123149 - \n", "4 chr1 1000018 G A PERM1 621645 1145933 - \n", ".. ... ... ... .. ... ... ... ... \n", "77 chr1 109728286 TTT G GSTM5 109547940 110072228 + \n", "78 chr1 109728807 T GG GSTM5 109547940 110072228 + \n", "79 chr1 109727471 A C ALX3 109710224 110234512 - \n", "80 chr1 109728286 TTT G ALX3 109710224 110234512 - \n", "81 chr1 109728807 T GG ALX3 109710224 110234512 - \n", "\n", " gene_mask_start gene_mask_end ... agg_9528 agg_9529 agg_9530 \\\n", "0 163840 172672 ... -0.003487 -0.006149 -0.003175 \n", "1 163840 172672 ... -0.000096 -0.000190 -0.000135 \n", "2 163840 178946 ... -0.002595 -0.004256 -0.002759 \n", "3 163840 178946 ... -0.000587 -0.000894 -0.000563 \n", "4 163840 170729 ... -0.002933 -0.004738 -0.003541 \n", ".. ... ... ... ... ... ... \n", "77 163840 227488 ... -0.004391 0.004506 -0.017141 \n", "78 163840 227488 ... -0.129386 -0.144098 -0.077484 \n", "79 163840 174642 ... 0.000218 0.000532 0.000263 \n", "80 163840 174642 ... -0.000218 -0.000109 -0.000204 \n", "81 163840 174642 ... -0.001127 -0.002115 -0.001278 \n", "\n", " agg_9531 agg_9532 agg_9533 agg_9535 agg_9536 agg_9537 agg_9538 \n", "0 -0.002726 -0.003147 -0.001746 -0.002291 -0.005283 -0.003393 -0.006694 \n", "1 -0.000043 -0.000076 -0.000072 -0.000106 0.000029 -0.000055 -0.000119 \n", "2 -0.001601 -0.002601 -0.001238 -0.001756 -0.002918 -0.001914 -0.003244 \n", "3 -0.000324 -0.000471 -0.000258 -0.000382 -0.000683 -0.000459 -0.000757 \n", "4 -0.001943 -0.002958 -0.001370 -0.002272 -0.003326 -0.002195 -0.003648 \n", ".. ... ... ... ... ... ... ... \n", "77 0.000025 0.042810 0.024843 -0.022073 0.043973 0.005934 -0.006544 \n", "78 -0.091391 -0.104009 -0.034113 -0.096020 -0.063028 -0.079460 -0.102111 \n", "79 0.000122 0.000352 0.000184 0.000139 0.000887 0.000173 0.000558 \n", "80 0.000009 -0.000265 -0.000104 -0.000189 0.000523 -0.000149 -0.000385 \n", "81 -0.000581 -0.001344 -0.000724 -0.000569 -0.003553 -0.000857 -0.002281 \n", "\n", "[82 rows x 8870 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict_variant_effect(df_variant)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can predict and save predictions to file similar to CLI api based on dataframe." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1289.8MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.2 (1774.2MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1430.2MB/s)\n", "Using default `ModelCheckpoint`. 
Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f9c815b42b1b4ba6ad1edb5427a53db6", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Predicting: | | 0/? [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromposrefaltgenestartendstrandgene_mask_startgene_mask_end...agg_9528agg_9529agg_9530agg_9531agg_9532agg_9533agg_9535agg_9536agg_9537agg_9538
0chr11000018GAFAM41C5164551040743-163840172672...-0.003487-0.006149-0.003175-0.002726-0.003147-0.001746-0.002291-0.005283-0.003393-0.006694
1chr11002308TCFAM41C5164551040743-163840172672...-0.000096-0.000190-0.000135-0.000043-0.000076-0.000072-0.0001060.000029-0.000055-0.000119
2chr11000018GANOC2L5988611123149-163840178946...-0.002595-0.004256-0.002759-0.001601-0.002601-0.001238-0.001756-0.002918-0.001914-0.003244
3chr11002308TCNOC2L5988611123149-163840178946...-0.000587-0.000894-0.000563-0.000324-0.000471-0.000258-0.000382-0.000683-0.000459-0.000757
4chr11000018GAPERM16216451145933-163840170729...-0.002933-0.004738-0.003541-0.001943-0.002958-0.001370-0.002272-0.003326-0.002195-0.003648
..................................................................
77chr1109728286TTTGGSTM5109547940110072228+163840227488...-0.0043910.004506-0.0171410.0000250.0428100.024843-0.0220730.0439730.005934-0.006544
78chr1109728807TGGGSTM5109547940110072228+163840227488...-0.129386-0.144098-0.077484-0.091391-0.104009-0.034113-0.096020-0.063028-0.079460-0.102111
79chr1109727471ACALX3109710224110234512-163840174642...0.0002180.0005320.0002630.0001220.0003520.0001840.0001390.0008870.0001730.000558
80chr1109728286TTTGALX3109710224110234512-163840174642...-0.000218-0.000109-0.0002040.000009-0.000265-0.000104-0.0001890.000523-0.000149-0.000385
81chr1109728807TGGALX3109710224110234512-163840174642...-0.001127-0.002115-0.001278-0.000581-0.001344-0.000724-0.000569-0.003553-0.000857-0.002281
\n", "

82 rows × 8870 columns

\n", "" ], "text/plain": [ " chrom pos ref alt gene start end strand \\\n", "0 chr1 1000018 G A FAM41C 516455 1040743 - \n", "1 chr1 1002308 T C FAM41C 516455 1040743 - \n", "2 chr1 1000018 G A NOC2L 598861 1123149 - \n", "3 chr1 1002308 T C NOC2L 598861 1123149 - \n", "4 chr1 1000018 G A PERM1 621645 1145933 - \n", ".. ... ... ... .. ... ... ... ... \n", "77 chr1 109728286 TTT G GSTM5 109547940 110072228 + \n", "78 chr1 109728807 T GG GSTM5 109547940 110072228 + \n", "79 chr1 109727471 A C ALX3 109710224 110234512 - \n", "80 chr1 109728286 TTT G ALX3 109710224 110234512 - \n", "81 chr1 109728807 T GG ALX3 109710224 110234512 - \n", "\n", " gene_mask_start gene_mask_end ... agg_9528 agg_9529 agg_9530 \\\n", "0 163840 172672 ... -0.003487 -0.006149 -0.003175 \n", "1 163840 172672 ... -0.000096 -0.000190 -0.000135 \n", "2 163840 178946 ... -0.002595 -0.004256 -0.002759 \n", "3 163840 178946 ... -0.000587 -0.000894 -0.000563 \n", "4 163840 170729 ... -0.002933 -0.004738 -0.003541 \n", ".. ... ... ... ... ... ... \n", "77 163840 227488 ... -0.004391 0.004506 -0.017141 \n", "78 163840 227488 ... -0.129386 -0.144098 -0.077484 \n", "79 163840 174642 ... 0.000218 0.000532 0.000263 \n", "80 163840 174642 ... -0.000218 -0.000109 -0.000204 \n", "81 163840 174642 ... -0.001127 -0.002115 -0.001278 \n", "\n", " agg_9531 agg_9532 agg_9533 agg_9535 agg_9536 agg_9537 agg_9538 \n", "0 -0.002726 -0.003147 -0.001746 -0.002291 -0.005283 -0.003393 -0.006694 \n", "1 -0.000043 -0.000076 -0.000072 -0.000106 0.000029 -0.000055 -0.000119 \n", "2 -0.001601 -0.002601 -0.001238 -0.001756 -0.002918 -0.001914 -0.003244 \n", "3 -0.000324 -0.000471 -0.000258 -0.000382 -0.000683 -0.000459 -0.000757 \n", "4 -0.001943 -0.002958 -0.001370 -0.002272 -0.003326 -0.002195 -0.003648 \n", ".. ... ... ... ... ... ... ... \n", "77 0.000025 0.042810 0.024843 -0.022073 0.043973 0.005934 -0.006544 \n", "78 -0.091391 -0.104009 -0.034113 -0.096020 -0.063028 -0.079460 -0.102111 \n", "79 0.000122 0.000352 0.000184 0.000139 0.000887 0.000173 0.000558 \n", "80 0.000009 -0.000265 -0.000104 -0.000189 0.000523 -0.000149 -0.000385 \n", "81 -0.000581 -0.001344 -0.000724 -0.000569 -0.003553 -0.000857 -0.002281 \n", "\n", "[82 rows x 8870 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_parquet(\"vep_results_py.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or variant effect can be performed on vcf file." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1272.2MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.2 (1862.4MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1514.0MB/s)\n", "Using default `ModelCheckpoint`. 
Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6c19dcc02b5f43c89ea51643583f352d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Predicting: | | 0/? [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromposrefaltgenestartendstrandgene_mask_startgene_mask_end...agg_9528agg_9529agg_9530agg_9531agg_9532agg_9533agg_9535agg_9536agg_9537agg_9538
0chr11002308TCFAM41C5164551040743-163840172672...-0.000096-0.000190-0.000135-0.000043-0.000076-0.000072-0.0001060.000029-0.000055-0.000119
1chr11002308TCNOC2L5988611123149-163840178946...-0.000587-0.000894-0.000563-0.000324-0.000471-0.000258-0.000382-0.000683-0.000459-0.000757
2chr11002308TCPERM16216451145933-163840170729...-0.000845-0.001181-0.000784-0.000442-0.000589-0.000313-0.000526-0.000856-0.000637-0.000881
3chr11002308TCHES46397241164012-163840165050...-0.001767-0.002189-0.001705-0.001091-0.001184-0.000717-0.001306-0.001550-0.001349-0.001550
4chr11002308TCFAM87B6535311177819+163840166306...-0.0000980.000052-0.000040-0.0000510.0000370.000028-0.0000240.000093-0.000080-0.000203
5chr11002308TCRNF2237138581238146-163840167179...-0.000603-0.000860-0.000584-0.000328-0.000361-0.000211-0.000395-0.000428-0.000371-0.000468
6chr11002308TCC1orf1597559131280201-163840198383...-0.000358-0.000528-0.000338-0.000193-0.000205-0.000113-0.000245-0.000265-0.000241-0.000347
7chr11002308TCSAMD117600881284376+163840184493...0.0013380.0008490.0008670.0009030.0011980.0015150.0011500.0010190.0004980.001034
8chr11002308TCKLHL177967441321032+163840168975...-0.000518-0.000364-0.000522-0.000307-0.000214-0.000158-0.000330-0.000363-0.000348-0.000509
9chr11002308TCPLEKHN18026421326930+163840173223...0.0001090.0000110.0000540.000009-0.000056-0.000199-0.000207-0.0001120.0000430.000051
10chr11002308TCTTLL10-AS18191071343395-163840170339...-0.000278-0.000460-0.000305-0.000142-0.000220-0.000106-0.000170-0.000272-0.000170-0.000231
11chr11002308TCISG158372981361586+163840177242...0.0078920.0052660.0083240.007764-0.0015800.0024680.0086560.0029250.0074850.010022
12chr11002308TCTNFRSF188461441370432-163840166924...-0.000308-0.000433-0.000336-0.000197-0.000195-0.000144-0.000220-0.000190-0.000199-0.000302
13chr11002308TCTNFRSF48537051377993-163840166653...-0.000255-0.000385-0.000261-0.000158-0.000164-0.000123-0.000158-0.000281-0.000191-0.000319
14chr11002308TCAGRN8562801380568+163840199838...-0.000901-0.001054-0.000771-0.001977-0.000437-0.000555-0.001084-0.000583-0.001613-0.000536
15chr11002308TCSDF48716191395907-163840178999...-0.000146-0.000230-0.000180-0.000099-0.000132-0.000082-0.000092-0.000189-0.000112-0.000195
16chr11002308TCC1QTNF128862741410562-163840168116...-0.000458-0.000697-0.000491-0.000359-0.000367-0.000241-0.000268-0.000487-0.000344-0.000595
17chr11002308TCUBE2J29134371437725-163840183816...-0.000815-0.001141-0.000897-0.000674-0.000717-0.000515-0.000520-0.000780-0.000614-0.001010
18chr11002308TCACAP39491611473449-163840181059...-0.000748-0.001163-0.000843-0.000615-0.000704-0.000456-0.000404-0.000761-0.000572-0.000986
19chr11002308TCINTS119642431488531-163840176946...0.0000210.0000080.0000370.0000350.0000960.0000570.0000090.0001840.0000390.000044
20chr11002308TCDVL19889701513258-163840177982...-0.000325-0.000491-0.000359-0.000260-0.000322-0.000207-0.000179-0.000336-0.000235-0.000382
21chr11002308TCMXRA810013291525617-163840172928...-0.000077-0.000098-0.000090-0.000067-0.000092-0.000057-0.000034-0.000212-0.000063-0.000159
22chr1109727471ACGNAT2109259481109783769-163840180515...0.0001030.0000700.0000600.0000380.0000270.0000550.0000780.0002770.0000620.000245
23chr1109728807TTTGGNAT2109259481109783769-163840180515...0.0033170.0051640.0026980.0014380.0015580.0012730.0019370.0054840.0021900.005321
24chr1109727471ACSYPL2109302706109826994+163840179428...0.0022250.0016880.0013660.0012510.0013420.0001770.0014020.0004800.0016630.001764
25chr1109728807TTTGSYPL2109302706109826994+163840179428...-0.004184-0.003700-0.003180-0.004169-0.003774-0.003399-0.0033820.000949-0.002237-0.002981
26chr1109727471ACATXN7L2109319639109843927+163840173165...-0.000055-0.0000360.000088-0.0000850.000037-0.000040-0.000056-0.000044-0.000046-0.000027
27chr1109728807TTTGATXN7L2109319639109843927+163840173165...0.000644-0.0000630.0017060.0014860.0031620.0016710.0002990.0021020.0010070.001831
28chr1109727471ACCYB561D1109330212109854500+163840172720...0.0005330.0005390.0005850.0004740.0000760.0002800.000422-0.0000330.0004360.000369
29chr1109728807TTTGCYB561D1109330212109854500+163840172720...0.0039020.0029630.0016090.0028410.0000240.0007260.0015820.0004970.002237-0.000092
30chr1109727471ACGPR61109376032109900320+163840172374...0.0001440.0002440.0001010.000130-0.000119-0.0000660.0001120.0000590.0001130.000135
31chr1109728807TTTGGPR61109376032109900320+163840172374...0.0002870.0008690.0003380.0005160.0007020.0002600.0005470.0008150.0001900.000561
32chr1109727471ACGSTM3109380590109904878-163840170946...-0.000685-0.000671-0.000490-0.000305-0.000203-0.000045-0.000183-0.000423-0.000412-0.000544
33chr1109728807TTTGGSTM3109380590109904878-163840170946...0.0148590.0291830.0153640.0079960.0152550.0079500.0102780.0218510.0092300.024402
34chr1109727471ACGNAI3109384775109909063+163840233549...0.0022940.0029540.0037620.001515-0.001879-0.0036380.004167-0.0009770.0035680.001709
35chr1109728807TTTGGNAI3109384775109909063+163840233549...-0.002409-0.0021810.0015150.0075040.0112760.019385-0.0029450.020405-0.0015690.007003
36chr1109727471ACAMPD2109452264109976552+163840179789...0.0005650.0012650.001679-0.000034-0.002686-0.0008780.001329-0.0011120.0020590.000288
37chr1109728807TTTGAMPD2109452264109976552+163840179789...0.0155210.0052210.0116600.0288070.0103680.0140500.0078100.0265530.0099140.026500
38chr1109727471ACGSTM4109492259110016547+163840182577...-0.0001400.0013700.0007550.000528-0.000446-0.000184-0.0010620.0005930.0005750.002940
39chr1109728807TTTGGSTM4109492259110016547+163840182577...-0.032480-0.033616-0.029502-0.050562-0.030190-0.021981-0.012192-0.029633-0.037296-0.067220
40chr1109727471ACGSTM2109504182110028470+163840205369...-0.001281-0.001583-0.000117-0.000501-0.001837-0.001011-0.0022220.0000310.0000680.000062
41chr1109728807TTTGGSTM2109504182110028470+163840205369...0.0063890.0103910.0082920.010647-0.0055110.002638-0.004853-0.0098290.0085420.011590
42chr1109727471ACGSTM1109523974110048262+163840185065...0.0010770.0023050.001138-0.000056-0.0007380.0000590.000456-0.0003960.0009040.000512
43chr1109728807TTTGGSTM1109523974110048262+163840185065...0.0095870.0121550.0027690.0189080.0106020.0003020.0032310.0076300.0117150.019595
44chr1109727471ACGSTM5109547940110072228+163840227488...0.0019780.0013950.0015010.000012-0.000538-0.0009140.002817-0.0019980.0011490.002069
45chr1109728807TTTGGSTM5109547940110072228+163840227488...-0.0048030.004060-0.018828-0.0003830.0423740.022364-0.0234500.0433530.006033-0.006614
46chr1109727471ACALX3109710224110234512-163840174642...0.0002180.0005320.0002630.0001220.0003520.0001840.0001390.0008870.0001730.000558
47chr1109728807TTTGALX3109710224110234512-163840174642...-0.000386-0.000480-0.000415-0.000092-0.000513-0.000241-0.000285-0.000141-0.000277-0.000758
\n", "

48 rows × 8870 columns

\n", "" ], "text/plain": [ " chrom pos ref alt gene start end strand \\\n", "0 chr1 1002308 T C FAM41C 516455 1040743 - \n", "1 chr1 1002308 T C NOC2L 598861 1123149 - \n", "2 chr1 1002308 T C PERM1 621645 1145933 - \n", "3 chr1 1002308 T C HES4 639724 1164012 - \n", "4 chr1 1002308 T C FAM87B 653531 1177819 + \n", "5 chr1 1002308 T C RNF223 713858 1238146 - \n", "6 chr1 1002308 T C C1orf159 755913 1280201 - \n", "7 chr1 1002308 T C SAMD11 760088 1284376 + \n", "8 chr1 1002308 T C KLHL17 796744 1321032 + \n", "9 chr1 1002308 T C PLEKHN1 802642 1326930 + \n", "10 chr1 1002308 T C TTLL10-AS1 819107 1343395 - \n", "11 chr1 1002308 T C ISG15 837298 1361586 + \n", "12 chr1 1002308 T C TNFRSF18 846144 1370432 - \n", "13 chr1 1002308 T C TNFRSF4 853705 1377993 - \n", "14 chr1 1002308 T C AGRN 856280 1380568 + \n", "15 chr1 1002308 T C SDF4 871619 1395907 - \n", "16 chr1 1002308 T C C1QTNF12 886274 1410562 - \n", "17 chr1 1002308 T C UBE2J2 913437 1437725 - \n", "18 chr1 1002308 T C ACAP3 949161 1473449 - \n", "19 chr1 1002308 T C INTS11 964243 1488531 - \n", "20 chr1 1002308 T C DVL1 988970 1513258 - \n", "21 chr1 1002308 T C MXRA8 1001329 1525617 - \n", "22 chr1 109727471 A C GNAT2 109259481 109783769 - \n", "23 chr1 109728807 TTT G GNAT2 109259481 109783769 - \n", "24 chr1 109727471 A C SYPL2 109302706 109826994 + \n", "25 chr1 109728807 TTT G SYPL2 109302706 109826994 + \n", "26 chr1 109727471 A C ATXN7L2 109319639 109843927 + \n", "27 chr1 109728807 TTT G ATXN7L2 109319639 109843927 + \n", "28 chr1 109727471 A C CYB561D1 109330212 109854500 + \n", "29 chr1 109728807 TTT G CYB561D1 109330212 109854500 + \n", "30 chr1 109727471 A C GPR61 109376032 109900320 + \n", "31 chr1 109728807 TTT G GPR61 109376032 109900320 + \n", "32 chr1 109727471 A C GSTM3 109380590 109904878 - \n", "33 chr1 109728807 TTT G GSTM3 109380590 109904878 - \n", "34 chr1 109727471 A C GNAI3 109384775 109909063 + \n", "35 chr1 109728807 TTT G GNAI3 109384775 109909063 + \n", "36 chr1 109727471 A C AMPD2 109452264 109976552 + \n", "37 chr1 109728807 TTT G AMPD2 109452264 109976552 + \n", "38 chr1 109727471 A C GSTM4 109492259 110016547 + \n", "39 chr1 109728807 TTT G GSTM4 109492259 110016547 + \n", "40 chr1 109727471 A C GSTM2 109504182 110028470 + \n", "41 chr1 109728807 TTT G GSTM2 109504182 110028470 + \n", "42 chr1 109727471 A C GSTM1 109523974 110048262 + \n", "43 chr1 109728807 TTT G GSTM1 109523974 110048262 + \n", "44 chr1 109727471 A C GSTM5 109547940 110072228 + \n", "45 chr1 109728807 TTT G GSTM5 109547940 110072228 + \n", "46 chr1 109727471 A C ALX3 109710224 110234512 - \n", "47 chr1 109728807 TTT G ALX3 109710224 110234512 - \n", "\n", " gene_mask_start gene_mask_end ... agg_9528 agg_9529 agg_9530 \\\n", "0 163840 172672 ... -0.000096 -0.000190 -0.000135 \n", "1 163840 178946 ... -0.000587 -0.000894 -0.000563 \n", "2 163840 170729 ... -0.000845 -0.001181 -0.000784 \n", "3 163840 165050 ... -0.001767 -0.002189 -0.001705 \n", "4 163840 166306 ... -0.000098 0.000052 -0.000040 \n", "5 163840 167179 ... -0.000603 -0.000860 -0.000584 \n", "6 163840 198383 ... -0.000358 -0.000528 -0.000338 \n", "7 163840 184493 ... 0.001338 0.000849 0.000867 \n", "8 163840 168975 ... -0.000518 -0.000364 -0.000522 \n", "9 163840 173223 ... 0.000109 0.000011 0.000054 \n", "10 163840 170339 ... -0.000278 -0.000460 -0.000305 \n", "11 163840 177242 ... 0.007892 0.005266 0.008324 \n", "12 163840 166924 ... -0.000308 -0.000433 -0.000336 \n", "13 163840 166653 ... -0.000255 -0.000385 -0.000261 \n", "14 163840 199838 ... 
-0.000901 -0.001054 -0.000771 \n", "15 163840 178999 ... -0.000146 -0.000230 -0.000180 \n", "16 163840 168116 ... -0.000458 -0.000697 -0.000491 \n", "17 163840 183816 ... -0.000815 -0.001141 -0.000897 \n", "18 163840 181059 ... -0.000748 -0.001163 -0.000843 \n", "19 163840 176946 ... 0.000021 0.000008 0.000037 \n", "20 163840 177982 ... -0.000325 -0.000491 -0.000359 \n", "21 163840 172928 ... -0.000077 -0.000098 -0.000090 \n", "22 163840 180515 ... 0.000103 0.000070 0.000060 \n", "23 163840 180515 ... 0.003317 0.005164 0.002698 \n", "24 163840 179428 ... 0.002225 0.001688 0.001366 \n", "25 163840 179428 ... -0.004184 -0.003700 -0.003180 \n", "26 163840 173165 ... -0.000055 -0.000036 0.000088 \n", "27 163840 173165 ... 0.000644 -0.000063 0.001706 \n", "28 163840 172720 ... 0.000533 0.000539 0.000585 \n", "29 163840 172720 ... 0.003902 0.002963 0.001609 \n", "30 163840 172374 ... 0.000144 0.000244 0.000101 \n", "31 163840 172374 ... 0.000287 0.000869 0.000338 \n", "32 163840 170946 ... -0.000685 -0.000671 -0.000490 \n", "33 163840 170946 ... 0.014859 0.029183 0.015364 \n", "34 163840 233549 ... 0.002294 0.002954 0.003762 \n", "35 163840 233549 ... -0.002409 -0.002181 0.001515 \n", "36 163840 179789 ... 0.000565 0.001265 0.001679 \n", "37 163840 179789 ... 0.015521 0.005221 0.011660 \n", "38 163840 182577 ... -0.000140 0.001370 0.000755 \n", "39 163840 182577 ... -0.032480 -0.033616 -0.029502 \n", "40 163840 205369 ... -0.001281 -0.001583 -0.000117 \n", "41 163840 205369 ... 0.006389 0.010391 0.008292 \n", "42 163840 185065 ... 0.001077 0.002305 0.001138 \n", "43 163840 185065 ... 0.009587 0.012155 0.002769 \n", "44 163840 227488 ... 0.001978 0.001395 0.001501 \n", "45 163840 227488 ... -0.004803 0.004060 -0.018828 \n", "46 163840 174642 ... 0.000218 0.000532 0.000263 \n", "47 163840 174642 ... 
-0.000386 -0.000480 -0.000415 \n", "\n", " agg_9531 agg_9532 agg_9533 agg_9535 agg_9536 agg_9537 agg_9538 \n", "0 -0.000043 -0.000076 -0.000072 -0.000106 0.000029 -0.000055 -0.000119 \n", "1 -0.000324 -0.000471 -0.000258 -0.000382 -0.000683 -0.000459 -0.000757 \n", "2 -0.000442 -0.000589 -0.000313 -0.000526 -0.000856 -0.000637 -0.000881 \n", "3 -0.001091 -0.001184 -0.000717 -0.001306 -0.001550 -0.001349 -0.001550 \n", "4 -0.000051 0.000037 0.000028 -0.000024 0.000093 -0.000080 -0.000203 \n", "5 -0.000328 -0.000361 -0.000211 -0.000395 -0.000428 -0.000371 -0.000468 \n", "6 -0.000193 -0.000205 -0.000113 -0.000245 -0.000265 -0.000241 -0.000347 \n", "7 0.000903 0.001198 0.001515 0.001150 0.001019 0.000498 0.001034 \n", "8 -0.000307 -0.000214 -0.000158 -0.000330 -0.000363 -0.000348 -0.000509 \n", "9 0.000009 -0.000056 -0.000199 -0.000207 -0.000112 0.000043 0.000051 \n", "10 -0.000142 -0.000220 -0.000106 -0.000170 -0.000272 -0.000170 -0.000231 \n", "11 0.007764 -0.001580 0.002468 0.008656 0.002925 0.007485 0.010022 \n", "12 -0.000197 -0.000195 -0.000144 -0.000220 -0.000190 -0.000199 -0.000302 \n", "13 -0.000158 -0.000164 -0.000123 -0.000158 -0.000281 -0.000191 -0.000319 \n", "14 -0.001977 -0.000437 -0.000555 -0.001084 -0.000583 -0.001613 -0.000536 \n", "15 -0.000099 -0.000132 -0.000082 -0.000092 -0.000189 -0.000112 -0.000195 \n", "16 -0.000359 -0.000367 -0.000241 -0.000268 -0.000487 -0.000344 -0.000595 \n", "17 -0.000674 -0.000717 -0.000515 -0.000520 -0.000780 -0.000614 -0.001010 \n", "18 -0.000615 -0.000704 -0.000456 -0.000404 -0.000761 -0.000572 -0.000986 \n", "19 0.000035 0.000096 0.000057 0.000009 0.000184 0.000039 0.000044 \n", "20 -0.000260 -0.000322 -0.000207 -0.000179 -0.000336 -0.000235 -0.000382 \n", "21 -0.000067 -0.000092 -0.000057 -0.000034 -0.000212 -0.000063 -0.000159 \n", "22 0.000038 0.000027 0.000055 0.000078 0.000277 0.000062 0.000245 \n", "23 0.001438 0.001558 0.001273 0.001937 0.005484 0.002190 0.005321 \n", "24 0.001251 0.001342 0.000177 0.001402 0.000480 0.001663 0.001764 \n", "25 -0.004169 -0.003774 -0.003399 -0.003382 0.000949 -0.002237 -0.002981 \n", "26 -0.000085 0.000037 -0.000040 -0.000056 -0.000044 -0.000046 -0.000027 \n", "27 0.001486 0.003162 0.001671 0.000299 0.002102 0.001007 0.001831 \n", "28 0.000474 0.000076 0.000280 0.000422 -0.000033 0.000436 0.000369 \n", "29 0.002841 0.000024 0.000726 0.001582 0.000497 0.002237 -0.000092 \n", "30 0.000130 -0.000119 -0.000066 0.000112 0.000059 0.000113 0.000135 \n", "31 0.000516 0.000702 0.000260 0.000547 0.000815 0.000190 0.000561 \n", "32 -0.000305 -0.000203 -0.000045 -0.000183 -0.000423 -0.000412 -0.000544 \n", "33 0.007996 0.015255 0.007950 0.010278 0.021851 0.009230 0.024402 \n", "34 0.001515 -0.001879 -0.003638 0.004167 -0.000977 0.003568 0.001709 \n", "35 0.007504 0.011276 0.019385 -0.002945 0.020405 -0.001569 0.007003 \n", "36 -0.000034 -0.002686 -0.000878 0.001329 -0.001112 0.002059 0.000288 \n", "37 0.028807 0.010368 0.014050 0.007810 0.026553 0.009914 0.026500 \n", "38 0.000528 -0.000446 -0.000184 -0.001062 0.000593 0.000575 0.002940 \n", "39 -0.050562 -0.030190 -0.021981 -0.012192 -0.029633 -0.037296 -0.067220 \n", "40 -0.000501 -0.001837 -0.001011 -0.002222 0.000031 0.000068 0.000062 \n", "41 0.010647 -0.005511 0.002638 -0.004853 -0.009829 0.008542 0.011590 \n", "42 -0.000056 -0.000738 0.000059 0.000456 -0.000396 0.000904 0.000512 \n", "43 0.018908 0.010602 0.000302 0.003231 0.007630 0.011715 0.019595 \n", "44 0.000012 -0.000538 -0.000914 0.002817 -0.001998 0.001149 0.002069 \n", "45 -0.000383 0.042374 0.022364 
-0.023450 0.043353 0.006033 -0.006614 \n", "46 0.000122 0.000352 0.000184 0.000139 0.000887 0.000173 0.000558 \n", "47 -0.000092 -0.000513 -0.000241 -0.000285 -0.000141 -0.000277 -0.000758 \n", "\n", "[48 rows x 8870 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_parquet(\"vep_results_vcf_py.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Developer API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform variant effect prediction, Decima creates dataset and dataloader from the given set of variants:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_metadata:latest, 628.05MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1290.4MB/s)\n" ] } ], "source": [ "from decima.data.dataset import VariantDataset\n", "\n", "dataset = VariantDataset(df_variant)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dataset prepares one_hot encoded sequence with gene mask which is ready to pass to the model:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "164" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(dataset)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'seq': tensor([[0., 1., 0., ..., 1., 0., 1.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 1., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 1., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.]]),\n", " 'warning': []}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([5, 524288])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[0][\"seq\"].shape" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chromposrefaltgenestartendstrandgene_mask_startgene_mask_endrel_posref_txalt_txtss_dist
0chr11000018GAFAM41C5164551040743-16384017267240725CT-123115
1chr11002308TCFAM41C5164551040743-16384017267238435AG-125405
2chr11000018GANOC2L5988611123149-163840178946123131CT-40709
3chr11002308TCNOC2L5988611123149-163840178946120841AG-42999
4chr11000018GAPERM16216451145933-163840170729145915CT-17925
.............................................
77chr1109728286TTTGGSTM5109547940110072228+163840227488180346TTTG16506
78chr1109728807TGGGSTM5109547940110072228+163840227488180867TGG17027
79chr1109727471ACALX3109710224110234512-163840174642507041TG343201
80chr1109728286TTTGALX3109710224110234512-163840174642506226AAAC342386
81chr1109728807TGGALX3109710224110234512-163840174642505705ACC341865
\n", "

82 rows × 14 columns

\n", "
" ], "text/plain": [ " chrom pos ref alt gene start end strand \\\n", "0 chr1 1000018 G A FAM41C 516455 1040743 - \n", "1 chr1 1002308 T C FAM41C 516455 1040743 - \n", "2 chr1 1000018 G A NOC2L 598861 1123149 - \n", "3 chr1 1002308 T C NOC2L 598861 1123149 - \n", "4 chr1 1000018 G A PERM1 621645 1145933 - \n", ".. ... ... ... .. ... ... ... ... \n", "77 chr1 109728286 TTT G GSTM5 109547940 110072228 + \n", "78 chr1 109728807 T GG GSTM5 109547940 110072228 + \n", "79 chr1 109727471 A C ALX3 109710224 110234512 - \n", "80 chr1 109728286 TTT G ALX3 109710224 110234512 - \n", "81 chr1 109728807 T GG ALX3 109710224 110234512 - \n", "\n", " gene_mask_start gene_mask_end rel_pos ref_tx alt_tx tss_dist \n", "0 163840 172672 40725 C T -123115 \n", "1 163840 172672 38435 A G -125405 \n", "2 163840 178946 123131 C T -40709 \n", "3 163840 178946 120841 A G -42999 \n", "4 163840 170729 145915 C T -17925 \n", ".. ... ... ... ... ... ... \n", "77 163840 227488 180346 TTT G 16506 \n", "78 163840 227488 180867 T GG 17027 \n", "79 163840 174642 507041 T G 343201 \n", "80 163840 174642 506226 AAA C 342386 \n", "81 163840 174642 505705 A CC 341865 \n", "\n", "[82 rows x 14 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.variants" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact decima_rep0:latest, 2155.88MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:1.2 (1808.4MB/s)\n", "\u001b[34m\u001b[1mwandb\u001b[0m: Downloading large artifact human_state_dict_fold0:latest, 709.30MB. 1 files... \n", "\u001b[34m\u001b[1mwandb\u001b[0m: 1 of 1 files downloaded. \n", "Done. 0:0:0.5 (1487.8MB/s)\n" ] } ], "source": [ "from decima.hub import load_decima_model\n", "\n", "model = load_decima_model(device=device)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model has `predict_on_dataset` method which performs prediction for the dataset object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.\n", "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n", "/home/celikm5/miniforge3/envs/decima/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: PossibleUserWarning: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=95` in the `DataLoader` to improve performance.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "24aebc5e88744bd7b1361c1fde3bcd26", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Predicting: | | 0/? 
[00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.figure(figsize=(4, 4), dpi=200)\n", "plt.scatter(preds[0, :, 0].cpu().numpy(), preds[1, :, 0].cpu().numpy())\n", "plt.xlabel(\"gene expression for ref allele\")\n", "plt.ylabel(\"gene expression for alt allele\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 4 }