This proposal outlines a novel approach that leverages recent advancements in computational tools and genetic analysis to enhance variant annotation for Congenital Sucrase-Isomaltase Deficiency (CSID), facilitating clinical diagnosis and improving understanding of CSID prevalence. Previous studies have been been hindered by historical limitations in study design and data availability. By integrating genetic databases, protein simulation techniques, and variant prediction tools, this study aims to identify new variants of interest and contribute insights into the molecular and genetic pathology of CSID, its clinical implications and prevalence.
Introduction
Historical studies of CSID prevalence and its mutational landscape have been constrained by small sample sizes and homogeneous populations. Early studies that are typically cited feature sample sizes of:
In addition, recent findings indicate that the clinical presentation of CSID shares similarities with irritable bowel syndrome, raising concerns about accurate diagnosis5–7. Furthermore, the potential variability of CSID’s phenotype, including heterozygote carriers that experience symptoms, necessitates a more thorough investigation of its genetic and molecular causes, improvement of diagnoistic tools and a reevaluation of its prevalence8–10.
The significance of this research is its potential to challenge longstanding assumptions of CSID, and base disease knowledge off of a greater pool of samples sourced from population genetic databases. By leveraging computational methods and population variant data, this study aims to identify previously uncharacterized variants associated with CSID and provide a comprehensive assessment of the prevalence of deleterious mutations in the general population. The results of the study may enable more accurate diagnoses with genetic tools, and inform treatment approaches and standard of care for patients presenting with CSID or IBS-like symptoms.
Genetic Databases and Variant Prediction
Advancements in computational biology offer an opportunity to enhance the prediciton and identification of deleterious mutations and inform diagnostics and prevalence in the population.
Databases such as ClinVar and gnomAD now encompass diverse and extensive population variant data11,12. The gnomAD v2.1 dataset contains data from 125,748 exomes and 15,708 whole genomes, all mapped to the GRCh37/hg19 reference sequence. As of 17 September 2023, gnomAD v2.1.1 contains 3470 variants for SI, and contains 646 ClinVar variants. Of the 646 ClinVar variants, 29 are annotated as pathogenic/likely pathogenic, 66 have conflicting interpretations of pathogenicity, and 272 are of uncertain significance. Therefore, considering all gnomAD data, most entries do not have clinical annotations or have not been well characterized. Previous work analyzing gnomAD data for the Sucrase-Isomaltase (SI) gene to estimate disease prevalence only included known variants13 and did not analyze this population of uncharacterized variants.
GnomAD includes ClinVar clinical annotations of mutations and predictions of variant effects from the Ensembl Variant Effect Predictor (VEP)14. Predictive tools like VEP can be employed to predict the impact of uncharacterized variants. VEP predicts type of consequence – e.g. missense, frameshift, intron. This is often insufficient to assess the whether a variant is deleterious.
Fortunately, new methods have emerged: both at the protein-ligand simulation level and some protein language-based models, one of which can be applied that predicts effects for all missense mutations15. A preliminary analysis using this model is provided below, and predicts a number of deleterious missense mutations. This and other new models could be applied to other types of variants such as indels, stop gained or even intron variants15,16.
Protein Simulation and Interaction Analysis
An orthogonal method to protein language models is simulation at the protein/molecular level. Revolutionary tools like AlphaFold17 and DiffDock18 enable the modeling and evaluation of variant-protein interactions. AlphaFold can generate mutant structures from their sequences, and then these structures and their interaction with sucrose or with other proteins can be simulated using a variety of molecular simulation technologies. A promising recent tool is DiffDock18, iterations of DiffDock to enable prediction protein-protein interactions or improve prediction accuracy19,20, and more established tools such as AutoDock Vina21,22. These tools can be used to to predict the effects of SI variants on protein function, i.e. their interactions with sucrose, heterozygotic self-interactions, or interactions with other proteins. Integration of these tools is facilitated by the recently reported Foldy23, which conveniently packages AlphaFold, DiffDock, Autodock Vina, and databases into a web app that is easily deployed.
However, the sensitivity of either of these tools, protein language model based or protein simulation based, to distinguish benign and deleterious mutations is not guaranteed. Cross referencing predictions from the protein language models and simulation tools, and validating in lab will be necessary. Previous benchmarking of AlphaFold-generated structures with molecular docking had mixed results24. However, emerging tools like DiffDock and PLANTAIN may provide better outcomes. DiffDock reports an accuracy, when simulating without a known binding pocket, of 38%18. PLANTAIN attempts to improve this accuracy by optimzing the scoring function DiffDock uses, and accounts for binding pockets, and may be more fruitful20. AutoDock tools are more established but may be less accurate18; but do account for binding pockets and thus may be useful to compare predictions with21,22.
Genetic Analysis and Protein Language Models
Novel genetic analysis tools have recently emerged, providing insights into the genetic landscape of CSID. While gnomAD and ClinVar have integrated VEP, this model is limited in the scope of its predictions, and a recent protein language model may provide better predictions for missense variants. This, and other protein language models “learn how protein sequence determines many aspects of protein structure and function, including secondary structure, long-distance interactions, post-translational modifications and binding sites”15.
Preliminary analysis (an interesting figure is shown directly below, the full analysis is in the last section) employing this tool suggests a large number of missense variants of currently unknown consequence may be associated with CSID.
Using this protein language model we can predict the delteriousness of missense mutations. The model outputs a score for each missense mutation, and a threshold of -7.5 to separate deleterious/benign mutations is reported15. A lower score predicts a more severe mutation. The graph below is constructed using gnomAD data for missense variants and predictions from this model.
Code
# This cell is lengthy and is mostly just the preliminary analysis below all copied into one cell and outputting the relevant figure.import pandas as pddf = pd.read_csv("data/gnomAD_v2.1.1_SI_variants_17sep2023.csv")# gnomad.head()# create a new column 'D' as the concatenated strings of columns 'A', 'B', and 'C'df["Variant_ID"] = ( df["Chromosome"].astype(str)+"-"+ df["Position"].astype(str)+"-"+ df["Reference"].astype(str)+"-"+ df["Alternate"].astype(str))# move the new column 'D' to the far left by reindexing the columnsdf = df.reindex(columns=["Variant_ID"] + df.columns.tolist())include_annotations = ['missense_variant']mask = df['VEP Annotation'].isin(include_annotations)df_mis = df[mask]df_esm = pd.read_csv("data/ESM1b_SI.csv") import refrom Bio.SeqUtils import seq1, seq3class seqConversion:def__init__(self) ->None:pass# intended to be private method, intermediate for convertGnomadHGVS below. def _parseGnomadHGVS(self, hgvsEntry):# Define a regex pattern to match the format "stringNUMBERstring" pattern =r'([A-Za-z]+)(\d+)([A-Za-z]+)'# Use re.match to find the first match in the input string, need to account for multiple and check the duplicates if exist match = re.match(pattern, hgvsEntry)if match: # which should be, but good in case weird data i don't anticipate# Extract the three parts from the match object aa_wt = match.group(1) pos = match.group(2) aa_mut = match.group(3)# Convert the numbers to an integer (if needed) pos =int(pos)return aa_wt, pos, aa_mut# FIXME: May need more handling cases, so far this is working on SI data 22 Sep 2023else:# Handle the case where no match is foundreturnNone, None, Nonedef convertGnomadHGVS(self, hgvsEntry):# NOTE: This assumes we're only passing in the protein-level variants, which, should be the only case this function is called, i.e. missense variants# strip 'p.' hgvsEntry = hgvsEntry.lstrip('p.')# get the gnomad formatted variables aa_wt, pos, aa_mut =self._parseGnomadHGVS(hgvsEntry)if aa_wt isNoneor pos isNoneor aa_mut isNone:returnNoneelse:# codes to one-letter codes; from BioSeq import above aa_wt1 = seq1(aa_wt) aa_mut1 = seq1(aa_mut)# assemble the esm1b format of gnomad entry newFormat =f"{aa_wt1}{pos}{aa_mut1}"return newFormatdef getProteinVariants(hgvsEntry): p_pattern =r"p.*"# Use .* to match zero or more occurrences of 'p' matches = re.findall(p_pattern, hgvsEntry)if matches:return matcheselse:# Handle the case where no match is foundreturnNone# Converter = seqConversion()# df_mis['HGVS_ESM1b'] = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)# to deal with error:Converter = seqConversion()# df_mis.loc[:, 'HGVS_ESM1b'] = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)df_mis = df_mis.assign(HGVS_ESM1b=df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS))merged_df = df_mis.merge(df_esm[['variant', 'score']], left_on='HGVS_ESM1b', right_on='variant', how='left')import matplotlib.pyplot as pltimport plotly.express as px# Extract the 'score' column from the merged DataFramescores = merged_df["score"]all_known_path = df['ClinVar Clinical Significance'].isin(['Pathogenic', 'Likely pathogenic', 'Pathogenic/Likely pathogenic '])akp_df = df[all_known_path]pathogenic_missense_variants = merged_df['ClinVar Clinical Significance'].isin(['Pathogenic', 'Likely pathogenic',]) # # adds 23 more samples that have varying scorespmv_df = merged_df[pathogenic_missense_variants]pmv_df = pmv_df.sort_values(by='score', ascending=True)# Filter, get benign bariantsbenign_missense_variants = merged_df['ClinVar Clinical Significance'].isin(['Benign', 'Benign/Likely benign', 'Likely benign'])# bmv: benign missense variantsbmv_df = merged_df[benign_missense_variants]bmv_df = bmv_df.sort_values(by='score', ascending=True)# Define the thresholdthreshold =-7.5# Create a new feature 'below threshold' based on the 'score' columnmerged_df['below threshold'] = merged_df['score'].apply(lambda x: "yes"if x < threshold else"no")import plotly.express as pxfig = px.histogram( merged_df, x="score", color ="below threshold", nbins=20, opacity=1, labels={"score": "Scores"}, title="Distribution of Scores Within Gnomad",)# Add markers or lines for known pathogenic variantsfor idx, row in pmv_df.iterrows(): fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity =0.4, xref='x', yref='paper', line=dict(color='red'))# Add markers/lines for benign variantsfor idx, row in bmv_df.iterrows(): fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity =0.4, xref='x', yref='paper', line=dict(color='green'))# Customize the layoutfig.update_layout(bargap=0.1)# Show the plotfig.show()
In this preliminary analysis, ClinVar annotated variants that are benign/likely benign and pathogenic/likely pathogenic are shown with green/red lines overlaid on the distribution of variants in gnomAD and their predicted deleteriousness. There are limited missense annotated variants compared to the overall dataset – but a trend does appear amongst benign mutations to be above the score threshold and be correctly separated.
Overall, there are 531 missense mutations that fall below the threshold and are predicted to be deleterious. To validate these results, a good approach may be to start from the most deleterious (negatively scored) predicted mutation and assess variants with increasingly positive scores.
Background and Collaboration
My background includes an MS in bioinformatics where I gained experience in protein analysis, 2 years in pharameceutical R&D at Eli Lilly performing physicochemical testing on antibodies and peptides to assess stability, and a BS in Chemical Engineering, BS in Molecular and Cellular Biology and BA in Biochemistry.
My motivation to study CSID stems from my own patient experience, where CSID emerged as a potential explanation. I identified a gap in the literature and saw an opportunity to leverage new computational tools to improve patient outcomes. My personal experience in obtatining a diagnosis of this disease has been arduous, and I would find it deeply gratifying to contribute to a more effective diagnostic toolset for the betterment of other patients’ care.
The preparation of this proposal has also been highly informative for career aspirations. I’m actively considering pursuing doctoral studies and this project aligns well with my goal of impacting patients through research.
The proposed project is intricate and collaboration is indispensible. Seeking guidance and partnerships with experts has been a priority from the start. Several opportunities for collaboration include:
Collaborating with Dr. D’Amato, and Dr. Naim for lab confirmation and disease expertise. I’ve also reached out to Dr. Treem, who authored some earlier reviews. Together lab confirmation can be facilitated/organized and a thorough understanding and robust DOE can be achieved.
Engage with the Foldy development team (Keasling Lab/Berkeley National Lab) for features such as DiffDock-PP or PLANTAIN that may be used and integrated into the software; the development in this project may benefit others, or integration of these tools can be conducted in parallel as part of other efforts.
Seek input from the authors of DiffDock and PLANTAIN, and other molecular simulation experts. Their insights would be invaluable for optimizing the approach and understanding the strengths and limitations of these tools.
Cost/Budget
The cost estimates stated here are retrieved from Google Cloud Pricing Calculator on 26 Sep 2023.
Using the Foldy architecture/out of the box solution for AlphaFold and molecular simulation, computation will run on Google Cloud. The SI complex is about 1500 residues; Foldy documentation suggests that 3000 residue complexes be run on a machine type that costs $3700 a month. Therefore a mid-range machine may be appropriate, one of which costs $2682 a month. Low-range machines of about $420-1509 a month are also available but are unlikely to have sufficient resources to run the SI complex. The cost estimates are attached as pdfs.
The scope of the final project will dictate how many variants are run and the implementation of other tools beyond what is included in Foldy (Foldy includes AlphaFold, DiffDock and AutoDock Vina along with database integration). These tools will likely run on the same machine but require additional development time.
Timeline and Milestones
This section to be finalized after nailing down lab confirmation
(Ask about lab confirmation timeline)
This study proposes computational methods and laboratory confirmation. The protein language models are relatively cheap to run, whereas the infrastructure and machines for AlphaFold and molecular simulations are more intensive.
Further, there are models not included in Foldy. Timeline to implementation of different methods will be influenced by potential collaboration partners and documentation of methods, namely PLANTAIN and other protein language models. Using Foldy as-is however, deployment is finished in a few days (time for machine to boot and download relevant databases), after which simulations can be run. Initially, scripts would be developed to construct sequences from Ensembl/UniProt consensus sequence, gnomAD data, or other data sources.
Laboratory confirmation can occur after all results have been generated, or after smaller bolus runs of the simulations. This strategy may be superior to inform model optimization and feasibility.
A reasonable order of events or timeline may look like the below:
Validate protein language model: Lab validation timeline?)
Initial protein simulation analysis: setup of as-is of Foldy and development, validation of scripts to generate mutant protein sequences (2 weeks)
Running a small sample, validate protein simulation strategy; unknown how much time to generate each SI complex, molecular simulation should be relatively quick
Lab confirmation/validation of Fold/docking results(??)
Integration and validation of other models, e.g. other language models, or other simulation like PLANTAIN, DiffDock-PP
More lab confirmation
Therefore within dev/simulation + lab confirmation some idea of validity of models and feasibility can be achieved. Then, conservatively, within … a few months of development and further lab confirmation…? the overall feasibilty and scope of the project can be assessed.
Preliminary Analysis
As of 17 September 2023, the gnomAD v2.1.1 dataset contains 3470 variants, and ClinVar contains 988 variants. The ClinVar annotations are:
Code
"""This cell does some analysis on gnomAD data for the SI gene. As of 17 Sep 2023, the gnomAD page for SI, https://gnomad.broadinstitute.org/gene/ENSG00000090402?dataset=gnomad_r2_1contains 988 ClinVar variants, and 3470 gnomAD variants. Let's look at how many have been clinically annotated, and how many are present in gnomAD. Then we'll cross-reference with the ESMB1 data below to suggest how many could be deleterious, via that model."""import pandas as pddf = pd.read_csv("data/gnomAD_v2.1.1_SI_variants_17sep2023.csv")# gnomad.head()# create a new column 'D' as the concatenated strings of columns 'A', 'B', and 'C'df["Variant_ID"] = ( df["Chromosome"].astype(str)+"-"+ df["Position"].astype(str)+"-"+ df["Reference"].astype(str)+"-"+ df["Alternate"].astype(str))# move the new column 'D' to the far left by reindexing the columnsdf = df.reindex(columns=["Variant_ID"] + df.columns.tolist())# display the updated dataframe# df.head()# demonstrates 3470 variants in gnomAD as of this date. df.describe()# 640 variants for this disease, of which there are the 8 unique annotations# df["ClinVar Clinical Significance"].describe()# demonstrates the majority are of unknown consequence and could be examined print("The below is the makeup of ClinVar variants...\n",df["ClinVar Clinical Significance"].value_counts()) # /640, 17Sep2023
The below is the makeup of ClinVar variants...
ClinVar Clinical Significance
Uncertain significance 272
Likely benign 216
Conflicting interpretations of pathogenicity 66
Benign 34
Benign/Likely benign 23
Pathogenic 16
Likely pathogenic 10
Pathogenic/Likely pathogenic 3
Name: count, dtype: int64
We can also look at the types of variants within gnomAD and their consequence, including variants that are not in ClinVar and have not been clinically annotated. This is the overall makeup of the entire gnomAD dataset:
Code
##### ok now let me look at what other stuff we have to examine# df.columnsdf["VEP Annotation"].value_counts()
In this study, I propose to employ protein-level tools to predict deleteriousness. In this preliminary analysis the variants that are within gnomAD are paired with their predicted severity from the ESM1b model. We’ll only be able to examine the missense variants with this particular tool; let’s now see how many missense variants have been clinically annotated:
Code
# we can use missense with ESM1b -- some like frameshift, stop gained, um, inframe del, can be examined further or be presumed to be deleterious I'd think. n = 1 protein_altering, this may need to be treated manually the annotation is funky# Define the annotation types to include (as of 20 Sep 2023, and for this analysis, just missense)include_annotations = ['missense_variant']mask = df['VEP Annotation'].isin(include_annotations)df_mis = df[mask]# print(df_mis)print(df_mis['ClinVar Clinical Significance'].value_counts(),"\nTotal of the gnomAD missense variants in ClinVar: ",df_mis['ClinVar Clinical Significance'].value_counts().sum())
ClinVar Clinical Significance
Uncertain significance 236
Conflicting interpretations of pathogenicity 23
Likely benign 19
Benign 7
Benign/Likely benign 4
Likely pathogenic 2
Pathogenic 1
Name: count, dtype: int64
Total of the gnomAD missense variants in ClinVar: 292
There are 1141 missense mutants recorded in gnomAD. Of the missense mutations that are in Clinvar, the majority are of uncertain significance, 236/292. The ESM1b model has predictions for all possible missense mutations. The next couple code cells serve to get the same format between gnomAD data and ESM1b to describe variants.
(And an aside, the ESM1b model is also capable of accounting for stop gained and inframe deletions, of which there are 55 and 15 recorded in gnomAD; it requires more labor and the sequences themselves, though. There’s also all the intronic variants, these are outside of scope for now.)
Read in the data
Code
# Read in the datadf_esm = pd.read_csv("data/ESM1b_SI.csv")
Define necessary functions
Code
# Define relevant functions/class for later use import refrom Bio.SeqUtils import seq1, seq3class seqConversion:def__init__(self) ->None:pass# intended to be private method, intermediate for convertGnomadHGVS below. def _parseGnomadHGVS(self, hgvsEntry):# Define a regex pattern to match the format "stringNUMBERstring" pattern =r'([A-Za-z]+)(\d+)([A-Za-z]+)'# Use re.match to find the first match in the input string, need to account for multiple and check the duplicates if exist match = re.match(pattern, hgvsEntry)if match: # which should be, but good in case weird data i don't anticipate# Extract the three parts from the match object aa_wt = match.group(1) pos = match.group(2) aa_mut = match.group(3)# Convert the numbers to an integer (if needed) pos =int(pos)return aa_wt, pos, aa_mut# FIXME: May need more handling cases, so far this is working on SI data 22 Sep 2023else:# Handle the case where no match is foundreturnNone, None, Nonedef convertGnomadHGVS(self, hgvsEntry):# NOTE: This assumes we're only passing in the protein-level variants, which, should be the only case this function is called, i.e. missense variants# strip 'p.' hgvsEntry = hgvsEntry.lstrip('p.')# get the gnomad formatted variables aa_wt, pos, aa_mut =self._parseGnomadHGVS(hgvsEntry)if aa_wt isNoneor pos isNoneor aa_mut isNone:returnNoneelse:# codes to one-letter codes; from BioSeq import above aa_wt1 = seq1(aa_wt) aa_mut1 = seq1(aa_mut)# assemble the esm1b format of gnomad entry newFormat =f"{aa_wt1}{pos}{aa_mut1}"return newFormatdef getProteinVariants(hgvsEntry): p_pattern =r"p.*"# Use .* to match zero or more occurrences of 'p' matches = re.findall(p_pattern, hgvsEntry)if matches:return matcheselse:# Handle the case where no match is foundreturnNone
We now finally have: - Only missense variants in this dataframe - Functions for converson in the class definition, used in Converter object - Now we can define new compatible format as a feature in gnomAD data, can cross reference gnomAD <> ESM1b model
So – define the 1-letter, ESM1b variant format as a new feature in gnomAD dataframe
Code
# we defined this above, can do again for clarity# Converter = seqConversion()# df_mis['HGVS_ESM1b'] = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)# df_mis['HGVS_ESM1b'].head()Converter = seqConversion()# df_mis.loc[:, 'HGVS_ESM1b'] = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)df_mis = df_mis.assign(HGVS_ESM1b=df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS))
Feature to enable cross referencing now defined.
So, now we can update our two dataframes, gnomAD and ESM1b. First, add the score feature into gnomAD.
Code
# Merge the 'score' feature from df_scores into df_mis based on matching 'variant' or 'HGVS_ESM1b' valuesmerged_df = df_mis.merge(df_esm[['variant', 'score']], left_on='HGVS_ESM1b', right_on='variant', how='left')# merged_df.head()
Now let’s see the distribution of scores within gnomAD.
Code
import matplotlib.pyplot as pltimport plotly.express as px# Extract the 'score' column from the merged DataFramescores = merged_df["score"]fig = px.histogram( merged_df, x="score", nbins=20, opacity=1, labels={"score": "Scores"}, title="Distribution of Scores Within Gnomad",)# Customize the layoutfig.update_layout(bargap=0.1)# Show the plotfig.show()
In their paper, the authors of the model found good separation at a threshold of -7.5:
Code
# Count the number of entries with a score below -15count_below_minus_75 = (merged_df['score'] <-7.5).sum()# Print the countprint(f'Number of entries with a score below -7.5: {count_below_minus_75}')
Number of entries with a score below -7.5: 531
For the missense variants in gnomAD, this gives a predicted 531 deleterious variants. This is far more numerous than the current number of known pathogenic variants, and only a few known missense variants can be used to compare against this prediction. The below two histograms will attempt to gage how useful this threshold may actually be in this scenario.
And again, this is only for missense mutations.
The below shows the three known/likely MISSENSE mutations that are deleterious
Code
# only got 3 that are missense, 23 conflicting. versus...merged_df['ClinVar Clinical Significance'].value_counts()
29 pathogenic/likely pathogenic in ClinVar – these other variants, the 26 besides the 3 missense, are frameshit, stop gained and splice effect variants. The ESM1b model is capable of also examining stop gained and indel variants, but requires more intense analysis.
The next couple cells define and show dataframes for the pathogenic and benign variants to enable marking the figures. The authors of ESM1b note that a score threshold of -7.5 tends to give good separation between delterious and benign mutations.
Pathogenic missense dataframe:
Code
# Filter the 'gnomad' DataFrame to select pathogenic variantspathogenic_missense_variants = merged_df['ClinVar Clinical Significance'].isin(['Pathogenic', 'Likely pathogenic',]) # # adds 23 more samples that have varying scorespmv_df = merged_df[pathogenic_missense_variants]pmv_df = pmv_df.sort_values(by='score', ascending=True)print("Only pathogenic, and likely pathogenic")pmv_df[['variant','score']]
With the very limited sample size of 3, all known/likely pathogenic missense variants’ score is below -7.5. And known/likely benign tend to be above -7.5. However, limited data precludes any real inference here.
Create the feature first, the plotly express API does not accept conditional statements for coloring.
Code
# Define the thresholdthreshold =-7.5# Create a new feature 'below threshold' based on the 'score' columnmerged_df['below threshold'] = merged_df['score'].apply(lambda x: "yes"if x < threshold else"no")
Code
# import matplotlib.pyplot as pltimport plotly.express as pxfig = px.histogram( merged_df, x="score", color ="below threshold", nbins=20, opacity=1, labels={"score": "Scores"}, title="Distribution of Scores Within Gnomad",)# Add markers or lines for known pathogenic variantsfor idx, row in pmv_df.iterrows(): fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity =0.4, xref='x', yref='paper', line=dict(color='red'))# Add markers/lines for benign variantsfor idx, row in bmv_df.iterrows(): fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity =0.4, xref='x', yref='paper', line=dict(color='green'))# Customize the layoutfig.update_layout(bargap=0.1)# Show the plotfig.show()
This is the most informative figure in this preliminary analysis. We see that, yes, the benign tend to be greater than -7.5; but there are more benign samples below the -7.5 threshold than there are pathogenic samples, and overall we have very limited data here to work with. The threshold can’t be proven in our circumstance to be useful. There is at least a trend for benign variants to be above the threshold.
But we do have recorded variants at the lowest score that we could start examining first, and at least identify some new variants that can be added. These can be modeled in AlphaFold and then examined manually, and put through the simulation pipeline.
And so, for the sake of seeing what this most deleterious variant can demonstrate if modeled in AlphaFold… let’s get the lowest scored missense mutation is.
If we look for this mutation in P14410 in UniProt, https://www.ebi.ac.uk/ProtVar/query?search=P14410%20L13P it’s also predicted deleterious by CADD, a model from 2021 https://cadd.gs.washington.edu/info. We can also cross reference with this model in future, but ESM1b in its paper demonstrates superiority.
Chen, S. et al. A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. 2022.03.20.485034 (2022) doi:10.1101/2022.03.20.485034.
Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome-wide prediction of disease variant effects with a deep protein language model. Nature Genetics (2023) doi:10.1038/s41588-023-01465-0.
16.
Kurosawa, R. et al.PDIVAS: Pathogenicity predictor for Deep-Intronic Variants causing Aberrant Splicing. 2023.03.20.23287464 (2023) doi:10.1101/2023.03.20.23287464.