Sweet Dreams and Genetic Schemes: Improving CSID Variant Annotation through Integrated Computational and Genetic Analysis

Note: light/dark mode button in top right.

Code is available on my github.

Abstract

This proposal outlines a novel approach that leverages recent advancements in computational tools and genetic analysis to enhance variant annotation for Congenital Sucrase-Isomaltase Deficiency (CSID), facilitating clinical diagnosis and improving understanding of CSID prevalence. Previous studies have been been hindered by historical limitations in study design and data availability. By integrating genetic databases, protein simulation techniques, and variant prediction tools, this study aims to identify new variants of interest and contribute insights into the molecular and genetic pathology of CSID, its clinical implications and prevalence.

Introduction

Historical studies of CSID prevalence and its mutational landscape have been constrained by small sample sizes and homogeneous populations. Early studies that are typically cited feature sample sizes of:

3 in CSID’s first description¹
399, of which 339 were Caucasian²
56 Canadian Eskimos³
97 Greenlanders⁴

In addition, recent findings indicate that the clinical presentation of CSID shares similarities with irritable bowel syndrome, raising concerns about accurate diagnosis^5–7. Furthermore, the potential variability of CSID’s phenotype, including heterozygote carriers that experience symptoms, necessitates a more thorough investigation of its genetic and molecular causes, improvement of diagnoistic tools and a reevaluation of its prevalence^8–10.

The significance of this research is its potential to challenge longstanding assumptions of CSID, and base disease knowledge off of a greater pool of samples sourced from population genetic databases. By leveraging computational methods and population variant data, this study aims to identify previously uncharacterized variants associated with CSID and provide a comprehensive assessment of the prevalence of deleterious mutations in the general population. The results of the study may enable more accurate diagnoses with genetic tools, and inform treatment approaches and standard of care for patients presenting with CSID or IBS-like symptoms.

Genetic Databases and Variant Prediction

Advancements in computational biology offer an opportunity to enhance the prediciton and identification of deleterious mutations and inform diagnostics and prevalence in the population.

Databases such as ClinVar and gnomAD now encompass diverse and extensive population variant data^11,12. The gnomAD v2.1 dataset contains data from 125,748 exomes and 15,708 whole genomes, all mapped to the GRCh37/hg19 reference sequence. As of 17 September 2023, gnomAD v2.1.1 contains 3470 variants for SI, and contains 646 ClinVar variants. Of the 646 ClinVar variants, 29 are annotated as pathogenic/likely pathogenic, 66 have conflicting interpretations of pathogenicity, and 272 are of uncertain significance. Therefore, considering all gnomAD data, most entries do not have clinical annotations or have not been well characterized. Previous work analyzing gnomAD data for the Sucrase-Isomaltase (SI) gene to estimate disease prevalence only included known variants¹³ and did not analyze this population of uncharacterized variants.

GnomAD includes ClinVar clinical annotations of mutations and predictions of variant effects from the Ensembl Variant Effect Predictor (VEP)¹⁴. Predictive tools like VEP can be employed to predict the impact of uncharacterized variants. VEP predicts type of consequence – e.g. missense, frameshift, intron. This is often insufficient to assess the whether a variant is deleterious.

Fortunately, new methods have emerged: both at the protein-ligand simulation level and some protein language-based models, one of which can be applied that predicts effects for all missense mutations¹⁵. A preliminary analysis using this model is provided below, and predicts a number of deleterious missense mutations. This and other new models could be applied to other types of variants such as indels, stop gained or even intron variants^15,16.

Protein Simulation and Interaction Analysis

An orthogonal method to protein language models is simulation at the protein/molecular level. Revolutionary tools like AlphaFold¹⁷ and DiffDock¹⁸ enable the modeling and evaluation of variant-protein interactions. AlphaFold can generate mutant structures from their sequences, and then these structures and their interaction with sucrose or with other proteins can be simulated using a variety of molecular simulation technologies. A promising recent tool is DiffDock¹⁸, iterations of DiffDock to enable prediction protein-protein interactions or improve prediction accuracy^19,20, and more established tools such as AutoDock Vina^21,22. These tools can be used to to predict the effects of SI variants on protein function, i.e. their interactions with sucrose, heterozygotic self-interactions, or interactions with other proteins. Integration of these tools is facilitated by the recently reported Foldy²³, which conveniently packages AlphaFold, DiffDock, Autodock Vina, and databases into a web app that is easily deployed.

However, the sensitivity of either of these tools, protein language model based or protein simulation based, to distinguish benign and deleterious mutations is not guaranteed. Cross referencing predictions from the protein language models and simulation tools, and validating in lab will be necessary. Previous benchmarking of AlphaFold-generated structures with molecular docking had mixed results²⁴. However, emerging tools like DiffDock and PLANTAIN may provide better outcomes. DiffDock reports an accuracy, when simulating without a known binding pocket, of 38%¹⁸. PLANTAIN attempts to improve this accuracy by optimzing the scoring function DiffDock uses, and accounts for binding pockets, and may be more fruitful²⁰. AutoDock tools are more established but may be less accurate¹⁸; but do account for binding pockets and thus may be useful to compare predictions with^21,22.

Genetic Analysis and Protein Language Models

Novel genetic analysis tools have recently emerged, providing insights into the genetic landscape of CSID. While gnomAD and ClinVar have integrated VEP, this model is limited in the scope of its predictions, and a recent protein language model may provide better predictions for missense variants. This, and other protein language models “learn how protein sequence determines many aspects of protein structure and function, including secondary structure, long-distance interactions, post-translational modifications and binding sites”¹⁵.

Preliminary analysis (an interesting figure is shown directly below, the full analysis is in the last section) employing this tool suggests a large number of missense variants of currently unknown consequence may be associated with CSID.

Using this protein language model we can predict the delteriousness of missense mutations. The model outputs a score for each missense mutation, and a threshold of -7.5 to separate deleterious/benign mutations is reported¹⁵. A lower score predicts a more severe mutation. The graph below is constructed using gnomAD data for missense variants and predictions from this model.

Code

# This cell is lengthy and is mostly just the preliminary analysis below all copied into one cell and outputting the relevant figure.
import pandas as pd

df = pd.read_csv("data/gnomAD_v2.1.1_SI_variants_17sep2023.csv")
# gnomad.head()

# create a new column 'D' as the concatenated strings of columns 'A', 'B', and 'C'
df["Variant_ID"] = (
    df["Chromosome"].astype(str)
    + "-"
    + df["Position"].astype(str)
    + "-"
    + df["Reference"].astype(str)
    + "-"
    + df["Alternate"].astype(str)
)

# move the new column 'D' to the far left by reindexing the columns
df = df.reindex(columns=["Variant_ID"] + df.columns.tolist())
include_annotations = ['missense_variant']
mask = df['VEP Annotation'].isin(include_annotations)
df_mis = df[mask]
df_esm = pd.read_csv("data/ESM1b_SI.csv") 

import re
from Bio.SeqUtils import seq1, seq3
   
class seqConversion:
  def __init__(self) -> None:
      pass
  
  # intended to be private method, intermediate for convertGnomadHGVS below.      
  def _parseGnomadHGVS(self, hgvsEntry):
    # Define a regex pattern to match the format "stringNUMBERstring"
    pattern = r'([A-Za-z]+)(\d+)([A-Za-z]+)'

    # Use re.match to find the first match in the input string, need to account for multiple and check the duplicates if exist
    match = re.match(pattern, hgvsEntry)

    if match: # which should be, but good in case weird data i don't anticipate
        # Extract the three parts from the match object
        aa_wt = match.group(1)
        pos = match.group(2)
        aa_mut = match.group(3)

        # Convert the numbers to an integer (if needed)
        pos = int(pos)

        return aa_wt, pos, aa_mut
    # FIXME: May need more handling cases, so far this is working on SI data 22 Sep 2023
    else:
        # Handle the case where no match is found
        return None, None, None

  def convertGnomadHGVS(self, hgvsEntry):
    # NOTE: This assumes we're only passing in the protein-level variants, which, should be the only case this function is called, i.e. missense variants
    
    # strip 'p.'
    hgvsEntry = hgvsEntry.lstrip('p.')
    
    # get the gnomad formatted variables
    aa_wt, pos, aa_mut = self._parseGnomadHGVS(hgvsEntry)
    
    if aa_wt is None or pos is None or aa_mut is None:
            return None
    else:
      # codes to one-letter codes; from BioSeq import above
      aa_wt1 = seq1(aa_wt)
      aa_mut1 = seq1(aa_mut)

      # assemble the esm1b format of gnomad entry
      newFormat = f"{aa_wt1}{pos}{aa_mut1}"
      return newFormat
 
  def getProteinVariants(hgvsEntry):
    p_pattern = r"p.*"  # Use .* to match zero or more occurrences of 'p'
    matches = re.findall(p_pattern, hgvsEntry)

    if matches:
        return matches
    else:
        # Handle the case where no match is found
        return None

# Converter = seqConversion()
# df_mis['HGVS_ESM1b']  = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)

# to deal with error:
Converter = seqConversion()
# df_mis.loc[:, 'HGVS_ESM1b'] = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)
df_mis = df_mis.assign(HGVS_ESM1b=df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS))


merged_df = df_mis.merge(df_esm[['variant', 'score']], left_on='HGVS_ESM1b', right_on='variant', how='left')

import matplotlib.pyplot as plt
import plotly.express as px

# Extract the 'score' column from the merged DataFrame
scores = merged_df["score"]


all_known_path = df['ClinVar Clinical Significance'].isin(['Pathogenic', 'Likely pathogenic', 'Pathogenic/Likely pathogenic  '])
akp_df = df[all_known_path]

pathogenic_missense_variants = merged_df['ClinVar Clinical Significance'].isin(['Pathogenic', 'Likely pathogenic',]) # # adds 23 more samples that have varying scores

pmv_df = merged_df[pathogenic_missense_variants]
pmv_df = pmv_df.sort_values(by='score', ascending=True)

# Filter, get benign bariants
benign_missense_variants = merged_df['ClinVar Clinical Significance'].isin(['Benign', 'Benign/Likely benign', 'Likely benign'])

# bmv: benign missense variants
bmv_df = merged_df[benign_missense_variants]
bmv_df = bmv_df.sort_values(by='score', ascending=True)

# Define the threshold
threshold = -7.5

# Create a new feature 'below threshold' based on the 'score' column
merged_df['below threshold'] = merged_df['score'].apply(lambda x: "yes" if x < threshold else "no")

import plotly.express as px

fig = px.histogram(
    merged_df,
    x="score",
    color = "below threshold",
    nbins=20,
    opacity=1,
    labels={"score": "Scores"},
    title="Distribution of Scores Within Gnomad",
)

# Add markers or lines for known pathogenic variants
for idx, row in pmv_df.iterrows():
    fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity = 0.4,
                  xref='x', yref='paper', line=dict(color='red'))

# Add markers/lines for benign variants
for idx, row in bmv_df.iterrows():
    fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity = 0.4,
                  xref='x', yref='paper', line=dict(color='green'))

# Customize the layout
fig.update_layout(bargap=0.1)

# Show the plot
fig.show()

In this preliminary analysis, ClinVar annotated variants that are benign/likely benign and pathogenic/likely pathogenic are shown with green/red lines overlaid on the distribution of variants in gnomAD and their predicted deleteriousness. There are limited missense annotated variants compared to the overall dataset – but a trend does appear amongst benign mutations to be above the score threshold and be correctly separated.

Overall, there are 531 missense mutations that fall below the threshold and are predicted to be deleterious. To validate these results, a good approach may be to start from the most deleterious (negatively scored) predicted mutation and assess variants with increasingly positive scores.

Background and Collaboration

My background includes an MS in bioinformatics where I gained experience in protein analysis, 2 years in pharameceutical R&D at Eli Lilly performing physicochemical testing on antibodies and peptides to assess stability, and a BS in Chemical Engineering, BS in Molecular and Cellular Biology and BA in Biochemistry.

My motivation to study CSID stems from my own patient experience, where CSID emerged as a potential explanation. I identified a gap in the literature and saw an opportunity to leverage new computational tools to improve patient outcomes. My personal experience in obtatining a diagnosis of this disease has been arduous, and I would find it deeply gratifying to contribute to a more effective diagnostic toolset for the betterment of other patients’ care.

The preparation of this proposal has also been highly informative for career aspirations. I’m actively considering pursuing doctoral studies and this project aligns well with my goal of impacting patients through research.

The proposed project is intricate and collaboration is indispensible. Seeking guidance and partnerships with experts has been a priority from the start. Several opportunities for collaboration include:

Collaborating with Dr. D’Amato, and Dr. Naim for lab confirmation and disease expertise. I’ve also reached out to Dr. Treem, who authored some earlier reviews. Together lab confirmation can be facilitated/organized and a thorough understanding and robust DOE can be achieved.
Engage with the Foldy development team (Keasling Lab/Berkeley National Lab) for features such as DiffDock-PP or PLANTAIN that may be used and integrated into the software; the development in this project may benefit others, or integration of these tools can be conducted in parallel as part of other efforts.
Seek input from the authors of DiffDock and PLANTAIN, and other molecular simulation experts. Their insights would be invaluable for optimizing the approach and understanding the strengths and limitations of these tools.

Cost/Budget

The cost estimates stated here are retrieved from Google Cloud Pricing Calculator on 26 Sep 2023.

Using the Foldy architecture/out of the box solution for AlphaFold and molecular simulation, computation will run on Google Cloud. The SI complex is about 1500 residues; Foldy documentation suggests that 3000 residue complexes be run on a machine type that costs $3700 a month. Therefore a mid-range machine may be appropriate, one of which costs $2682 a month. Low-range machines of about $420-1509 a month are also available but are unlikely to have sufficient resources to run the SI complex. The cost estimates are attached as pdfs.

The scope of the final project will dictate how many variants are run and the implementation of other tools beyond what is included in Foldy (Foldy includes AlphaFold, DiffDock and AutoDock Vina along with database integration). These tools will likely run on the same machine but require additional development time.

Timeline and Milestones

This section to be finalized after nailing down lab confirmation

(Ask about lab confirmation timeline)

This study proposes computational methods and laboratory confirmation. The protein language models are relatively cheap to run, whereas the infrastructure and machines for AlphaFold and molecular simulations are more intensive.

Further, there are models not included in Foldy. Timeline to implementation of different methods will be influenced by potential collaboration partners and documentation of methods, namely PLANTAIN and other protein language models. Using Foldy as-is however, deployment is finished in a few days (time for machine to boot and download relevant databases), after which simulations can be run. Initially, scripts would be developed to construct sequences from Ensembl/UniProt consensus sequence, gnomAD data, or other data sources.

Laboratory confirmation can occur after all results have been generated, or after smaller bolus runs of the simulations. This strategy may be superior to inform model optimization and feasibility.

A reasonable order of events or timeline may look like the below:

Validate protein language model: Lab validation timeline?)
Initial protein simulation analysis: setup of as-is of Foldy and development, validation of scripts to generate mutant protein sequences (2 weeks)
Running a small sample, validate protein simulation strategy; unknown how much time to generate each SI complex, molecular simulation should be relatively quick
Lab confirmation/validation of Fold/docking results(??)
Integration and validation of other models, e.g. other language models, or other simulation like PLANTAIN, DiffDock-PP
More lab confirmation

Therefore within dev/simulation + lab confirmation some idea of validity of models and feasibility can be achieved. Then, conservatively, within … a few months of development and further lab confirmation…? the overall feasibilty and scope of the project can be assessed.

Preliminary Analysis

As of 17 September 2023, the gnomAD v2.1.1 dataset contains 3470 variants, and ClinVar contains 988 variants. The ClinVar annotations are:

Code

"""
This cell does some analysis on gnomAD data for the SI gene. 

As of 17 Sep 2023, the gnomAD page for SI, https://gnomad.broadinstitute.org/gene/ENSG00000090402?dataset=gnomad_r2_1

contains 988 ClinVar variants, and 3470 gnomAD variants. Let's look at how many have been clinically annotated, and how many are present in gnomAD. Then we'll cross-reference with the ESMB1 data below to suggest how many could be deleterious, via that model.
"""
import pandas as pd

df = pd.read_csv("data/gnomAD_v2.1.1_SI_variants_17sep2023.csv")
# gnomad.head()

# create a new column 'D' as the concatenated strings of columns 'A', 'B', and 'C'
df["Variant_ID"] = (
    df["Chromosome"].astype(str)
    + "-"
    + df["Position"].astype(str)
    + "-"
    + df["Reference"].astype(str)
    + "-"
    + df["Alternate"].astype(str)
)

# move the new column 'D' to the far left by reindexing the columns
df = df.reindex(columns=["Variant_ID"] + df.columns.tolist())

# display the updated dataframe
# df.head()

# demonstrates 3470 variants in gnomAD as of this date. 
df.describe()

# 640 variants for this disease, of which there are the 8 unique annotations
# df["ClinVar Clinical Significance"].describe()

# demonstrates the majority are of unknown consequence and could be examined 
print("The below is the makeup of ClinVar variants...\n",
df["ClinVar Clinical Significance"].value_counts()) # /640, 17Sep2023

The below is the makeup of ClinVar variants...
 ClinVar Clinical Significance
Uncertain significance                          272
Likely benign                                   216
Conflicting interpretations of pathogenicity     66
Benign                                           34
Benign/Likely benign                             23
Pathogenic                                       16
Likely pathogenic                                10
Pathogenic/Likely pathogenic                      3
Name: count, dtype: int64

We can also look at the types of variants within gnomAD and their consequence, including variants that are not in ClinVar and have not been clinically annotated. This is the overall makeup of the entire gnomAD dataset:

Code

####
# ok now let me look at what other stuff we have to examine
# df.columns

df["VEP Annotation"].value_counts()

VEP Annotation
intron_variant              1517
missense_variant            1141
synonymous_variant           411
splice_region_variant        191
frameshift_variant            57
stop_gained                   55
splice_donor_variant          32
3_prime_UTR_variant           24
splice_acceptor_variant       24
inframe_deletion              15
start_lost                     2
protein_altering_variant       1
Name: count, dtype: int64

In this study, I propose to employ protein-level tools to predict deleteriousness. In this preliminary analysis the variants that are within gnomAD are paired with their predicted severity from the ESM1b model. We’ll only be able to examine the missense variants with this particular tool; let’s now see how many missense variants have been clinically annotated:

Code

# we can use missense with ESM1b -- some like frameshift, stop gained, um, inframe del, can be examined further or be presumed to be deleterious I'd think. n = 1 protein_altering, this may need to be treated manually the annotation is funky

# Define the annotation types to include (as of 20 Sep 2023, and for this analysis, just missense)
include_annotations = ['missense_variant']
mask = df['VEP Annotation'].isin(include_annotations)
df_mis = df[mask]
# print(df_mis)
print(
df_mis['ClinVar Clinical Significance'].value_counts(),
"\nTotal of the gnomAD missense variants in ClinVar: ",
df_mis['ClinVar Clinical Significance'].value_counts().sum()
)

ClinVar Clinical Significance
Uncertain significance                          236
Conflicting interpretations of pathogenicity     23
Likely benign                                    19
Benign                                            7
Benign/Likely benign                              4
Likely pathogenic                                 2
Pathogenic                                        1
Name: count, dtype: int64 
Total of the gnomAD missense variants in ClinVar:  292

There are 1141 missense mutants recorded in gnomAD. Of the missense mutations that are in Clinvar, the majority are of uncertain significance, 236/292. The ESM1b model has predictions for all possible missense mutations. The next couple code cells serve to get the same format between gnomAD data and ESM1b to describe variants.

(And an aside, the ESM1b model is also capable of accounting for stop gained and inframe deletions, of which there are 55 and 15 recorded in gnomAD; it requires more labor and the sequences themselves, though. There’s also all the intronic variants, these are outside of scope for now.)

Read in the data

Code

# Read in the data
df_esm = pd.read_csv("data/ESM1b_SI.csv")

Define necessary functions

Code

# Define relevant functions/class for later use 
import re
from Bio.SeqUtils import seq1, seq3
   
class seqConversion:
  def __init__(self) -> None:
      pass
  
  # intended to be private method, intermediate for convertGnomadHGVS below.      
  def _parseGnomadHGVS(self, hgvsEntry):
    # Define a regex pattern to match the format "stringNUMBERstring"
    pattern = r'([A-Za-z]+)(\d+)([A-Za-z]+)'

    # Use re.match to find the first match in the input string, need to account for multiple and check the duplicates if exist
    match = re.match(pattern, hgvsEntry)

    if match: # which should be, but good in case weird data i don't anticipate
        # Extract the three parts from the match object
        aa_wt = match.group(1)
        pos = match.group(2)
        aa_mut = match.group(3)

        # Convert the numbers to an integer (if needed)
        pos = int(pos)

        return aa_wt, pos, aa_mut
    # FIXME: May need more handling cases, so far this is working on SI data 22 Sep 2023
    else:
        # Handle the case where no match is found
        return None, None, None

  def convertGnomadHGVS(self, hgvsEntry):
    # NOTE: This assumes we're only passing in the protein-level variants, which, should be the only case this function is called, i.e. missense variants
    
    # strip 'p.'
    hgvsEntry = hgvsEntry.lstrip('p.')
    
    # get the gnomad formatted variables
    aa_wt, pos, aa_mut = self._parseGnomadHGVS(hgvsEntry)
    
    if aa_wt is None or pos is None or aa_mut is None:
            return None
    else:
      # codes to one-letter codes; from BioSeq import above
      aa_wt1 = seq1(aa_wt)
      aa_mut1 = seq1(aa_mut)

      # assemble the esm1b format of gnomad entry
      newFormat = f"{aa_wt1}{pos}{aa_mut1}"
      return newFormat
 
  def getProteinVariants(hgvsEntry):
    p_pattern = r"p.*"  # Use .* to match zero or more occurrences of 'p'
    matches = re.findall(p_pattern, hgvsEntry)

    if matches:
        return matches
    else:
        # Handle the case where no match is found
        return None

We now finally have: - Only missense variants in this dataframe - Functions for converson in the class definition, used in Converter object - Now we can define new compatible format as a feature in gnomAD data, can cross reference gnomAD <> ESM1b model

So – define the 1-letter, ESM1b variant format as a new feature in gnomAD dataframe

Code

# we defined this above, can do again for clarity
# Converter = seqConversion()
# df_mis['HGVS_ESM1b']  = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)
# df_mis['HGVS_ESM1b'].head()
Converter = seqConversion()
# df_mis.loc[:, 'HGVS_ESM1b'] = df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS)
df_mis = df_mis.assign(HGVS_ESM1b=df_mis['HGVS Consequence'].apply(Converter.convertGnomadHGVS))

Feature to enable cross referencing now defined.

So, now we can update our two dataframes, gnomAD and ESM1b. First, add the score feature into gnomAD.

Code

# Merge the 'score' feature from df_scores into df_mis based on matching 'variant' or 'HGVS_ESM1b' values
merged_df = df_mis.merge(df_esm[['variant', 'score']], left_on='HGVS_ESM1b', right_on='variant', how='left')

# merged_df.head()

Now let’s see the distribution of scores within gnomAD.

Code

import matplotlib.pyplot as plt
import plotly.express as px

# Extract the 'score' column from the merged DataFrame
scores = merged_df["score"]


fig = px.histogram(
    merged_df,
    x="score",
    nbins=20,
    opacity=1,
    labels={"score": "Scores"},
    title="Distribution of Scores Within Gnomad",
)

# Customize the layout
fig.update_layout(bargap=0.1)

# Show the plot
fig.show()

In their paper, the authors of the model found good separation at a threshold of -7.5:

Code

# Count the number of entries with a score below -15
count_below_minus_75 = (merged_df['score'] < -7.5).sum()

# Print the count
print(f'Number of entries with a score below -7.5: {count_below_minus_75}')

Number of entries with a score below -7.5: 531

For the missense variants in gnomAD, this gives a predicted 531 deleterious variants. This is far more numerous than the current number of known pathogenic variants, and only a few known missense variants can be used to compare against this prediction. The below two histograms will attempt to gage how useful this threshold may actually be in this scenario.

And again, this is only for missense mutations.

The below shows the three known/likely MISSENSE mutations that are deleterious

Code

# only got 3 that are missense, 23 conflicting. versus...
merged_df['ClinVar Clinical Significance'].value_counts()

ClinVar Clinical Significance
Uncertain significance                          236
Conflicting interpretations of pathogenicity     23
Likely benign                                    19
Benign                                            7
Benign/Likely benign                              4
Likely pathogenic                                 2
Pathogenic                                        1
Name: count, dtype: int64

And against all variants in ClinVar:

Code

# 29, with 66 conflicting
df['ClinVar Clinical Significance'].value_counts()

ClinVar Clinical Significance
Uncertain significance                          272
Likely benign                                   216
Conflicting interpretations of pathogenicity     66
Benign                                           34
Benign/Likely benign                             23
Pathogenic                                       16
Likely pathogenic                                10
Pathogenic/Likely pathogenic                      3
Name: count, dtype: int64

And against the known/likely pathogenic variants in ClinVar, and see their cause:

Code

all_known_path = df['ClinVar Clinical Significance'].isin(['Pathogenic', 'Likely pathogenic', 'Pathogenic/Likely pathogenic  '])
akp_df = df[all_known_path]
akp_df['VEP Annotation'].value_counts()

VEP Annotation
frameshift_variant         9
stop_gained                7
splice_donor_variant       5
missense_variant           3
splice_acceptor_variant    2
Name: count, dtype: int64

29 pathogenic/likely pathogenic in ClinVar – these other variants, the 26 besides the 3 missense, are frameshit, stop gained and splice effect variants. The ESM1b model is capable of also examining stop gained and indel variants, but requires more intense analysis.

The next couple cells define and show dataframes for the pathogenic and benign variants to enable marking the figures. The authors of ESM1b note that a score threshold of -7.5 tends to give good separation between delterious and benign mutations.

Pathogenic missense dataframe:

Code

# Filter the 'gnomad' DataFrame to select pathogenic variants
pathogenic_missense_variants = merged_df['ClinVar Clinical Significance'].isin(['Pathogenic', 'Likely pathogenic',]) # # adds 23 more samples that have varying scores

pmv_df = merged_df[pathogenic_missense_variants]
pmv_df = pmv_df.sort_values(by='score', ascending=True)
print("Only pathogenic, and likely pathogenic")
pmv_df[['variant','score']]

Only pathogenic, and likely pathogenic

	variant	score
356	C1229Y	-11.661
133	T1606I	-10.267
795	D536V	-8.792

Benign missense dataframe:

Code

# merged_df['ClinVar Clinical Significance'].value_counts() # peek the categories

# Filter, get benign bariants
benign_missense_variants = merged_df['ClinVar Clinical Significance'].isin(['Benign', 'Benign/Likely benign', 'Likely benign'])

# bmv: benign missense variants
bmv_df = merged_df[benign_missense_variants]
bmv_df = bmv_df.sort_values(by='score', ascending=True)
print("Only benign, and likely benign")
bmv_df[['variant','score']]

Only benign, and likely benign

	variant	score
888	I389N	-11.627
875	G407R	-10.531
1041	P173A	-9.097
496	Q1000P	-8.770
1135	V15F	-8.397
1072	T122A	-7.857
334	T1258A	-7.578
323	G1272D	-7.140
649	R774K	-6.877
367	P1200S	-6.744
549	L914I	-6.691
995	T231A	-6.417
395	E1158Q	-6.373
119	D1623A	-6.256
638	Y781F	-5.500
994	T231S	-5.457
555	F905C	-5.382
15	S1795L	-5.089
150	T1574S	-4.799
94	V1667I	-4.738
95	V1667L	-4.634
215	K1473R	-3.960
45	Q1749K	-3.652
8	R1808H	-2.664
11	T1802S	-2.647
239	E1414K	-2.449
1019	S200G	-1.885
523	R963K	-1.597
135	A1602V	-1.220
181	M1523I	-0.146

With the very limited sample size of 3, all known/likely pathogenic missense variants’ score is below -7.5. And known/likely benign tend to be above -7.5. However, limited data precludes any real inference here.

Create the feature first, the plotly express API does not accept conditional statements for coloring.

Code

# Define the threshold
threshold = -7.5

# Create a new feature 'below threshold' based on the 'score' column
merged_df['below threshold'] = merged_df['score'].apply(lambda x: "yes" if x < threshold else "no")

Code

# import matplotlib.pyplot as plt
import plotly.express as px

fig = px.histogram(
    merged_df,
    x="score",
    color = "below threshold",
    nbins=20,
    opacity=1,
    labels={"score": "Scores"},
    title="Distribution of Scores Within Gnomad",
)

# Add markers or lines for known pathogenic variants
for idx, row in pmv_df.iterrows():
    fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity = 0.4,
                  xref='x', yref='paper', line=dict(color='red'))

# Add markers/lines for benign variants
for idx, row in bmv_df.iterrows():
    fig.add_shape(type='line', x0=row['score'], x1=row['score'], y0=0, y1=1, opacity = 0.4,
                  xref='x', yref='paper', line=dict(color='green'))

# Customize the layout
fig.update_layout(bargap=0.1)

# Show the plot
fig.show()

This is the most informative figure in this preliminary analysis. We see that, yes, the benign tend to be greater than -7.5; but there are more benign samples below the -7.5 threshold than there are pathogenic samples, and overall we have very limited data here to work with. The threshold can’t be proven in our circumstance to be useful. There is at least a trend for benign variants to be above the threshold.

But we do have recorded variants at the lowest score that we could start examining first, and at least identify some new variants that can be added. These can be modeled in AlphaFold and then examined manually, and put through the simulation pipeline.

And so, for the sake of seeing what this most deleterious variant can demonstrate if modeled in AlphaFold… let’s get the lowest scored missense mutation is.

Code

print("Lowest score:", merged_df.score.min())
merged_df[['variant','score']].tail()

Lowest score: -16.797

	variant	score
1136	L13P	-16.797
1137	S12P	-8.858
1138	F6C	-7.303
1139	R3I	-6.888
1140	A2E	-6.332

If we look for this mutation in P14410 in UniProt, https://www.ebi.ac.uk/ProtVar/query?search=P14410%20L13P it’s also predicted deleterious by CADD, a model from 2021 https://cadd.gs.washington.edu/info. We can also cross reference with this model in future, but ESM1b in its paper demonstrates superiority.

References

Weijers, H. A., va de KAMER, J. H., Mossel, D. A. & Dicke, W. K. Diarrhoea caused by deficiency of sugar-splitting enzymes. Lancet (London, England) 2, 296–297 (1960).

Welsh, J. D., Poley, J. R., Bhatia, M. & Stevenson, D. E. Intestinal disaccharidase activities in relation to age, race, and mucosal damage. Gastroenterology 75, 847–855 (1978).

Ellestad-Sayed, J., Haworth, J. & Hildes, J. Disaccharide malabsorption and dietary patterns in two Canadian Eskimo Communities1234. The American Journal of Clinical Nutrition 31, 1473–1478 (1978).

Gudmand-Høyer, E., Fenger, H. J., Kern-Hansen, P. & Madsen, P. R. Sucrase deficiency in Greenland. Incidence and genetic aspects. Scandinavian Journal of Gastroenterology 22, 24–28 (1987).

Henström, M. et al. Functional variants in the sucraseisomaltase gene associate with increased risk of irritable bowel syndrome. Gut 67, 263–270 (2018).

Garcia-Etxebarria, K. et al. Increased Prevalence of Rare Sucrase-isomaltase Pathogenic Variants in Irritable Bowel Syndrome Patients. Clinical Gastroenterology and Hepatology 16, 1673–1676 (2018).

Kim, S. B., Calmet, F. H., Garrido, J., Garcia-Buitrago, M. T. & Moshiree, B. Sucrase-Isomaltase Deficiency as a Potential Masquerader in Irritable Bowel Syndrome. Digestive Diseases and Sciences 65, 534–540 (2020).

Husein, D. M., Wanes, D., Marten, L. M., Zimmer, K.-P. & Naim, H. Y. Heterozygotes Are a Potential New Entity among Homozygotes and Compound Heterozygotes in Congenital Sucrase-Isomaltase Deficiency. Nutrients 11, 2290 (2019).

Haberman, Y. et al. Congenital Sucrase-isomaltase Deficiency: A Novel Compound Heterozygous Mutation Causing Aberrant Protein Localization. Journal of pediatric gastroenterology and nutrition 64, 770–776 (2017).

10.

Taskin, D. G. et al. Prevalence of congenital sucrase-isomaltase deficiency in Turkey may be much higher than the estimates. Journal of Genetics 102, 31 (2023).

11.

Chen, S. et al. A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. 2022.03.20.485034 (2022) doi:10.1101/2022.03.20.485034.

12.

Landrum, M. J. et al. ClinVar: Public archive of interpretations of clinically relevant variants. Nucleic Acids Research 44, D862–D868 (2016).

13.

De Leusse, C., Roman, C., Roquelaure, B. & Fabre, A. Estimating the prevalence of congenital disaccharidase deficiencies using allele frequencies from gnomAD. Archives de Pédiatrie 29, 599–603 (2022).

14.

McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biology 17, 122 (2016).

15.

Brandes, N., Goldman, G., Wang, C. H., Ye, C. J. & Ntranos, V. Genome-wide prediction of disease variant effects with a deep protein language model. Nature Genetics (2023) doi:10.1038/s41588-023-01465-0.

16.

Kurosawa, R. et al. PDIVAS: Pathogenicity predictor for Deep-Intronic Variants causing Aberrant Splicing. 2023.03.20.23287464 (2023) doi:10.1101/2023.03.20.23287464.

17.

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

18.

Corso, G., Stärk, H., Jing, B., Barzilay, R. & Jaakkola, T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. (2023) doi:10.48550/arXiv.2210.01776.

19.

Ketata, M. A. et al. DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models. (2023).

20.

Brocidiacono, M., Popov, K. I., Koes, D. R. & Tropsha, A. PLANTAIN: Diffusion-inspired Pose Score Minimization for Fast and Accurate Molecular Docking. (2023).

21.

Trott, O. & Olson, A. J. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading. Journal of computational chemistry 31, 455–461 (2010).

22.

Santos-Martins, D. et al. Accelerating AutoDock4 with GPUs and Gradient-Based Local Search. Journal of Chemical Theory and Computation 17, 1060–1073 (2021).

23.

Roberts, J. B. et al. Foldy: A web application for interactive protein structure analysis. (2023) doi:10.1101/2023.05.11.540333.

24.

Wong, F. et al. Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Molecular Systems Biology 18, e11081 (2022).