seeks to understand protein sequence space in the light of evolution and erroneous protein production, with a focus on intrinsically disordered and condensate-forming proteins. We complement the growing body of high-throughput data with rigorous computational studies and provide hypotheses for experimental validation.
Assessing the phenotypic effects of single-nucleotide variants remains a challenge, even in genes with well-established clinical impact. We have developed DeMAG (Deciphering Mutations in Actionable Genes, demag.org), which reaches unprecedented accuracy in the classification of missense variants in clinically actionable genes. Our approach uses protein 3D structures and epistatic residue interactions to derive new predictive features, and draws on the abundant clinical diagnostic data available for these genes to improve performance.
Transcriptional and translational errors generate diverse sequences: most are dysfunctional, but some harbor novel functions (Romero et al., Prot Sci. 2022).
To detect and quantify phenotypic mutations proteome-wide, we combine theoretical modeling, machine learning, and experiments. We developed a theoretical model and a mass spectrometry pipeline to study amino acid misincorporations at proteome scale (Landerer, bioRxiv 2022).
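To illustrate why mass spectrometry can reveal misincorporations, the sketch below computes the peptide mass shift caused by a single amino acid substitution and lists the substitutions compatible with an observed shift. It is a generic illustration, not the pipeline from the cited preprint: the function names and the 0.005 Da tolerance are assumptions chosen for the example, and the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def substitution_shift(original: str, replacement: str) -> float:
    """Mass change of a peptide when `original` is replaced by `replacement`."""
    return MONO[replacement] - MONO[original]


def candidate_substitutions(observed_shift: float, tol: float = 0.005):
    """List (original, replacement) pairs whose shift matches within `tol` Da.

    `tol` is a hypothetical tolerance; a real value depends on the instrument.
    """
    return [
        (orig, repl)
        for orig in MONO
        for repl in MONO
        if orig != repl
        and abs(substitution_shift(orig, repl) - observed_shift) <= tol
    ]


if __name__ == "__main__":
    shift = substitution_shift("G", "A")
    print(f"G->A shift: {shift:.5f} Da")
    print(candidate_substitutions(shift))
```

Several substitutions share the same mass shift, and leucine and isoleucine have identical masses, so a shift alone narrows the candidates without identifying the substitution; fragment spectra are needed to localize and confirm it.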
We are developing algorithms that predict frameshifts from proteomics data. We aim to discover novel frameshift and stop-codon read-through variants, explore the evolutionary potential of slippage sites, and test whether they could lead to promiscuous protein functions.
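One simple way to picture a frameshift variant is to translate a coding sequence in frame up to a slippage site and continue in a shifted frame, which yields a candidate protein sequence that could be searched against proteomics data. The sketch below does only that; it is not the algorithm under development, and `frameshifted_protein`, its parameters, and the toy sequence are hypothetical. It uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def frameshifted_protein(cds: str, codon_index: int, shift: int) -> str:
    """Read `codon_index` codons in frame, then continue `shift` nt out of frame.

    shift=+1 skips one nucleotide; shift=-1 re-reads the previous nucleotide.
    """
    if shift not in (1, -1):
        raise ValueError("shift must be +1 or -1")
    if codon_index < 1:
        raise ValueError("codon_index must be at least 1")
    pos = 3 * codon_index
    prefix = translate(cds[:pos])
    if len(prefix) < codon_index:
        # An in-frame stop codon (or the sequence end) precedes the slip site.
        return prefix
    return prefix + translate(cds[pos + shift:])


if __name__ == "__main__":
    cds = "ATGAAATTTGGGTAA"  # toy sequence: M K F G stop
    print(translate(cds))
    print(frameshifted_protein(cds, 2, +1))
    print(frameshifted_protein(cds, 2, -1))
```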
Most of our understanding of the molecular history of the cell is biased towards well-conserved, ordered protein domains. Intrinsically disordered protein regions (IDRs) are largely unexplored due to their low sequence complexity and low conservation. Yet they have pivotal roles in the cell, including the formation of biomolecular condensates.
We have designed an alignment-free algorithm to assess homology between unalignable IDR sequences (SHARK-dive) and developed a method to identify conserved motifs within a set of homologous IDR sequences (SHARK-capture). Based on evolutionary information, we can classify IDR families using our new tools and propose functional sites that we are validating experimentally.
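To give a sense of what "alignment-free" means, the sketch below compares two sequences by the cosine similarity of their k-mer count vectors, so no residue-by-residue alignment is required. This is a generic textbook technique shown for illustration; it is not the SHARK-dive algorithm, and the function names and the default k = 3 are assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a sequence (empty if the sequence is shorter than k)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two k-mer count vectors; 0.0 if either is empty."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def kmer_similarity(seq1: str, seq2: str, k: int = 3) -> float:
    """Alignment-free similarity between two sequences, in the range [0, 1]."""
    return cosine_similarity(kmer_counts(seq1, k), kmer_counts(seq2, k))


if __name__ == "__main__":
    # Toy low-complexity sequences, invented for the example.
    print(kmer_similarity("SRSRSRSRGGSRSR", "GGSRSRSRSRSR"))
    print(kmer_similarity("SRSRSRSR", "QQQQPQQQ"))
```

Exact k-mer matching is a deliberately crude score: it rewards shared composition and local word content regardless of position, which is the property that makes such measures applicable to sequences that cannot be aligned reliably.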
To integrate interdisciplinary scientific knowledge about the function and composition of biomolecular condensates, we developed the CrowDsourcing COndensate Database and Encyclopedia (CD-CODE.org). CD-CODE is a community-editable platform, which includes a database of biomolecular condensates based on the literature, an encyclopedia of relevant scientific terms, and a crowdsourcing web application.
Based on CD-CODE data, we are developing a sequence-based predictor of proteins involved in condensates.