* joint first author, # joint corresponding author

Cedric Landerer, Maxim Scheremetjew, HongKee Moon, Lena Hersemann, Agnes Toth-Petroczy
deTELpy: Python package for high-throughput detection of amino acid substitutions in mass spectrometry datasets.
Bioinformatics, Art. No. doi: 10.1093/bioinformatics/btae424 (2024)
Open Access DOI
Errors in the processing of genetic information during protein synthesis can lead to phenotypic mutations, such as amino acid substitutions, e.g. by transcription or translation errors. While genetic mutations can be readily identified using DNA sequencing, and mutations due to transcription errors by RNA sequencing, translation errors can only be identified proteome-wide using mass spectrometry.

Maria Luisa Romero Romero#, Jonas Poehls, Anastasiia Kirilenko, Doris Richter, Tobias Jumel, Anna Shevchenko, Agnes Toth-Petroczy#
Environment modulates protein heterogeneity through transcriptional and translational stop codon readthrough.
Nat Commun, 15(1) Art. No. 4446 (2024)
Open Access DOI
Stop codon readthrough events give rise to longer proteins, which may alter the protein's function, thereby generating short-lasting phenotypic variability from a single gene. In order to systematically assess the frequency and origin of stop codon readthrough events, we designed a library of reporters. We introduced premature stop codons into mScarlet, which enabled high-throughput quantification of protein synthesis termination errors in E. coli using fluorescent microscopy. We found that under stress conditions, stop codon readthrough may occur at rates as high as 80%, depending on the nucleotide context, suggesting that evolution frequently samples stop codon readthrough events. The analysis of selected reporters by mass spectrometry and RNA-seq showed that not only translation but also transcription errors contribute to stop codon readthrough. The RNA polymerase was more likely to misincorporate a nucleotide at premature stop codons. Proteome-wide detection of stop codon readthrough by mass spectrometry revealed that temperature regulated the expression of cryptic sequences generated by stop codon readthrough in E. coli. Overall, our findings suggest that the environment affects the accuracy of protein production, which increases protein heterogeneity when the organisms need to adapt to new conditions.

Pavan K Bendapudi, Sumaiya Nazeen, Justine Ryu, Onuralp Söylemez, Alissa K Robbins, Betty Rouaisnel, Jillian K O'Neil, Ruchika Pokhriyal, Moua Yang, Meaghan Colling, Bryce Pasko, Michael Bouzinier, Lindsay Tomczak, Lindsay Collier, David A Barrios, Sanjay Ram, Agnes Toth-Petroczy, Joel B Krier, Elizabeth Fieg, Walter 'Sunny' Dzik, James Hudspeth, Olga Pozdnyakova, Valentina Nardi, James R Knight, Richard L Maas, Shamil Sunyaev, Julie-Aurore Losman
Low-frequency inherited complement receptor variants are associated with purpura fulminans.
Blood, 143(11) 1032-1044 (2024)
Extreme disease phenotypes can provide key insights into the pathophysiology of common conditions, but studying such cases is challenging due to their rarity and the limited statistical power of existing methods. Herein, we used a novel approach to pathway-based mutational burden testing, the rare variant trend test (RVTT), to investigate genetic risk factors for an extreme form of sepsis-induced coagulopathy, infectious purpura fulminans (PF). In addition to prospective patient sample collection, we electronically screened over 10.4 million medical records from 4 large hospital systems and identified historical cases of PF for which archived specimens were available to perform germline whole-exome sequencing. We found a significantly increased burden of low-frequency, putatively function-altering variants in the complement system in patients with PF compared with unselected patients with sepsis (P = .01). A multivariable logistic regression analysis found that the number of complement system variants per patient was independently associated with PF after controlling for age, sex, and disease acuity (P = .01). Functional characterization of PF-associated variants in the immunomodulatory complement receptors CR3 and CR4 revealed that they result in partial or complete loss of anti-inflammatory CR3 function and/or gain of proinflammatory CR4 function. Taken together, these findings suggest that inherited defects in CR3 and CR4 predispose to the maladaptive hyperinflammation that characterizes severe sepsis with coagulopathy.

Cedric Landerer, Jonas Poehls, Agnes Toth-Petroczy
Fitness Effects of Phenotypic Mutations at Proteome-Scale Reveal Optimality of Translation Machinery.
Mol Biol Evol, 41(3) Art. No. msae048 (2024)
Open Access DOI
Errors in protein translation can lead to non-genetic, phenotypic mutations, including amino acid misincorporations. While phenotypic mutations can increase protein diversity, the systematic characterization of their proteome-wide frequencies and their evolutionary impact has been lacking. Here, we developed a mechanistic model of translation errors to investigate how selection acts on protein populations produced by amino acid misincorporations. We fitted the model to empirical observations of misincorporations obtained from over a hundred mass spectrometry datasets of E. coli and S. cerevisiae. We found that on average 20% to 23% of proteins synthesized in the cell are expected to harbor at least one amino acid misincorporation, and that deleterious misincorporations are less likely to occur. Combining misincorporation probabilities and the estimated fitness effects of amino acid substitutions in a population genetics framework, we found 74% of mistranslation events in E. coli and 94% in S. cerevisiae to be neutral. We further show that the set of available synonymous tRNAs is subject to evolutionary pressure, as the presence of missing tRNAs would increase codon-anticodon cross-reactivity and misincorporation error rates. Overall, we find that the translation machinery is likely optimal in E. coli and S. cerevisiae and that both local solutions at the level of codons and a global solution such as the tRNA pool can mitigate the impact of translation errors. We provide a framework to study the evolutionary impact of codon-specific translation errors and a method for their proteome-wide detection across organisms and conditions.

M. Luisa Romero-Romero, H Garcia-Seisdedos
Agglomeration: when folded proteins clump together.
Biophys Rev, 15(6) 1987-2003 (2023)
Open Access DOI
Protein self-association is a widespread phenomenon that results in the formation of multimeric protein structures with critical roles in cellular processes. Protein self-association can lead to finite protein complexes or open-ended, and potentially, infinite structures. This review explores the concept of protein agglomeration, a process that results from the infinite self-assembly of folded proteins. We highlight its differences from other better-described processes with similar macroscopic features, such as aggregation and liquid-liquid phase separation. We review the sequence, structural, and biophysical factors influencing protein agglomeration. Lastly, we briefly discuss the implications of agglomeration in evolution, disease, and aging. Overall, this review highlights the need to study protein agglomeration for a better understanding of cellular processes.

Jan Simon Schuhmacher, Susanne Tom Dieck, Savvas Christoforidis, Cedric Landerer, Jimena Davila Gallesio, Lena Hersemann, Sarah Seifert, Ramona Schäfer, Angelika Giner, Agnes Toth-Petroczy, Yannis Kalaidzidis, Katherine E Bohnsack, Markus T Bohnsack, Erin M Schuman, Marino Zerial
The Rab5 effector FERRY links early endosomes with mRNA localization.
Mol Cell, 83(11) 1839-1855 (2023)
Open Access DOI
Localized translation is vital to polarized cells and requires precise and robust distribution of different mRNAs and ribosomes across the cell. However, the underlying molecular mechanisms are poorly understood and important players are lacking. Here, we discovered a Rab5 effector, the five-subunit endosomal Rab5 and RNA/ribosome intermediary (FERRY) complex, that recruits mRNAs and ribosomes to early endosomes through direct mRNA-interaction. FERRY displays preferential binding to certain groups of transcripts, including mRNAs encoding mitochondrial proteins. Deletion of FERRY subunits reduces the endosomal localization of transcripts in cells and has a significant impact on mRNA levels. Clinical studies show that genetic disruption of FERRY causes severe brain damage. We found that, in neurons, FERRY co-localizes with mRNA on early endosomes, and mRNA loaded FERRY-positive endosomes are in close proximity of mitochondria. FERRY thus transforms endosomes into mRNA carriers and plays a key role in regulating mRNA distribution and transport.

Nadia Rostam, Soumyadeep Ghosh, Chi Fung Willis Chow, Anna Hadarovich, Cedric Landerer, Rajat Ghosh, HongKee Moon, Lena Hersemann, Diana M Mitrea, Isaac A Klein, Anthony Hyman, Agnes Toth-Petroczy
CD-CODE: crowdsourcing condensate database and encyclopedia.
Nat Methods, 20(5) 673-676 (2023)
Open Access DOI
The discovery of biomolecular condensates transformed our understanding of intracellular compartmentalization of molecules. To integrate interdisciplinary scientific knowledge about the function and composition of biomolecular condensates, we developed the crowdsourcing condensate database and encyclopedia. CD-CODE is a community-editable platform, which includes a database of biomolecular condensates based on the literature, an encyclopedia of relevant scientific terms and a crowdsourcing web application. Our platform will accelerate the discovery and validation of biomolecular condensates, and facilitate efforts to understand their role in disease and as therapeutic targets.

Federica Luppino, Ivan Adzhubei, Christopher Cassa#, Agnes Toth-Petroczy#
DeMAG predicts the effects of variants in clinically actionable genes by integrating structural and evolutionary epistatic features.
Nat Commun, 14(1) Art. No. 2230 (2023)
Open Access DOI
Despite the increasing use of genomic sequencing in clinical practice, the interpretation of rare genetic variants remains challenging even in well-studied disease genes, resulting in many patients with Variants of Uncertain Significance (VUSs). Computational Variant Effect Predictors (VEPs) provide valuable evidence in variant assessment, but they are prone to misclassifying benign variants, contributing to false positives. Here, we develop Deciphering Mutations in Actionable Genes (DeMAG), a supervised classifier for missense variants trained using extensive diagnostic data available in 59 actionable disease genes (American College of Medical Genetics and Genomics Secondary Findings v2.0, ACMG SF v2.0). DeMAG improves performance over existing VEPs by reaching balanced specificity (82%) and sensitivity (94%) on clinical data, and includes a novel epistatic feature, the 'partners score', which leverages evolutionary and structural partnerships of residues. The 'partners score' provides a general framework for modeling epistatic interactions, integrating both clinical and functional information. We provide our tool and predictions for all missense variants in 316 clinically actionable disease genes to facilitate the interpretation of variants and improve clinical decision-making.

Maria Luisa Romero Romero, Cedric Landerer, Jonas Poehls, Agnes Toth-Petroczy
Phenotypic mutations contribute to protein diversity and shape protein evolution.
Protein Sci, 31(9) Art. No. e4397 (2022)
Open Access DOI
Errors in DNA replication generate genetic mutations, while errors in transcription and translation lead to phenotypic mutations. Phenotypic mutations are orders of magnitude more frequent than genetic ones, yet they are less understood. Here, we review the types of phenotypic mutations, their quantifications, and their role in protein evolution and disease. The diversity generated by phenotypic mutation can facilitate adaptive evolution. Indeed, phenotypic mutations, such as ribosomal frameshift and stop codon readthrough, sometimes serve to regulate protein expression and function. Phenotypic mutations have often been linked to fitness decrease and diseases. Thus, understanding the protein heterogeneity and phenotypic diversity caused by phenotypic mutations will advance our understanding of protein evolution and have implications on human health and diseases.

Colin Jackson#, Agnes Toth-Petroczy, Rachel Kolodny, Florian Hollfelder, Monika Fuxreiter, Shina Caroline Lynn Kamerlin#, Nobuhiko Tokuriki#
Adventures on the Routes of Protein Evolution-In Memoriam Dan Salah Tawfik (1955-2021).
J Mol Biol, 434(7) Art. No. 167462 (2022)
Understanding how proteins evolved not only resolves mysteries of the past, but also helps address challenges of the future, particularly those relating to the design and engineering of new protein functions. Here we review the work of Dan S. Tawfik, one of the pioneers of this area, highlighting his seminal contributions in diverse fields such as protein design, high throughput screening, protein stability, fundamental enzyme-catalyzed reactions and promiscuity, that underpin biology and the origins of life. We discuss the influence of his work on how our models of enzyme and protein function have developed and how the main driving forces of molecular evolution were elucidated. The discovery of the rugged routes of evolution has enabled many practical applications, some of which are now widely used.

Belin Selcen Beydag-Tasöz, Joyson Verner D'Costa, Lena Hersemann, Federica Luppino, Yung Hae Kim, Christoph Zechner, Anne Grapin-Botton
A combined transcriptional and dynamic roadmap of single human pancreatic endocrine progenitors reveals proliferative capacity and differentiation continuum.
bioRxiv, Art. No. (2021)
Open Access DOI
Basic helix-loop-helix genes, particularly proneural genes, are well-described triggers of cell differentiation, yet limited information exists on their dynamics, notably in human development. Here, we focus on Neurogenin 3 (NEUROG3), which is crucial for pancreatic endocrine lineage initiation. Using a double reporter to monitor endogenous NEUROG3 transcription and protein expression in single cells in 2D and 3D models of human pancreas development, we show peaks of expression for the RNA and protein at 22 and 11 hours respectively, approximately two-fold slower than in mice, and remarkable heterogeneity in peak expression levels all triggering differentiation. We also reveal that some human endocrine progenitors proliferate once, mainly at the onset of differentiation, rather than forming a subpopulation with sustained proliferation. Using reporter index-sorted single-cell RNA-seq data, we statistically map transcriptome to dynamic behaviors of cells in live imaging and uncover transcriptional states associated with variations in motility as NEUROG3 levels change, a method applicable to other contexts.

Jodie Ouahed*, Judith R Kelsen*, Waldo A Spessott, Kameron Kooshesh, Maria L Sanmillan, Noor Dawany, Kathleen E Sullivan, Kathryn Hamilton, Voytek Slowik, Sergey Nejentsev, João Farela Neves, Helena Flores, Wendy K Chung, Ashley Wilson, Kwame Anyane-Yeboa, Karen Wou, Preti Jain, Michael Field, Sophia Tollefson, Maiah H Dent, Dalin Li, Takeo Naito, Dermot P B McGovern, Andrew C Kwong, Faith Taliaferro, Jose Ordovas-Montanes, Bruce Horwitz, Daniel Kotlarz, Christoph Klein, Jonathan Evans, Jill Dorsey, Neil Warner, Abdul Elkadri, Aleixo M Muise, Jeffrey Goldsmith, Benjamin Thompson, Karin R Engelhardt, Andrew J Cant, Sophie Hambleton, Andrew Barclay, Agnes Toth-Petroczy, Dana Vuzman, Nikkola Carmichael, Corneliu Bodea, Christopher Cassa, Marcella Devoto, Richard L Maas, Edward M Behrens#, Claudio G Giraudo#, Scott B Snapper
Variants in STXBP3 are Associated with Very Early Onset Inflammatory Bowel Disease, Bilateral Sensorineural Hearing Loss and Immune Dysregulation.
J Crohns Colitis, 15(11) 1908-1919 (2021)
Very early onset inflammatory bowel disease [VEOIBD] is characterized by intestinal inflammation affecting infants and children less than 6 years of age. To date, over 60 monogenic aetiologies of VEOIBD have been identified, many characterized by highly penetrant recessive or dominant variants in underlying immune and/or epithelial pathways. We sought to identify the genetic cause of VEOIBD in a subset of patients with a unique clinical presentation.

Anwoy Kumar Mohanty, Dana Vuzman, Laurent Francioli, Christopher Cassa, Agnes Toth-Petroczy, Shamil Sunyaev
novoCaller: a Bayesian network approach for de novo variant calling from pedigree and population sequence data.
Bioinformatics, 35(7) 1174-1180 (2019)
De novo mutations (i.e. newly occurring mutations) are a predominant cause of sporadic dominant monogenic diseases and play a significant role in the genetics of complex disorders. De novo mutation studies also inform population genetics models and shed light on the biology of DNA replication and repair. Despite the broad interest, there is room for improvement with regard to the accuracy of de novo mutation calling.

Jose Velilla*, Michael Mario Marchetti*, Agnes Toth-Petroczy, Claire Grosgogeat, Alexis H Bennett, Nikkola Carmichael, Elicia Estrella, Basil T Darras, Natasha Y Frank, Joel B Krier, Rachelle Gaudet, Vandana A Gupta
Homozygous TRPV4 mutation causes congenital distal spinal muscular atrophy and arthrogryposis.
Neurol Genet, 5(2) Art. No. e312 (2019)
Open Access DOI
To identify the genetic cause of disease in a form of congenital spinal muscular atrophy and arthrogryposis (CSMAA).

Mirna Bilus, Maja Semanjski, Marko Mocibob, Igor Zivkovic, Nevena Cvetesic, Dan S Tawfik, Agnes Toth-Petroczy, Boris Macek, Ita Gruic-Sovulj
On the Mechanism and Origin of Isoleucyl-tRNA Synthetase Editing against Norvaline.
J Mol Biol, 431(6) 1284-1297 (2019)
Aminoacyl-tRNA synthetases (aaRSs), the enzymes responsible for coupling tRNAs to their cognate amino acids, minimize translational errors by intrinsic hydrolytic editing. Here, we compared norvaline (Nva), a linear amino acid not coded for protein synthesis, to the proteinogenic, branched valine (Val) in their propensity to mistranslate isoleucine (Ile) in proteins. We show that in the synthetic site of isoleucyl-tRNA synthetase (IleRS), Nva and Val are activated and transferred to tRNA at similar rates. The efficiency of the synthetic site in pre-transfer editing of Nva and Val also appears to be similar. Post-transfer editing was, however, more rapid with Nva and consequently IleRS misaminoacylates Nva-tRNAIle at a slower rate than Val-tRNAIle. Accordingly, an Escherichia coli strain lacking IleRS post-transfer editing misincorporated Nva and Val in the proteome to a similar extent and at the same Ile positions. However, Nva mistranslation inflicted higher toxicity than Val, in agreement with IleRS editing being optimized for hydrolysis of Nva-tRNAIle. Furthermore, we found that the evolutionary-related IleRS, leucyl- and valyl-tRNA synthetases (I/L/VRSs), all efficiently hydrolyze Nva-tRNAs even when editing of Nva seems redundant. We thus hypothesize that editing of Nva-tRNAs had already existed in the last common ancestor of I/L/VRSs, and that the editing domain of I/L/VRSs had primarily evolved to prevent infiltration of Nva into modern proteins.

Maria Luisa Romero Romero, Fan Yang, Yu-Ru Lin, Agnes Toth-Petroczy, Igor N. Berezovsky, Alexander Goncearenco, Wen Yang, Alon Wellner, Fanindra Kumar-Deshmukh, Michal Sharon, David Baker, Gabriele Varani, Dan S Tawfik
Simple yet functional phosphate-loop proteins.
Proc Natl Acad Sci U.S.A., 115(51) 11943-11950 (2018)
Abundant and essential motifs, such as phosphate-binding loops (P-loops), are presumed to be the seeds of modern enzymes. The Walker-A P-loop is absolutely essential in modern NTPase enzymes, in mediating binding, and transfer of the terminal phosphate groups of NTPs. However, NTPase function depends on many additional active-site residues placed throughout the protein's scaffold. Can motifs such as P-loops confer function in a simpler context? We applied a phylogenetic analysis that yielded a sequence logo of the putative ancestral Walker-A P-loop element: a β-strand connected to an α-helix via the P-loop. Computational design incorporated this element into de novo designed β-α repeat proteins with relatively few sequence modifications. We obtained soluble, stable proteins that unlike modern P-loop NTPases bound ATP in a magnesium-independent manner. Foremost, these simple P-loop proteins avidly bound polynucleotides, RNA, and single-strand DNA, and mutations in the P-loop's key residues abolished binding. Binding appears to be facilitated by the structural plasticity of these proteins, including quaternary structure polymorphism that promotes a combined action of multiple P-loops. Accordingly, oligomerization enabled a 55-aa protein carrying a single P-loop to confer avid polynucleotide binding. Overall, our results show that the P-loop Walker-A motif can be implemented in small and simple β-α repeat proteins, primarily as a polynucleotide binding motif.

Thomas A Hopf, Anna G Green, Benjamin Schubert, Sophia Mersmann, Charlotta P I Schärfe, John Ingraham, Agnes Toth-Petroczy, Kelly Brock, Adam J Riesselman, Perry Palmedo, ChulHee Kang, Robert Sheridan, Eli J Draizen, Christian Dallago, Chris Sander, Debora S Marks
The EVcouplings Python framework for coevolutionary sequence analysis.
Bioinformatics, Art. No. doi: 10.1093/bioinformatics/bty862 (2018)
Coevolutionary sequence analysis has become a commonly used technique for de novo prediction of the structure and function of proteins, RNA, and protein complexes. We present the EVcouplings framework, a fully integrated open-source application and Python package for coevolutionary analysis. The framework enables generation of sequence alignments, calculation and evaluation of evolutionary couplings (ECs), and de novo prediction of structure and mutation effects. The combination of an easy to use, flexible command line interface and an underlying modular Python package makes the full power of coevolutionary analyses available to entry-level and advanced users.

Alireza Haghighi, Joel B Krier, Agnes Toth-Petroczy, Christopher Cassa, Natasha Y Frank, Nikkola Carmichael, Elizabeth Fieg, Andrew Bjonnes, Anwoy Kumar Mohanty, Lauren C Briere, Sharyn Lincoln, Stephanie Lucia, Vandana A Gupta, Onuralp Söylemez, Sheila Sutti, Kameron Kooshesh, Haiyan Qiu, Christopher J Fay, Victoria Perroni, Jamie Valerius, Meredith Hanna, Alexander Frank, Jodie Ouahed, Scott B Snapper, Angeliki Pantazi, Sameer S Chopra, Ignaty Leshchiner, Nathan O Stitziel, Anna Feldweg, Michael Mannstadt, Joseph Loscalzo, David A Sweetser, Eric Liao, Joan M Stoler, Catherine B Nowak, Pedro A Sanchez-Lara, Ophir D. Klein, Hazel Perry, Nikolaos A Patsopoulos, Soumya Raychaudhuri, Wolfram Goessling, Robert C Green, Christine E Seidman, Calum A MacRae, Shamil Sunyaev, Richard L Maas, Dana Vuzman
An integrated clinical program and crowdsourcing strategy for genomic sequencing and Mendelian disease gene discovery.
NPJ Genom Med, 3 21-21 (2018)
Despite major progress in defining the genetic basis of Mendelian disorders, the molecular etiology of many cases remains unknown. Patients with these undiagnosed disorders often have complex presentations and require treatment by multiple health care specialists. Here, we describe an integrated clinical diagnostic and research program using whole-exome and whole-genome sequencing (WES/WGS) for Mendelian disease gene discovery. This program employs specific case ascertainment parameters, a WES/WGS computational analysis pipeline that is optimized for Mendelian disease gene discovery with variant callers tuned to specific inheritance modes, an interdisciplinary crowdsourcing strategy for genomic sequence analysis, matchmaking for additional cases, and integration of the findings regarding gene causality with the clinical management plan. The interdisciplinary gene discovery team includes clinical, computational, and experimental biomedical specialists who interact to identify the genetic etiology of the disease, and when so warranted, to devise improved or novel treatments for affected patients. This program effectively integrates the clinical and research missions of an academic medical center and affords both diagnostic and therapeutic options for patients suffering from genetic disease. It may therefore be germane to other academic medical institutions engaged in implementing genomic medicine programs.

Agnes Toth-Petroczy, Perry Palmedo, John Ingraham, Thomas A Hopf, Bonnie Berger, Chris Sander, Debora S Marks
Structured States of Disordered Proteins from Genomic Sequences.
Cell, 167(1) 158-170 (2016)
Protein flexibility ranges from simple hinge movements to functional disorder. Around half of all human proteins contain apparently disordered regions with little 3D or functional information, and many of these proteins are associated with disease. Building on the evolutionary couplings approach previously successful in predicting 3D states of ordered proteins and RNA, we developed a method to predict the potential for ordered states for all apparently disordered proteins with sufficiently rich evolutionary information. The approach is highly accurate (79%) for residue interactions as tested in more than 60 known disordered regions captured in a bound or specific condition. Assessing the potential for structure of more than 1,000 apparently disordered regions of human proteins reveals a continuum of structural order with at least 50% with clear propensity for three- or two-dimensional states. Co-evolutionary constraints reveal hitherto unseen structures of functional importance in apparently disordered proteins.

Paola Laurino, Ágnes Tóth-Petróczy, Rubén Meana-Pañeda, Wei Lin, Donald G Truhlar, Dan S Tawfik
An Ancient Fingerprint Indicates the Common Ancestry of Rossmann-Fold Enzymes Utilizing Different Ribose-Based Cofactors.
PLoS Biol, 14(3) Art. No. e1002396 (2016)
Open Access DOI
Nucleoside-based cofactors are presumed to have preceded proteins. The Rossmann fold is one of the most ancient and functionally diverse protein folds, and most Rossmann enzymes utilize nucleoside-based cofactors. We analyzed an omnipresent Rossmann ribose-binding interaction: a carboxylate side chain at the tip of the second β-strand (β2-Asp/Glu). We identified a canonical motif, defined by the β2-topology and unique geometry. The latter relates to the interaction being bidentate (both ribose hydroxyls interacting with the carboxylate oxygens), to the angle between the carboxylate and the ribose, and to the ribose's ring configuration. We found that this canonical motif exhibits hallmarks of divergence rather than convergence. It is uniquely found in Rossmann enzymes that use different cofactors, primarily SAM (S-adenosyl methionine), NAD (nicotinamide adenine dinucleotide), and FAD (flavin adenine dinucleotide). Ribose-carboxylate bidentate interactions in other folds are not only rare but also have a different topology and geometry. We further show that the canonical geometry is not dictated by a physical constraint--geometries found in noncanonical interactions have similar calculated bond energies. Overall, these data indicate the divergence of several major Rossmann-fold enzyme classes, with different cofactors and catalytic chemistries, from a common pre-LUCA (last universal common ancestor) ancestor that possessed the β2-Asp/Glu motif.

Liat Rockah-Shmuel, Ágnes Tóth-Petróczy, Dan S Tawfik
Systematic Mapping of Protein Mutational Space by Prolonged Drift Reveals the Deleterious Effects of Seemingly Neutral Mutations.
PLoS Comput Biol, 11(8) Art. No. e1004421 (2015)
Open Access DOI
Systematic mappings of the effects of protein mutations are becoming increasingly popular. Unexpectedly, these experiments often find that proteins are tolerant to most amino acid substitutions, including substitutions in positions that are highly conserved in nature. To obtain a more realistic distribution of the effects of protein mutations, we applied a laboratory drift comprising 17 rounds of random mutagenesis and selection of M.HaeIII, a DNA methyltransferase. During this drift, multiple mutations gradually accumulated. Deep sequencing of the drifted gene ensembles allowed determination of the relative effects of all possible single nucleotide mutations. Despite being averaged across many different genetic backgrounds, about 67% of all nonsynonymous, missense mutations were evidently deleterious, and an additional 16% were likely to be deleterious. In the early generations, the frequency of most deleterious mutations remained high. However, by the 17th generation, their frequency was consistently reduced, and those remaining were accepted alongside compensatory mutations. The tolerance to mutations measured in this laboratory drift correlated with sequence exchanges seen in M.HaeIII's natural orthologs. The biophysical constraints dictating purging in nature and in this laboratory drift also seemed to overlap. Our experiment therefore provides an improved method for measuring the effects of protein mutations that more closely replicates the natural evolutionary forces, and thereby a more realistic view of the mutational space of proteins.

Monika Fuxreiter, Ágnes Tóth-Petróczy, Daniel A Kraut, Andreas Matouschek, Roderick Y H Lim, Bin Xue, Lukasz Kurgan, Vladimir N Uversky
Disordered proteinaceous machines.
Chem. Rev., 114(13) 6806-6843 (2014)

Agnes Tóth-Petróczy, Dan S Tawfik
Hopeful (protein InDel) monsters?
Structure, 22(6) 803-804 (2014)
In this issue of Structure, Arpino and colleagues describe in atomic detail how a protein stomachs a deletion within a helix, an event that rarely occurs in nature or in the lab. Can insertions and deletions (InDels) trigger dramatic structural transitions?

Agnes Tóth-Petróczy, Dan S Tawfik
The robustness and innovability of protein folds.
Curr Opin Struct Biol, 26 131-138 (2014)
Assignment of protein folds to functions indicates that >60% of folds carry out one or two enzymatic functions, while few folds, for example, the TIM-barrel and Rossmann folds, exhibit hundreds. Are there structural features that make a fold amenable to functional innovation (innovability)? Do these features relate to robustness--the ability to readily accumulate sequence changes? We discuss several hypotheses regarding the relationship between the architecture of a protein and its evolutionary potential. We describe how, in a seemingly paradoxical manner, opposite properties, such as high stability and rigidity versus conformational plasticity and structural order versus disorder, promote robustness and/or innovability. We hypothesize that polarity--differentiation and low connectivity between a protein's scaffold and its active-site--is a key prerequisite for innovability.

Liat Rockah-Shmuel, Ágnes Tóth-Petróczy, Asaf Sela, Omri Wurtzel, Rotem Sorek, Dan S Tawfik
Correlated occurrence and bypass of frame-shifting insertion-deletions (InDels) to give functional proteins.
PLoS Genet, 9(10) Art. No. e1003882 (2013)
Open Access DOI
Short insertions and deletions (InDels) comprise an important part of the natural mutational repertoire. InDels are, however, highly deleterious, primarily because two-thirds result in frame-shifts. Bypass through slippage over homonucleotide repeats by transcriptional and/or translational infidelity is known to occur sporadically. However, the overall frequency of bypass and its relation to sequence composition remain unclear. Intriguingly, the occurrence of InDels and the bypass of frame-shifts are mechanistically related - occurring through slippage over repeats by DNA or RNA polymerases, or by the ribosome, respectively. Here, we show that the frequency of frame-shifting InDels, and the frequency by which they are bypassed to give full-length, functional proteins, are indeed highly correlated. Using a laboratory genetic drift, we have exhaustively mapped all InDels that occurred within a single gene. We thus compared the naive InDel repertoire that results from DNA polymerase slippage to the frame-shifting InDels tolerated following selection to maintain protein function. We found that InDels repeatedly occurred, and were bypassed, within homonucleotide repeats of 3-8 bases. The longer the repeat, the higher was the frequency of InDels formation, and the more frequent was their bypass. Besides an expected 8A repeat, other types of repeats, including short ones, and G and C repeats, were bypassed. Although obtained in vitro, our results indicate a direct link between the genetic occurrence of InDels and their phenotypic rescue, thus suggesting a potential role for frame-shifting InDels as bridging evolutionary intermediates.

Eynat Dellus-Gur, Agnes Toth-Petroczy, Mikael Elias, Dan S Tawfik
What makes a protein fold amenable to functional innovation? Fold polarity and stability trade-offs.
J Mol Biol, 425(14) 2609-2621 (2013)
Protein evolvability includes two elements--robustness (or neutrality, mutations having no effect) and innovability (mutations readily inducing new functions). How are these two conflicting demands bridged? Does the ability to bridge them relate to the observation that certain folds, such as TIM barrels, accommodate numerous functions, whereas other folds support only one? Here, we hypothesize that the key to innovability is polarity--an active site composed of flexible, loosely packed loops alongside a well-separated, highly ordered scaffold. We show that highly stabilized variants of TEM-1 β-lactamase exhibit selective rigidification of the enzyme's scaffold while the active-site loops maintained their conformational plasticity. Polarity therefore results in stabilizing, compensatory mutations not trading off, but instead promoting the acquisition of new activities. Indeed, computational analysis indicates that in folds that accommodate only one function throughout evolution, for example, dihydrofolate reductase, ≥ 60% of the active-site residues belong to the scaffold. In contrast, folds associated with multiple functions such as the TIM barrel show high scaffold-active-site polarity (~20% of the active site comprises scaffold residues) and >2-fold higher rates of sequence divergence at active-site positions. Our work suggests structural measures of fold polarity that appear to be correlated with innovability, thereby providing new insights regarding protein evolution, design, and engineering.

Agnes Tóth-Petróczy, Dan S Tawfik
Protein insertions and deletions enabled by neutral roaming in sequence space.
Mol Biol Evol, 30(4) 761-771 (2013)
Backbone modifications via insertions and deletions (InDels) may exert dramatic effects, for better (mediating new functions) and for worse (causing loss of structure and/or function). However, contrary to point mutations (substitutions), our knowledge of the evolution and structural-functional effects of InDels is limited and so is our capability to engineer them. We sought to assess how deleterious InDels are relative to point mutations and understand the mechanisms that mediate their acceptance. Analysis of the evolution of InDels in orthologous protein phylogenies indicated that their rate of purging is 9- to 100-fold higher than for point mutations. In yeast, for example, the substitutions-to-InDels ratio is approximately 14-fold higher in protein coding than in noncoding regions. The incorporation of InDels relative to substitutions is not only slow but also nonlinear. On average, ≥50 substitutions accumulate before the appearance of the first InDel. We also found enriched substitutions in sequential and spatial proximity to InDels, suggesting that certain substitutions are correlated with InDels. As indicated by the lag in InDels accumulation, some of these correlated substitutions may have occurred first, as apparently neutral mutations, and later enabled the accumulation of InDels that would be otherwise purged. Thus, compensatory substitutions may follow InDels in an "adaptive walk" as traditionally assumed, but might also accumulate first, by "neutral roaming." The dynamics of InDels accumulation also depends on their genomic frequencies: InDels in flies are 4-fold more frequent than in yeast and tend to be compensated rather than enabled.

Tzachi Hagai, Ágnes Tóth-Petróczy, Ariel Azia, Yaakov Levy
The origins and evolution of ubiquitination sites.
Mol Biosyst, 8(7) 1865-1877 (2012)
Protein ubiquitination is central to the regulation of various pathways in eukaryotes. The process of ubiquitination and its cellular outcome were investigated in hundreds of proteins to date. Despite this, the evolution of this regulatory mechanism has not yet been addressed comprehensively. Here, we quantify the rates of evolutionary changes of ubiquitination and SUMOylation (Small Ubiquitin-like MOdifier) sites. We estimate the time at which they first appeared, and compare them to acetylation and phosphorylation sites and to unmodified residues. We observe that the various modification sites studied exhibit similar rates. Mammalian ubiquitination sites are weakly more conserved than unmodified lysine residues, and a higher degree of relative conservation is observed when analyzing bona fide ubiquitination sites. Various reasons can be proposed for the limited level of excess conservation of ubiquitination, including shifts in locations of the sites, the presence of alternative sites, and changes in the regulatory pathways. We observe that disappearance of sites may be compensated by the presence of a lysine residue in close proximity, which is significant when compared to evolutionary patterns of unmodified lysine residues, especially in disordered regions. This emphasizes the importance of analyzing a window in the vicinity of functional residues, as well as the capability of the ubiquitination machinery to ubiquitinate residues in a certain region. Using prokaryotic orthologs of ubiquitinated proteins, we study how ubiquitination sites were formed, and observe that while sometimes sequence additions and rearrangements are involved, in many cases the ubiquitination machinery utilizes an already existing sequence without significantly changing it. Finally, we examine the evolution of ubiquitination, which is linked with other modifications, to infer how these complex regulatory modules have evolved. Our study gives initial insights into the formation of ubiquitination sites, their degree of conservation in various species, and their co-evolution with other posttranslational modifications.

Tzachi Hagai, Ariel Azia, Ágnes Tóth-Petróczy, Yaakov Levy
Intrinsic disorder in ubiquitination substrates.
J Mol Biol, 412(3) 319-324 (2011)
The ubiquitin-proteasome system is responsible for the degradation of numerous proteins in eukaryotes. Degradation is an essential process in many cellular pathways and involves the proteasome degrading a wide variety of unrelated substrates while retaining specificity in terms of its targets for destruction and avoiding unneeded proteolysis. How the proteasome achieves this task is the subject of intensive research. Many proteins are targeted for degradation by being covalently attached to a poly-ubiquitin chain. Several studies have indicated the importance of a disordered region for efficient degradation. Here, we analyze a data set of 482 in vivo ubiquitinated substrates and a subset in which ubiquitination is known to mediate degradation. We show that, in contrast to phosphorylation sites and other regulatory regions, ubiquitination sites do not tend to be located in disordered regions and that a large number of substrates are modified at structured regions. In degradation-mediated ubiquitination, there is a significant bias of ubiquitination sites to be in disordered regions; however, a significant number is still found in ordered regions. Moreover, in many cases, disordered regions are absent from ubiquitinated substrates or are located far away from the modified region. These surprising findings raise the question of how these proteins are successfully unfolded and ultimately degraded by the proteasome. They indicate that the folded domain must be perturbed by some additional factor, such as the p97 complex, or that ubiquitination may induce unfolding.

Agnes Tóth-Petróczy, Dan S Tawfik
Slow protein evolutionary rates are dictated by surface-core association.
Proc Natl Acad Sci U.S.A., 108(27) 11151-11156 (2011)
Why do certain proteins evolve much slower than others? We compared not only rates per protein, but also rates per position within individual proteins. For ∼90% of proteins, the distribution of positional rates exhibits three peaks: a peak of slow evolving residues, with average log₂[normalized rate], log₂μ, of ca. -2, corresponding primarily to core residues; a peak of fast evolving residues (log₂μ ∼ 0.5) largely corresponding to surface residues; and a very fast peak (log₂μ ∼ 2) associated with disordered segments. However, a unique fraction of proteins that evolve very slowly exhibit not only a negligible fast peak, but also a peak with a log₂μ ∼ -4, rather than the standard core peak of -2. Thus, a "freeze" of a protein's surface seems to stop core evolution as well. We also observed a much higher fraction of substitutions in potentially interacting residues than expected by chance, including substitutions in pairs of contacting surface-core residues. Overall, the data suggest that accumulation of surface substitutions enables the acceptance of substitutions in core positions. The underlying reason for slow evolution might therefore be a highly constrained surface due to protein-protein interactions or the need to prevent misfolding or aggregation. If the surface is inaccessible to substitutions, so becomes the core, thus resulting in very slow overall rates.

Agnes Tóth-Petróczy, Istvan Simon, Monika Fuxreiter, Yaakov Levy
Disordered tails of homeodomains facilitate DNA recognition by providing a trade-off between folding and specific binding.
J Am Chem Soc, 131(42) 15084-15085 (2009)
DNA binding specificity of homeodomain transcription factors is critically affected by disordered N-terminal tails (N-tails) that undergo a disorder-to-order transition upon interacting with DNA. The mechanism of the binding process and the molecular basis of selectivity are largely unknown. The coupling between folding and DNA binding of Antp and NK-2 homeodomains was investigated by coarse-grained molecular dynamics simulations using the native protein-DNA complex. The disordered N-tails were found to decrease the stability of the free proteins by competing with the native intramolecular interactions and increasing the radius of gyration of the homeodomain cores. In the presence of DNA, however, the N-tails increase the stability of the homeodomains by reducing the coupling between folding and DNA binding. Detailed studies on Antp demonstrate that the N-tail anchors the homeodomain to DNA and accelerates formation of specific interactions all along the protein-DNA interface. The tidal electrostatic forces between the N-tail and DNA induce faster and tighter binding of the homeodomain core to the DNA; this mechanism conforms to a fly-casting mechanism. In agreement with experiments, the N-tail of Antp also improves the binding affinity for DNA, with a major contribution by the released waters. These results imply that, by varying the degree of folding upon binding and thereby modulating the size of the buried surface, disordered N-tails of homeodomains can fine-tune the binding strength for specific DNA sequences. Overall, both the kinetics and thermodynamics of specific DNA binding by homeodomains can be improved by N-tails using a mechanism that is inherent in their disordered state.

Agnes Tóth-Petróczy, Christopher J Oldfield, István Simon, Yuichiro Takagi, A. Keith Dunker, Vladimir N Uversky, Monika Fuxreiter
Malleable machines in transcription regulation: the mediator complex.
PLoS Comput Biol, 4(12) Art. No. e1000243 (2008)
Open Access DOI
The Mediator complex provides an interface between gene-specific regulatory proteins and the general transcription machinery including RNA polymerase II (RNAP II). The complex has a modular architecture (Head, Middle, and Tail) and cryoelectron microscopy analysis suggested that it undergoes dramatic conformational changes upon interactions with activators and RNAP II. These rearrangements have been proposed to play a role in the assembly of the preinitiation complex and also to contribute to the regulatory mechanism of Mediator. In analogy to many regulatory and transcriptional proteins, we reasoned that Mediator might also utilize intrinsically disordered regions (IDRs) to facilitate structural transitions and transmit transcriptional signals. Indeed, a high prevalence of IDRs was found in various subunits of Mediator from both Saccharomyces cerevisiae and Homo sapiens, especially in the Tail and the Middle modules. The level of disorder increases from yeast to man, although in both organisms it significantly exceeds that of multiprotein complexes of a similar size. IDRs can contribute to Mediator's function in three different ways: they can individually serve as target sites for multiple partners having distinctive structures; they can act as malleable linkers connecting globular domains that impart modular functionality on the complex; and they can also facilitate assembly and disassembly of complexes in response to regulatory signals. Short segments of IDRs, termed molecular recognition features (MoRFs) distinguished by a high protein-protein interaction propensity, were identified in 16 and 19 subunits of the yeast and human Mediator, respectively. In Saccharomyces cerevisiae, the functional roles of 11 MoRFs have been experimentally verified, and those in the Med8/Med18/Med20 and Med7/Med21 complexes were structurally confirmed. Although the Saccharomyces cerevisiae and Homo sapiens Mediator sequences are only weakly conserved, the arrangements of the disordered regions and their embedded interaction sites are quite similar in the two organisms. All of these data suggest an integral role for intrinsic disorder in Mediator's function.

Agnes Toth-Petroczy, Agnes Szilagyi, Zsolt Ronai#, Maria Sasvari-Szekely, András Guttman#
Validation of a tentative microsatellite marker for the dopamine D4 receptor gene by capillary gel electrophoresis.
J Chromatogr A, 1130(2) 201-205 (2006)
Two to four-basepair-short tandem repeats (i.e. microsatellites) are broadly utilized as genetic markers for mapping disease loci in whole genome search analyses. Based on their close vicinity on chromosome 11, the D11S1984 microsatellite was anticipated as a tentative marker for the dopamine D4 receptor gene. A capillary gel electrophoresis based genotype analysis method and an in-house computational tool were developed for the analysis of the D11S1984 microsatellite marker to examine a healthy Hungarian population of n=106. The data obtained did not suggest significant linkage between the D11S1984 marker and the DRD4 gene.
