* joint first author # joint corresponding author

Johannes Girstmair, HongKee Moon, Charlène Brillard, Robert Haase, Pavel Tomancak
Time to Upgrade: A New OpenSPIM Guide to Build and Operate Advanced OpenSPIM Configurations.
Adv Biol (Weinh), 6(4) Art. No. e2101182 (2022)
Open Access DOI
OpenSPIM is an Open Access platform for Selective Plane Illumination Microscopy (SPIM) and allows hundreds of laboratories around the world to generate and process light-sheet data in a cost-effective way due to open-source hardware and software. While setting up a basic OpenSPIM configuration can be achieved expeditiously, correctly assembling and operating more complex OpenSPIM configurations can be challenging for routine standard OpenSPIM users. Detailed instructions on how to equip an OpenSPIM with two illumination sides and two detection axes (X-OpenSPIM) are provided, and a solution is also provided on how the temperature can be controlled in the sample chamber. Additionally, it is demonstrated how to operate it by implementing an ArduinoUNO microcontroller and introducing μOpenSPIM, a new software plugin for OpenSPIM, to facilitate image acquisition. The new software works on any OpenSPIM configuration, comes with drift correction functionality and on-the-fly image processing, and gives users more options in the way time-lapse movies are initially set up and saved. Step-by-step guides are also provided within the Supporting Information and on the website on how to align the lasers, configure the hardware, and acquire images using μOpenSPIM. With this, current OpenSPIM users are empowered in various ways, and newcomers striving to use more advanced OpenSPIM systems are helped.

Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, Sergey Aganezov, Savannah J Hoyt, Mark Diekhans, Glennis A Logsdon, Michael Alonge, Stylianos E Antonarakis, Matthew Borchers, Gerard G Bouffard, Shelise Y Brooks, Gina V Caldas, Nae-Chyun Chen, Haoyu Cheng, Chen-Shan Chin, William Chow, Leonardo G de Lima, Philip C Dishuck, Richard Durbin, Tatiana Dvorkina, Ian T Fiddes, Giulio Formenti, Robert S Fulton, Arkarachai Fungtammasan, Erik Garrison, Patrick G S Grady, Tina A Graves-Lindsay, Ira M Hall, Nancy F Hansen, Gabrielle A Hartley, Marina Haukness, Kerstin Howe, Michael W Hunkapiller, Chirag Jain, Miten Jain, Erich D Jarvis, Peter Kerpedjiev, Melanie Kirsche, Mikhail Kolmogorov, Jonas Korlach, Milinn Kremitzki, Heng Li, Valerie V Maduro, Tobias Marschall, Ann M McCartney, J McDaniel, Danny E Miller, James C Mullikin, Eugene W Myers, Nathan D Olson, Benedict Paten, Paul Peluso, Pavel Pevzner, David Porubsky, Tamara Potapova, Evgeny I Rogaev, Jeffrey A Rosenfeld, Steven L Salzberg, Valerie A Schneider, Fritz J Sedlazeck, Kishwar Shafin, Colin J Shew, Alaina Shumate, Ying Sims, Arian F A Smit, Daniela C Soto, Ivan Sović, Jessica M Storer, Aaron Streets, Beth A Sullivan, Francoise Thibaud-Nissen, James Torrance, Justin Wagner, Brian Walenz, Aaron M Wenger, Jonathan Wood, Chunlin Xiao, Stephanie M Yan, Alice C Young, Samantha Zarate, Urvashi Surti, Rajiv C McCoy, Megan Y Dennis, Ivan A Alexandrov, Jennifer L Gerton, Rachel J O'Neill, Winston Timp, Justin M Zook, Michael C Schatz, Evan E Eichler, Karen H Miga, Adam M Phillippy
The complete sequence of a human genome.
Science, 376(6588) 44-53 (2022)
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

Giulio Formenti✳︎#, Arang Rhie✳︎#, Brian Walenz, Francoise Thibaud-Nissen, Kishwar Shafin, Sergey Koren, Eugene W Myers, Erich D Jarvis, Adam M Phillippy
Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation.
Nat Methods, Art. No. doi: 10.1038/s41592-022-01445-y (2022)
Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.

Kang Du, Martin Pippel, Susanne Kneitz, Romain Feron, Irene da Cruz, Sylke Winkler, Brigitta Wilde, Edgar G Avila Luna, Gene Myers, Yann Guiguen, Constantino Macias Garcia, Manfred Schartl
Genome biology of the darkedged splitfin, Girardinichthys multiradiatus, and the evolution of sex chromosomes and placentation.
Genome Res, 32(3) 583-594 (2022)
Viviparity evolved independently about 150 times in vertebrates and more than 20 times in fish. Several lineages added to the protection of the embryo inside the body of the mother, the provisioning of nutrients, and physiological exchange. This often led to the evolution of a placenta. Among fish, one of the most complex systems serving the function of the placenta is the embryonal trophotaenia/ovarian luminal epithelium of the goodeid fishes. For a better understanding of this feature and others of this group of fishes, high-quality genomic resources are essential. We have sequenced the genome of the darkedged splitfin, Girardinichthys multiradiatus. The assembly is chromosome level and includes the X and Y Chromosomes. A large male-specific region on the Y was identified covering 80% of Chromosome 20, allowing some first inferences on the recent origin and a candidate male sex determining gene. Genome-wide transcriptomics uncovered sex-specific differences in brain gene expression with an enrichment for neurosteroidogenesis and testis genes in males. The expression signatures of the splitfin embryonal and maternal placenta showed overlap with homologous tissues including human placenta, the ovarian follicle epithelium of matrotrophic poeciliid fish species and the brood pouch epithelium of the seahorse. Our comparative analyses on the evolution of embryonal and maternal placenta indicate that the evolutionary novelty of maternal provisioning development repeatedly made use of genes that already had the same function in other tissues. In this way, preexisting modules are assembled and repurposed to provide the molecular changes for this novel trait.

Fabio Cunial✳︎, Olgert Denas✳︎, Djamal Belazzougui
Fast and compact matching statistics analytics.
Bioinformatics, 38(7) 1838-1845 (2022)
Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences.
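For readers unfamiliar with the term, the matching statistics of a query against a text assign to each query position i the length of the longest prefix of query[i:] that occurs somewhere in the text. The sketch below is only a brute-force illustration of that definition. It is not the method of the paper, and the function name `matching_statistics` is a hypothetical one chosen for this example.

```python
def matching_statistics(query: str, text: str) -> list[int]:
    """Return MS, where MS[i] is the length of the longest prefix of
    query[i:] that occurs as a substring of text (brute-force sketch)."""
    ms = []
    length = 0
    for i in range(len(query)):
        # If query[i-1 : i-1+k] occurs in text, then query[i : i-1+k] does too,
        # so the match at i is at least the previous match length minus one.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("banana", "ananas"))
```

Long runs of high values mark regions the two sequences share, which is the kind of local similarity the abstract refers to. Practical tools replace the substring search with an index over the text.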

Harris A Lewin, Stephen Richards, Erez Lieberman Aiden, Miguel L Allende, John M Archibald, Miklós Bálint, Katharine B Barker, Benedikt Baumgartner, Katherine Belov, Giorgio Bertorelle, Mark L Blaxter, Jing Cai, Nicolette D Caperello, Keith Carlson, Juan Carlos Castilla-Rubio, Shu-Miaw Chaw, Lei Chen, Anna K Childers, Jonathan A Coddington, Dalia A Conde, Montserrat Corominas, Keith A Crandall, Andrew J Crawford, Federica DiPalma, Richard Durbin, ThankGod E Ebenezer, Scott V Edwards, Olivier Fedrigo, Paul Flicek, Giulio Formenti, Richard A Gibbs, M Thomas P Gilbert, Melissa M Goldstein, Jennifer Marshall Graves, Henry T Greely, Ilya Grigoriev, Kevin J Hackett, Neil Hall, David Haussler, Kristofer M Helgen, Carolyn J Hogg, Sachiko Isobe, Kjetill Sigurd Jakobsen, Axel Janke, Erich D Jarvis, Warren E Johnson, Steven J. M. Jones, Elinor K Karlsson, Paul J Kersey, Jin-Hyoung Kim, W John Kress, Shigehiro Kuraku, Mara K N Lawniczak, James H Leebens-Mack, Xueyan Li, Kerstin Lindblad-Toh, Xin Liu, Jose Victor Lopez, Tomas Marques-Bonet, Sophie Mazard, Jonna A K Mazet, Camila J Mazzoni, Eugene W Myers, Rachel J O'Neill, Sadye Paez, Hyun Park, Gene E Robinson, Cristina Roquet, Oliver A Ryder, Jamal S M Sabir, H Bradley Shaffer, Timothy M Shank, Jacob S Sherkow, Pamela S Soltis, Boping Tang, Leho Tedersoo, Marcela Uliano-Silva, Kun Wang, Xiaofeng Wei, Regina Wetzer, Julia L Wilson, Xun Xu, Huanming Yang, Anne D Yoder, Guojie Zhang
The Earth BioGenome Project 2020: Starting the clock.
Proc Natl Acad Sci U.S.A., 119(4) Art. No. e2115635118 (2022)
Open Access DOI

Mara K N Lawniczak, Richard Durbin, Paul Flicek, Kerstin Lindblad-Toh, Xiaofeng Wei, John M Archibald, William J Baker, Katherine Belov, Mark L Blaxter, Tomas Marques Bonet, Anna K Childers, Jonathan A Coddington, Keith A Crandall, Andrew J Crawford, Robert P Davey, Federica Di Palma, Qi Fang, Wilfried Haerty, Neil Hall, Katharina J Hoff, Kerstin Howe, Erich D Jarvis, Warren E Johnson, Rebecca N Johnson, Paul J Kersey, Xin Liu, Jose Victor Lopez, Eugene W Myers, Olga Vinnere Pettersson, Adam M Phillippy, Monica F Poelchau, Kim D Pruitt, Arang Rhie, Juan Carlos Castilla-Rubio, Sanjeeb Kumar Sahu, Nicholas A Salmon, Pamela S Soltis, David Swarbreck, Francoise Thibaud-Nissen, Sibo Wang, Jill L Wegrzyn, Guojie Zhang, He Zhang, Harris A Lewin, Stephen Richards
Standards recommendations for the Earth BioGenome Project.
Proc Natl Acad Sci U.S.A., 119(4) Art. No. e2115639118 (2022)
Open Access DOI
A global international initiative, such as the Earth BioGenome Project (EBP), requires both agreement and coordination on standards to ensure that the collective effort generates rapid progress toward its goals. To this end, the EBP initiated five technical standards committees comprising volunteer members from the global genomics scientific community: Sample Collection and Processing, Sequencing and Assembly, Annotation, Analysis, and IT and Informatics. The current versions of the resulting standards documents are available on the EBP website, with the recognition that opportunities, technologies, and challenges may improve or change in the future, requiring flexibility for the EBP to meet its goals. Here, we describe some highlights from the proposed standards, and areas where additional challenges will need to be met.

Arne Ludwig, Martin Pippel, Gene Myers, Michael Hiller
DENTIST-using long reads for closing assembly gaps at high accuracy.
GigaScience, 11 Art. No. giab100 (2022)
Open Access DOI
Long sequencing reads allow increasing contiguity and completeness of fragmented, short-read-based genome assemblies by closing assembly gaps, ideally at high accuracy. While several gap-closing methods have been developed, these methods often close an assembly gap with sequence that does not accurately represent the true sequence.

Yuhan Wang, Mark Eddison, Greg Fleishman, Martin Weigert, Shengjin Xu, Tim Wang, Konrad Rokicki, Cristian Goina, Fredrick E Henry, Andrew L Lemire, Uwe Schmidt, Hui Yang, Karel Svoboda, Eugene W Myers, Stephan Saalfeld, Wyatt Korff, Scott M Sternson#, Paul W Tillberg#
EASI-FISH for thick tissue defines lateral hypothalamus spatio-molecular organization.
Cell, 184(26) 6361-6377 (2021)
Open Access DOI
Determining the spatial organization and morphological characteristics of molecularly defined cell types is a major bottleneck for characterizing the architecture underpinning brain function. We developed Expansion-Assisted Iterative Fluorescence In Situ Hybridization (EASI-FISH) to survey gene expression in brain tissue, as well as a turnkey computational pipeline to rapidly process large EASI-FISH image datasets. EASI-FISH was optimized for thick brain sections (300 μm) to facilitate reconstruction of spatio-molecular domains that generalize across brains. Using the EASI-FISH pipeline, we investigated the spatial distribution of dozens of molecularly defined cell types in the lateral hypothalamic area (LHA), a brain region with poorly defined anatomical organization. Mapping cell types in the LHA revealed nine spatially and molecularly defined subregions. EASI-FISH also facilitates iterative reanalysis of scRNA-seq datasets to determine marker-genes that further dissociated spatial and morphological heterogeneity. The EASI-FISH pipeline democratizes mapping molecularly defined cell types, enabling discoveries about brain organization.

Diana D Moreno Santillán, Tanya M Lama, Yocelyn T Gutierrez Guerrero, Alexis M Brown, Paul Donat, Huabin Zhao, Stephen J Rossiter, Laurel R Yohe, Joshua H Potter, Emma Teeling, Sonja Vernes, Kalina T J Davies, Eugene W Myers, Graham M Hughes, Zixia Huang, Federico Hoffmann, Angelique P Corthals, David A Ray#, Liliana M Dávalos#
Large-scale genome sampling reveals unique immunity and metabolic adaptations in bats.
Mol Ecol, 30(23) 6449-6467 (2021)
Comprising more than 1,400 species, bats possess adaptations unique among mammals including powered flight, unexpected longevity, and extraordinary immunity. Some of the molecular mechanisms underlying these unique adaptations include DNA repair, metabolism and immunity. However, analyses have been limited to a few divergent lineages, reducing the scope of inferences on gene family evolution across the Order Chiroptera. We conducted an exhaustive comparative genomic study of 37 bat species, one generated in this study, encompassing a large number of lineages, with a particular emphasis on multi-gene family evolution across immune and metabolic genes. In agreement with previous analyses, we found lineage-specific expansions of the APOBEC3 and MHC-I gene families, and loss of the proinflammatory PYHIN gene family. We inferred more than 1,000 gene losses unique to bats, including genes involved in the regulation of inflammasome pathways such as epithelial defence receptors, the natural killer gene complex and the interferon-gamma induced pathway. Gene set enrichment analyses revealed genes lost in bats are involved in defence response against pathogen-associated molecular patterns and damage-associated molecular patterns. Gene family evolution and selection analyses indicate bats have evolved fundamental functional differences compared to other mammals in both the innate and adaptive immune systems, with the potential to enhance antiviral immune response while dampening inflammatory signalling. In addition, metabolic genes have experienced repeated expansions related to convergent shifts to plant-based diets. Our analyses support the hypothesis that, in tandem with flight, ancestral bats had evolved a unique set of immune adaptations whose functional implications remain to be explored.

Lina Muhandes, Maria Chapsa, Martin Pippel, Rayk Behrendt, Yan Ge, Andreas Dahl, Buqing Yi, Alexander Dalpke, Sylke Winkler, Michael Hiller, Sebastien Boutin, Stefan Beissert, Rolf Jessberger, Padraic G Fallon, Axel Roers
Low Threshold for Cutaneous Allergen Sensitization but No Spontaneous Dermatitis or Atopy in FLG-Deficient Mice.
J Invest Dermatol, 141(11) 2611-2619 (2021)
Open Access DOI
Loss of FLG causes ichthyosis vulgaris. Reduced FLG expression compromises epidermal barrier function and is associated with atopic dermatitis, allergy, and asthma. The flaky tail mouse harbors two mutations that affect the skin barrier, Flg^ft, resulting in hypomorphic FLG expression, and Tmem79^ma, inactivating TMEM79. Mice defective only for TMEM79 featured dermatitis and systemic atopy, but also Flg^ft/ft BALB/c congenic mice developed eczema, high IgE, and spontaneous asthma, suggesting that FLG protects from atopy. In contrast, a targeted Flg-knockout mutation backcrossed to BALB/c did not result in dermatitis or atopy. To resolve this discrepancy, we generated FLG-deficient mice on pure BALB/c background by inactivating Flg in BALB/c embryos. These mice feature an ichthyosis phenotype, barrier defect, and facilitated percutaneous sensitization. However, they do not develop dermatitis or atopy. Whole-genome sequencing of the atopic Flg^ft BALB/c congenics revealed that they were homozygous for the atopy-causing Tmem79^matted mutation. In summary, we show that FLG deficiency does not cause atopy in mice, in line with lack of atopic disease in a fraction of patients with ichthyosis vulgaris carrying two Flg null alleles. However, the absence of FLG likely promotes and modulates dermatitis caused by other genetic barrier defects.

Chentao Yang✳︎, Yang Zhou✳︎, Stephanie Marcus✳︎, Giulio Formenti, Lucie A Bergeron, Zhenzhen Song, Xupeng Bi, Juraj Bergman, Marjolaine Marie C Rousselle, Chengran Zhou, Long Zhou, Yuan Deng, Miaoquan Fang, Duo Xie, Yuanzhen Zhu, Shangjin Tan, Jacquelyn Mountcastle, Bettina Haase, Jennifer Balacco, Jonathan Wood, William Chow, Arang Rhie, Martin Pippel, Margaret M Fabiszak, Sergey Koren, Olivier Fedrigo, Winrich A Freiwald, Kerstin Howe, Huanming Yang, Adam M Phillippy, Mikkel Heide Schierup, Erich D Jarvis, Guojie Zhang
Evolutionary and biomedical insights from a marmoset diploid genome assembly.
Nature, 594(7862) 227-233 (2021)
Open Access DOI
The accurate and complete assembly of both haplotype sequences of a diploid organism is essential to understanding the role of variation in genome functions, phenotypes and diseases [1]. Here, using a trio-binning approach, we present a high-quality, diploid reference genome, with both haplotypes assembled independently at the chromosome level, for the common marmoset (Callithrix jacchus), a primate model system that is widely used in biomedical research [2,3]. The full spectrum of heterozygosity between the two haplotypes involves 1.36% of the genome, much higher than the 0.13% indicated by the standard estimation based on single-nucleotide heterozygosity alone. The de novo mutation rate is 0.43 × 10^-8 per site per generation, and the paternally inherited genome acquired twice as many mutations as the maternal. Our diploid assembly enabled us to discover a recent expansion of the sex-differentiation region and unique evolutionary changes in the marmoset Y chromosome. In addition, we identified many genes with signatures of positive selection that might have contributed to the evolution of Callithrix biological features. Brain-related genes were highly conserved between marmosets and humans, although several genes experienced lineage-specific copy number variations or diversifying selection, with implications for the use of marmosets as a model system.

Shinichi Morishita#, Kazuki Ichikawa, Eugene W Myers#
Finding long tandem repeats in long noisy reads.
Bioinformatics, 37(5) 612-621 (2021)
Open Access DOI
Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. However, new long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, although these long reads have high error rates, of 10-20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (<1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time.

Giulio Formenti#, Arang Rhie, Jennifer Balacco, Bettina Haase, Jacquelyn Mountcastle, Olivier Fedrigo, Samara Brown, Marco Rosario Capodiferro, Farooq O Al-Ajli, Roberto Ambrosini, Peter Houde, Sergey Koren, Karen Oliver, Michelle Smith, Jason Skelton, Emma Betteridge, Jale Dolucan, Craig Corton, Iliana Bista, James Torrance, Alan Tracey, Jonathan Wood, Marcela Uliano-Silva, Kerstin Howe, Shane A McCarthy, Sylke Winkler, Woori Kwak, Jonas Korlach, Arkarachai Fungtammasan, Daniel Fordham, Vania Costa, Simon Mayes, Matteo Chiara, David S Horner, Eugene W Myers, Richard Durbin, Alessandro Achilli, Edward L Braun, Adam M Phillippy, Erich D Jarvis#
Complete vertebrate mitogenomes reveal widespread repeats and gene duplications.
Genome Biol, 22(1) Art. No. 120 (2021)
Open Access DOI
Modern sequencing technologies should make the assembly of the relatively small mitochondrial genomes an easy undertaking. However, few tools exist that address mitochondrial assembly directly.

Arang Rhie, Shane A McCarthy, Olivier Fedrigo, Joana Damas, Giulio Formenti, Sergey Koren, Marcela Uliano-Silva, William Chow, Arkarachai Fungtammasan, Juwan Kim, Chul Lee, Byung June Ko, Mark Chaisson, Gregory L Gedman, Lindsey J Cantin, Francoise Thibaud-Nissen, Leanne Haggerty, Iliana Bista, Michelle Smith, Bettina Haase, Jacquelyn Mountcastle, Sylke Winkler, Sadye Paez, Jonathon Howard, Sonja Vernes, Tanya M Lama, Frank Grutzner, Wesley C Warren, Christopher N Balakrishnan, Dave Burt, Julia M George, Matthew T Biegler, David Iorns, Andrew Digby, Daryl Eason, Bruce Robertson, Taylor Edwards, Mark Wilkinson, George Turner, Axel Meyer, Andreas F Kautt, Paolo Franchini, H William Detrich, Hannes Svardal, Maximilian Wagner, Gavin J P Naylor, Martin Pippel, Milan Malinsky, Mark Mooney, Maria Simbirsky, Brett T Hannigan, Trevor Pesout, Marlys Houck, Ann Misuraca, Sarah B Kingan, Richard J Hall, Zev Kronenberg, Ivan Sović, Christopher Dunn, Zemin Ning, Alex R Hastie, Joyce Lee, Siddarth Selvaraj, Richard E Green, Nicholas H Putnam, Ivo Gut, Jay Ghurye, Erik Garrison, Ying Sims, Joanna Collins, Sarah Pelan, James Torrance, Alan Tracey, Jonathan Wood, Robel E Dagnew, Dengfeng Guan, Sarah E London, David F Clayton, Claudio V Mello, Samantha R Friedrich, Peter V Lovell, Ekaterina Osipova, Farooq O Al-Ajli, Simona Secomandi, Heebal Kim, Constantina Theofanopoulou, Michael Hiller, Yang Zhou, Robert S Harris, Kateryna D Makova, Paul Medvedev, Jinna Hoffman, Patrick Masterson, Karen Clark, Fergal Martin, Kerstin Howe#, Paul Flicek, Brian Walenz, Woori Kwak, Hiram Clawson, Mark Diekhans, Luis Nassar, Benedict Paten, Robert H S Kraus, Andrew J Crawford, M Thomas P Gilbert, Guojie Zhang, Byrappa Venkatesh, Robert F Murphy, Klaus-Peter Koepfli, Beth Shapiro, Warren E Johnson, Federica Di Palma, Tomas Marques-Bonet, Emma Teeling, Tandy Warnow, Jennifer Marshall Graves, Oliver A Ryder, David Haussler, Stephen J O'Brien, Jonas Korlach, Harris A Lewin, Kerstin Howe#, Eugene W Myers#, Richard Durbin#, Adam M Phillippy#, Erich D Jarvis#
Towards complete and error-free genome assemblies of all vertebrate species.
Nature, 592(7856) 737-746 (2021)
Open Access DOI
High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1-4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

Peiwen Xiong, C Darrin Hulsey, Carmelo Fruciano, Wai Y Wong, Alexander Nater, Andreas F Kautt, Oleg Simakov, Martin Pippel, Shigehiro Kuraku, Axel Meyer, Paolo Franchini
The comparative genomic landscape of adaptive radiation in crater lake cichlid fishes.
Mol Ecol, 30(4) 955-972 (2020)
Open Access DOI
Factors ranging from ecological opportunity to genome composition might explain why only some lineages form adaptive radiations. While being rare, particular systems can provide natural experiments within an identical ecological setting where species numbers and phenotypic divergence in two closely related lineages are notably different. We investigated one such natural experiment using two de novo assembled and 40 resequenced genomes and asked why two closely related Neotropical cichlid fish lineages, the Amphilophus citrinellus species complex (Midas cichlids; radiating) and Archocentrus centrarchus (Flyer cichlid; nonradiating), have resulted in such disparate evolutionary outcomes. Although both lineages inhabit many of the same Nicaraguan lakes, whole-genome inferred demography suggests that priority effects are not likely to be the cause of the dissimilarities. Also, genome-wide levels of selection, transposable element dynamics, gene family expansion, major chromosomal rearrangements and the number of genes under positive selection were not markedly different between the two lineages. To more finely investigate particular subsets of the genome that have undergone adaptive divergence in Midas cichlids, we also examined if there was evidence for 'molecular pre-adaptation' in regions identified by QTL mapping of repeatedly diverging adaptive traits. Although most of our analyses failed to pinpoint substantial genomic differences, we did identify functional categories containing many genes under positive selection that provide candidates for future studies on the propensity of Midas cichlids to radiate. Our results point to a disproportionate role of local, rather than genome-wide factors underlying the propensity for these cichlid fishes to adaptively radiate.

Yuta Suzuki#, Eugene W Myers, Shinichi Morishita#
Rapid and ongoing evolution of repetitive sequence structures in human centromeres.
Sci Adv, 6(50) Art. No. eabd9230 (2020)
Open Access DOI
Our understanding of centromere sequence variation across human populations is limited by its extremely long nested repeat structures called higher-order repeats that are challenging to sequence. Here, we analyzed chromosomes 11, 17, and X using long-read sequencing data for 36 individuals from diverse populations including a Han Chinese trio and 21 Japanese. We revealed substantial structural diversity with many previously unidentified variant higher-order repeats specific to individuals characterizing rapid, haplotype-specific evolution of human centromeric arrays, while frequent single-nucleotide variants are largely conserved. We found a characteristic pattern shared among prevalent variants in human and chimpanzee. Our findings pave the way for studying sequence evolution in human and primate centromeres.

Andreas F Kautt, Claudius F Kratochwil, Alexander Nater, Gonzalo Machado-Schiaffino, Melisa Olave, Frederico Henning, Julián Torres-Dowdall, Andreas Härer, C Darrin Hulsey, Paolo Franchini, Martin Pippel, Eugene W Myers, Axel Meyer
Contrasting signatures of genomic divergence during sympatric speciation.
Nature, 588(7836) 106-111 (2020)
Open Access DOI
The transition from 'well-marked varieties' of a single species into 'well-defined species', especially in the absence of geographic barriers to gene flow (sympatric speciation), has puzzled evolutionary biologists ever since Darwin [1,2]. Gene flow counteracts the buildup of genome-wide differentiation, which is a hallmark of speciation and increases the likelihood of the evolution of irreversible reproductive barriers (incompatibilities) that complete the speciation process [3]. Theory predicts that the genetic architecture of divergently selected traits can influence whether sympatric speciation occurs [4], but empirical tests of this theory are scant because comprehensive data are difficult to collect and synthesize across species, owing to their unique biologies and evolutionary histories [5]. Here, within a young species complex of neotropical cichlid fishes (Amphilophus spp.), we analysed genomic divergence among populations and species. By generating a new genome assembly and re-sequencing 453 genomes, we uncovered the genetic architecture of traits that have been suggested to be important for divergence. Species that differ in monogenic or oligogenic traits that affect ecological performance and/or mate choice show remarkably localized genomic differentiation. By contrast, differentiation among species that have diverged in polygenic traits is genomically widespread and much higher overall, consistent with the evolution of effective and stable genome-wide barriers to gene flow. Thus, we conclude that simple trait architectures are not always as conducive to speciation with gene flow as previously suggested, whereas polygenic architectures can promote rapid and stable speciation in sympatry.

Debayan Saha, Uwe Schmidt, Qinrong Zhang, Aurelien Barbotin, Qi Hu, Na Ji, Martin J. Booth, Martin Weigert#, Eugene W Myers#
Practical sensorless aberration estimation for 3D microscopy with deep learning.
Opt Express, 28(20) 29044-29053 (2020)
Open Access DOI
Estimation of optical aberrations from volumetric intensity images is a key step in sensorless adaptive optics for 3D microscopy. Recent approaches based on deep learning promise accurate results at fast processing speeds. However, collecting ground truth microscopy data for training the network is typically very difficult or even impossible thereby limiting this approach in practice. Here, we demonstrate that neural networks trained only on simulated data yield accurate predictions for real experimental images. We validate our approach on simulated and experimental datasets acquired with two different microscopy modalities and also compare the results to non-learned methods. Additionally, we study the predictability of individual aberrations with respect to their data requirements and find that the symmetry of the wavefront plays a crucial role. Finally, we make our implementation freely available as open source software in Python.

Gautam Dey, Sian Culley, Scott Curran, Uwe Schmidt, Ricardo Henriques, Wanda Kukulski, Buzz Baum
Closed mitosis requires local disassembly of the nuclear envelope.
Nature, 585(7823) 119-123 (2020)
At the end of mitosis, eukaryotic cells must segregate the two copies of their replicated genome into two new nuclear compartments1. They do this either by first dismantling and later reassembling the nuclear envelope in an 'open mitosis' or by reshaping an intact nucleus and then dividing it into two in a 'closed mitosis'2,3. Mitosis has been studied in a wide variety of eukaryotes for more than a century4, but how the double membrane of the nuclear envelope is split into two at the end of a closed mitosis without compromising the impermeability of the nuclear compartment remains unknown5. Here, using the fission yeast Schizosaccharomyces pombe (a classical model for closed mitosis5), genetics, live-cell imaging and electron tomography, we show that nuclear fission is achieved via local disassembly of nuclear pores within the narrow bridge that links segregating daughter nuclei. In doing so, we identify the protein Les1, which is localized to the inner nuclear envelope and restricts the process of local nuclear envelope breakdown to the bridge midzone to prevent the leakage of material from daughter nuclei. The mechanism of local nuclear envelope breakdown in a closed mitosis therefore closely mirrors nuclear envelope breakdown in open mitosis3, revealing an unexpectedly high conservation of nuclear remodelling mechanisms across diverse eukaryotes.

David Jebb✳︎, Zixia Huang✳︎, Martin Pippel✳︎, Graham M Hughes, Ksenia Lavrichenko, Paolo Devanna, Sylke Winkler, Lars S Jermiin, Emilia C Skirmuntt, Aris Katzourakis, Lucy Burkitt-Gray, David A Ray, Kevin F. Sullivan, Juliana G. Roscito, Bogdan Kirilenko, Liliana M Dávalos, Angelique P Corthals, Megan L Power, Gareth Jones, Roger D Ransome, Dina K N Dechmann, Andrea G Locatelli, Sébastien J Puechmaille, Olivier Fedrigo, Erich D Jarvis, Michael Hiller#, Sonja Vernes#, Eugene W Myers#, Emma Teeling#
Six reference-quality genomes reveal evolution of bat adaptations.
Nature, 583(7817) 578-584 (2020)
Open Access DOI
Bats possess extraordinary adaptations, including flight, echolocation, extreme longevity and unique immunity. High-quality genomes are crucial for understanding the molecular basis and evolution of these traits. Here we incorporated long-read sequencing and state-of-the-art scaffolding protocols1 to generate, to our knowledge, the first reference-quality genomes of six bat species (Rhinolophus ferrumequinum, Rousettus aegyptiacus, Phyllostomus discolor, Myotis myotis, Pipistrellus kuhlii and Molossus molossus). We integrated gene projections from our 'Tool to infer Orthologs from Genome Alignments' (TOGA) software with de novo and homology gene predictions as well as short- and long-read transcriptomics to generate highly complete gene annotations. To resolve the phylogenetic position of bats within Laurasiatheria, we applied several phylogenetic methods to comprehensive sets of orthologous protein-coding and noncoding regions of the genome, and identified a basal origin for bats within Scrotifera. Our genome-wide screens revealed positive selection on hearing-related genes in the ancestral branch of bats, which is indicative of laryngeal echolocation being an ancestral trait in this clade. We found selection and loss of immunity-related genes (including pro-inflammatory NF-κB regulators) and expansions of anti-viral APOBEC3 genes, which highlights molecular mechanisms that may contribute to the exceptional immunity of bats. Genomic integrations of diverse viruses provide a genomic record of historical tolerance to viral infection in bats. Finally, we found and experimentally validated bat-specific variation in microRNAs, which may regulate bat-specific gene-expression programs. Our reference-quality bat genomes provide the resources required to uncover and validate the genomic basis of adaptations of bats, and stimulate new avenues of research that are directly relevant to human health and disease1.

Andre Arashiro Pulschen, Delyan R Mutavchiev, Sian Culley, Kim Nadine Sebastian, Jacques Roubinet, Marc Roubinet, Gabriel Tarrason Risa, Marleen van Wolferen, Chantal Roubinet, Uwe Schmidt, Gautam Dey, Sonja-Verena Albers, Ricardo Henriques, Buzz Baum
Live Imaging of a Hyperthermophilic Archaeon Reveals Distinct Roles for Two ESCRT-III Homologs in Ensuring a Robust and Symmetric Division.
Curr Biol, 30(14) 2852-2859 (2020)
Open Access DOI
Live-cell imaging has revolutionized our understanding of dynamic cellular processes in bacteria and eukaryotes. Although similar techniques have been applied to the study of halophilic archaea [1-5], our ability to explore the cell biology of thermophilic archaea has been limited by the technical challenges of imaging at high temperatures. Sulfolobus are the most intensively studied members of TACK archaea and have well-established molecular genetics [6-9]. Additionally, studies using Sulfolobus were among the first to reveal striking similarities between the cell biology of eukaryotes and archaea [10-15]. However, to date, it has not been possible to image Sulfolobus cells as they grow and divide. Here, we report the construction of the Sulfoscope, a heated chamber on an inverted fluorescent microscope that enables live-cell imaging of thermophiles. By using thermostable fluorescent probes together with this system, we were able to image Sulfolobus acidocaldarius cells live to reveal tight coupling between changes in DNA condensation, segregation, and cell division. Furthermore, by imaging deletion mutants, we observed functional differences between the two ESCRT-III proteins implicated in cytokinesis, CdvB1 and CdvB2. The deletion of cdvB1 compromised cell division, causing occasional division failures, whereas the ΔcdvB2 exhibited a profound loss of division symmetry, generating daughter cells that vary widely in size and eventually generating ghost cells. These data indicate that DNA separation and cytokinesis are coordinated in Sulfolobus, as is the case in eukaryotes, and that two contractile ESCRT-III polymers perform distinct roles to ensure that Sulfolobus cells undergo a robust and symmetrical division.

Coleman Broaddus, Alexander Krull, Martin Weigert, Uwe Schmidt, Gene Myers
Removing Structured Noise with Self-Supervised Blind-Spot Networks.
In: IEEE ISBI 2020 : International Conference on Biomedical Imaging : April 2-7, 2020, Iowa City, Iowa, USA : symposium proceeding (2020) IEEE International Symposium on Biomedical Imaging, Piscataway, N.J., IEEE (2020), 159-163
Removal of noise from fluorescence microscopy images is an important first step in many biological analysis pipelines. Current state-of-the-art supervised methods employ convolutional neural networks that are trained with clean (ground-truth) images. Recently, it was shown that self-supervised image denoising with blind spot networks achieves excellent performance even when ground-truth images are not available, as is common in fluorescence microscopy. However, these approaches, e.g. Noise2Void (N2V), generally assume pixel-wise independent noise, thus limiting their applicability in situations where spatially correlated (structured) noise is present. To overcome this limitation, we present Structured Noise2Void (STRUCTN2V), a generalization of blind spot networks that enables removal of structured noise without requiring an explicit noise model or ground truth data. Specifically, we propose to use an extended blind mask (rather than a single pixel/blind spot), whose shape is adapted to the structure of the noise. We evaluate our approach on two real datasets and show that STRUCTN2V considerably improves the removal of structured noise compared to existing standard and blind-spot based techniques.

Theresa Suckert✳︎, Johannes Müller✳︎, Elke Beyreuther✳︎, Behnam Azadegan, Anja Brüggemann, Rebecca Bütof, Antje Dietrich, Malte Gotz, Robert Haase, Michael Schürer, Falk Tillner, Cläre von Neubeck, Mechthild Krause, Armin Lühr
High-precision image-guided proton irradiation of mouse brain sub-volumes.
Radiother Oncol, 146 205-212 (2020)
Proton radiotherapy offers the potential to reduce normal tissue toxicity. However, clinical safety margins, range uncertainties, and varying relative biological effectiveness (RBE) may result in a critical dose in tumor-surrounding normal tissue. To assess potential adverse effects in preclinical studies, image-guided proton mouse brain irradiation and analysis of DNA damage repair was established.

Kira Vinogradova, Alexandr Dibrov, Gene Myers
Towards Interpretable Semantic Segmentation via Gradient-Weighted Class Activation Mapping.
In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, the Thirty-Second Innovative Applications of Artificial Intelligence Conference, the Tenth AAAI Symposium on Educational Advances in Artificial Intelligence : New York, New York, USA, February 7-12, 2020 : volume 34 / sponsored by the Association for the Advancement of Artificial Intelligence (2020) (AAAI Conference on Artificial Intelligence ; 34), Palo Alto, California USA, AAAI Press (2020), 13943-13944
Convolutional neural networks have become state-of-the-art in a wide range of image recognition tasks. The interpretation of their predictions, however, is an active area of research. Whereas various interpretation methods have been suggested for image classification, the interpretation of image segmentation still remains largely unexplored. To that end, we propose SEG-GRAD-CAM, a gradient-based method for interpreting semantic segmentation. Our method is an extension of the widely-used Grad-CAM method, applied locally to produce heatmaps showing the relevance of individual pixels for semantic segmentation.

Robert Haase✳︎, Loic Royer✳︎, Peter Steinbach, Deborah Schmidt, Alexandr Dibrov, Uwe Schmidt, Martin Weigert, Nicola Maghelli, Pavel Tomancak, Florian Jug, Eugene W Myers
CLIJ: GPU-accelerated image processing for everyone.
Nat Methods, 17(1) 5-6 (2020)

Martin Pippel✳︎, David Jebb✳︎, Franziska Patzold, Sylke Winkler, Heiko Vogel, Gene Myers, Michael Hiller#, Anna K Hundsdoerfer#
A highly contiguous genome assembly of the bat hawkmoth Hyles vespertilio (Lepidoptera: Sphingidae).
GigaScience, 9(1) Art. No. giaa001 (2020)
Open Access DOI
Adapted to different ecological niches, moth species belonging to the Hyles genus exhibit a spectacular diversity of larval color patterns. These species diverged ∼7.5 million years ago, making this rather young genus an interesting system to study a wide range of questions including the process of speciation, ecological adaptation, and adaptive radiation.

Kaushikaram Subramanian, Martin Weigert, Oliver Borsch, Heike Petzold, Alfonso Garcia-Ulloa, Eugene W Myers, Marius Ader, Irina Solovei, Moritz Kreysing
Rod nuclear architecture determines contrast transmission of the retina and behavioral sensitivity in mice.
Elife, 8 Art. No. e49542 (2019)
Open Access PDF DOI
Rod photoreceptors of nocturnal mammals display a striking inversion of nuclear architecture, which has been proposed as an evolutionary adaptation to dark environments. However, the nature of visual benefits and the underlying mechanisms remains unclear. It is widely assumed that improvements in nocturnal vision would depend on maximization of photon capture at the expense of image detail. Here we show that retinal optical quality improves 2-fold during terminal development, and that this enhancement is caused by nuclear inversion. We further demonstrate that improved retinal contrast transmission, rather than photon-budget or resolution, enhances scotopic contrast sensitivity by 18-27%, and improves motion detection capabilities up to 10-fold in dim environments. Our findings therefore add functional significance to a prominent exception of nuclear organization and establish retinal contrast transmission as a decisive determinant of mammalian visual perception.

Hanh Thi-Kim Vu, Sarah Mansour, Michael Kücken, Corinna Blasse, Claire Basquin, Juliette Azimzadeh, Eugene W Myers, Lutz Brusch#, Jochen Rink#
Dynamic Polarization of the Multiciliated Planarian Epidermis between Body Plan Landmarks.
Dev Cell, 51(4) 526-542 (2019)
Polarity is a universal design principle of biological systems that manifests at all organizational scales, yet its coordination across scales remains poorly understood. Here, we make use of the extreme anatomical plasticity of planarian flatworms to probe the interplay between global body plan polarity and local cell polarity. Our quantitative analysis of ciliary rootlet orientation in the epidermis reveals a dynamic polarity field with head and tail as independent determinants of anteroposterior (A/P) polarization and the body margin as determinant of mediolateral (M/L) polarization. Mathematical modeling rationalizes the global polarity field and its response to experimental manipulations as superposition of separate A/P and M/L fields, and we identify the core PCP and Ft/Ds pathways as their molecular mediators. Overall, our study establishes a framework for the alignment of cellular polarity vectors relative to planarian body plan landmarks and establishes the core PCP and Ft/Ds pathways as evolutionarily conserved 2D-polarization module.

Christoph Stritt, Michele Wyler, Elena L Gimmi, Martin Pippel, Anne C Roulin
Diversity, dynamics and effects of long terminal repeat retrotransposons in the model grass Brachypodium distachyon.
New Phytol, 227(6) 1736-1748 (2019)
Open Access DOI
Transposable elements (TEs) are the main reason for the high plasticity of plant genomes, where they occur as communities of diverse evolutionary lineages. Because research has typically focused on single abundant families or summarized TEs at a coarse taxonomic level, our knowledge about how these lineages differ in their effects on genome evolution is still rudimentary. Here we investigate the community composition and dynamics of 32 long terminal repeat retrotransposon (LTR-RT) families in the 272-Mb genome of the Mediterranean grass Brachypodium distachyon. We find that much of the recent transpositional activity in the B. distachyon genome is due to centromeric Gypsy families and Copia elements belonging to the Angela lineage. With a half-life as low as 66 kyr, the latter are the most dynamic part of the genome and an important source of within-species polymorphisms. Second, GC-rich Gypsy elements of the Retand lineage are the most abundant TEs in the genome. Their presence explains >20% of the genome-wide variation in GC content and is associated with higher methylation levels. Our study shows how individual TE lineages change the genetic and epigenetic constitution of the host beyond simple changes in genome size.

Aaron M Wenger, Paul Peluso, William J Rowell, Pi-Chuan Chang, Richard J Hall, Gregory T Concepcion, Jana Ebler, Arkarachai Fungtammasan, Alexey Kolesnikov, Nathan D Olson, Armin Töpfer, Michael Alonge, Medhat Mahmoud, Yufeng Qian, Chen-Shan Chin, Adam M Phillippy, Michael C Schatz, Gene Myers, Mark A DePristo, Jue Ruan, Tobias Marschall, Fritz J Sedlazeck, Justin M Zook, Heng Li, Sergey Koren, Andrew Carroll, David R Rank, Michael W Hunkapiller
Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome.
Nat Biotechnol, 37(10) 1155-1162 (2019)
The DNA sequencing technologies in use today produce either highly accurate short reads or less-accurate long reads. We report the optimization of circular consensus sequencing (CCS) to improve the accuracy of single-molecule real-time (SMRT) sequencing (PacBio) and generate highly accurate (99.8%) long high-fidelity (HiFi) reads with an average length of 13.5 kilobases (kb). We applied our approach to sequence the well-characterized human HG002/NA24385 genome and obtained precision and recall rates of at least 99.91% for single-nucleotide variants (SNVs), 95.98% for insertions and deletions <50 bp (indels) and 95.99% for structural variants. Our CCS method matches or exceeds the ability of short-read sequencing to detect small variants and structural variants. We estimate that 2,434 discordances are correctable mistakes in the 'genome in a bottle' (GIAB) benchmark set. Nearly all (99.64%) variants can be phased into haplotypes, further improving variant detection. De novo genome assembly using CCS reads alone produced a contiguous and accurate genome with a contig N50 of >15 megabases (Mb) and concordance of 99.997%, substantially outperforming assembly with less-accurate long reads.

Falk Zakrzewski, Walter de Back, Martin Weigert, Torsten Wenke, Silke Zeugner, Robert Mantey, Christian Sperling, Katrin Friedrich, Ingo Roeder, Daniela Aust, Gustavo Baretton, Pia Hönscheid
Automated detection of the HER2 gene amplification status in Fluorescence in situ hybridization images for the diagnostics of cancer tissues.
Sci Rep, 9(1) Art. No. 8231 (2019)
Open Access DOI
The human epidermal growth factor receptor 2 (HER2) gene amplification status is a crucial marker for evaluating clinical therapies of breast or gastric cancer. We propose a deep learning-based pipeline for the detection, localization and classification of interphase nuclei depending on their HER2 gene amplification state in Fluorescence in situ hybridization (FISH) images. Our pipeline combines two RetinaNet-based object localization networks which are trained (1) to detect and classify interphase nuclei into the distinct classes normal, low-grade and high-grade and (2) to detect and classify FISH signals into the distinct classes HER2 or centromere of chromosome 17 (CEN17). By independently classifying each nucleus twice, the two-step pipeline provides both robustness and interpretability for the automated detection of the HER2 amplification status. The accuracy of our deep learning-based pipeline is on par with that of three pathologists, and a set of 57 validation images containing several hundred nuclei is accurately classified. The automatic pipeline is a first step towards assisting pathologists in evaluating the HER2 status of tumors using FISH images, analyzing FISH images in retrospective studies, and optimizing the documentation of each tumor sample by automatically annotating and reporting the HER2 gene amplification specificities.

German Tischler-Höhle
Haplotype and repeat separation in long reads.
In: Computational Intelligence Methods for Bioinformatics and Biostatistics : 14th International Meeting, CIBB 2017, Cagliari, Italy, September 7-9, 2017, Revised Selected Papers (2019)(Eds.) Massimo Bartoletti (Lecture Notes in Computer Science ; 10834), Cham, Springer International Publishing (2019), 103-114

Juliana G. Roscito, Katrin Sameith, Martin Pippel, Kees-Jan Francoijs, Sylke Winkler, Andreas Dahl, Georg Papoutsoglou, Gene Myers, Michael Hiller
The genome of the tegu lizard Salvator merianae: combining Illumina, PacBio, and optical mapping data to generate a highly contiguous assembly.
GigaScience, 7(12) Art. No. giy141 (2018)
Open Access PDF DOI
Reptiles are a species-rich group with great phenotypic and life history diversity but are highly underrepresented among the vertebrate species with sequenced genomes.

Martin Weigert, Uwe Schmidt, Tobias Boothe, Andreas Müller, Alexandr Dibrov, Akanksha Jain, Benjamin Wilhelm, Deborah Schmidt, Coleman Broaddus, Sian Culley, Mauricio Rocha-Martins, Fabián Segovia-Miranda, Caren Norden, Ricardo Henriques, Marino Zerial, Michele Solimena, Jochen Rink, Pavel Tomancak, Loic Royer, Florian Jug, Eugene W Myers
Content-aware image restoration: pushing the limits of fluorescence microscopy.
Nat Methods, 15(12) 1090-1097 (2018)
Fluorescence microscopy is a key driver of discoveries in the life sciences, with observable phenomena being limited by the optics of the microscope, the chemistry of the fluorophores, and the maximum photon exposure tolerated by the sample. These limits necessitate trade-offs between imaging speed, spatial resolution, light exposure, and imaging depth. In this work we show how content-aware image restoration based on deep learning extends the range of biological phenomena observable by microscopy. We demonstrate on eight concrete examples how microscopy images can be restored even if 60-fold fewer photons are used during acquisition, how near isotropic resolution can be achieved with up to tenfold under-sampling along the axial direction, and how tubular and granular structures smaller than the diffraction limit can be resolved at 20-times-higher frame rates compared to state-of-the-art methods. All developed image restoration methods are freely available as open source software in Python, FIJI, and KNIME.

Liyuan Sui, Silvanus Alt, Martin Weigert, Natalie Dye, Suzanne Eaton, Florian Jug, Eugene W Myers, Frank Jülicher, Guillaume Salbreux, Christian Dahmann
Differential lateral and basal tension drive folding of Drosophila wing discs through two distinct mechanisms.
Nat Commun, 9(1) Art. No. 4620 (2018)
Open Access DOI
Epithelial folding transforms simple sheets of cells into complex three-dimensional tissues and organs during animal development. Epithelial folding has mainly been attributed to mechanical forces generated by an apically localized actomyosin network; however, contributions of forces generated at basal and lateral cell surfaces remain largely unknown. Here we show that a local decrease of basal tension and an increased lateral tension, but not apical constriction, drive the formation of two neighboring folds in developing Drosophila wing imaginal discs. Spatially defined reduction of extracellular matrix density results in local decrease of basal tension in the first fold; fluctuations in F-actin lead to increased lateral tension in the second fold. Simulations using a 3D vertex model show that the two distinct mechanisms can drive epithelial folding. Our combination of lateral and basal tension measurements with a mechanical tissue model reveals how simple modulations of surface and edge tension drive complex three-dimensional morphological changes.

Uwe Schmidt✳︎, Martin Weigert✳︎, Coleman Broaddus, Gene Myers
Cell Detection with Star-Convex Polygons.
In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 : 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II (2018)(Eds.) Alejandro F. Frangi (Lecture Notes in Computer Science ; 11071), Cham, Springer International Publishing (2018), 265-273

Djamal Belazzougui, Fabio Cunial, Olgert Denas
Fast matching statistics in small space.
In: 17th International Symposium on Experimental Algorithms (SEA 2018) (2018) Ch. 17(Eds.) Gianlorenzo D'Angelo (LIPIcs - Leibniz International Proceedings in Informatics ; Volume 103), Wadern, Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH (2018), 1701-1714

Martin Weigert, Kaushikaram Subramanian, Sebastian T. Bundschuh, Eugene W Myers, Moritz Kreysing
Biobeam - Multiplexed wave-optical simulations of light-sheet microscopy.
PLoS Comput Biol, 14(4) Art. No. e1006079 (2018)
Open Access PDF DOI
Sample-induced image-degradation remains an intricate wave-optical problem in light-sheet microscopy. Here we present biobeam, an open-source software package that enables simulation of operational light-sheet microscopes by combining data from 10^5-10^6 multiplexed and GPU-accelerated point-spread-function calculations. The wave-optical nature of these simulations leads to the faithful reproduction of spatially varying aberrations, diffraction artifacts, geometric image distortions, adaptive optics, and emergent wave-optical phenomena, and renders image-formation in light-sheet microscopy computationally tractable.

Emma Teeling, Sonja Vernes, Liliana M Dávalos, David A Ray, M Thomas P Gilbert, Eugene Myers, Bat1K Consortium
Bat Biology, Genomes, and the Bat1K Project: To Generate Chromosome-Level Genomes for All Living Bat Species.
Annu Rev Anim Biosci, 6 23-46 (2018)
Bats are unique among mammals, possessing some of the rarest mammalian adaptations, including true self-powered flight, laryngeal echolocation, exceptional longevity, unique immunity, contracted genomes, and vocal learning. They provide key ecosystem services, pollinating tropical plants, dispersing seeds, and controlling insect pest populations, thus driving healthy ecosystems. They account for more than 20% of all living mammalian diversity, and their crown-group evolutionary history dates back to the Eocene. Despite their great numbers and diversity, many species are threatened and endangered. Here we announce Bat1K, an initiative to sequence the genomes of all living bat species (n∼1,300) to chromosome-level assembly. The Bat1K genome consortium unites bat biologists (>148 members as of writing), computational scientists, conservation organizations, genome technologists, and any interested individuals committed to a better understanding of the genetic and evolutionary mechanisms that underlie the unique adaptations of bats. Our aim is to catalog the unique genetic diversity present in all living bats to better understand the molecular basis of their unique adaptations; uncover their evolutionary history; link genotype with phenotype; and ultimately better understand, promote, and conserve bats. Here we review the unique adaptations of bats and highlight how chromosome-level genome assemblies can uncover the molecular basis of these traits. We present a novel sequencing and assembly strategy and review the striking societal and scientific benefits that will result from the Bat1K initiative.

Markus Grohme✳︎, Siegfried Schloissnig✳︎, Andrei Rozanski, Martin Pippel, George Robert Young, Sylke Winkler, Holger Brandl, Ian Henry, Andreas Dahl, Sean Powell, Michael Hiller, Eugene Myers, Jochen Rink
The genome of Schmidtea mediterranea and the evolution of core cellular mechanisms.
Nature, 554(7690) 56-61 (2018)
Open Access PDF DOI
The planarian Schmidtea mediterranea is an important model for stem cell research and regeneration, but adequate genome resources for this species have been lacking. Here we report a highly contiguous genome assembly of S. mediterranea, using long-read sequencing and a de novo assembler (MARVEL) enhanced for low-complexity reads. The S. mediterranea genome is highly polymorphic and repetitive, and harbours a novel class of giant retroelements. Furthermore, the genome assembly lacks a number of highly conserved genes, including critical components of the mitotic spindle assembly checkpoint, but planarians maintain checkpoint function. Our genome assembly provides a key model system resource that will be useful for studying regeneration and the evolutionary plasticity of core cell biological mechanisms.

Sergej Nowoshilow, Siegfried Schloissnig, Jifeng Fei, Andreas Dahl, Andy W C Pang, Martin Pippel, Sylke Winkler, Alex R Hastie, George Young, Juliana G. Roscito, Francisco Falcon, Dunja Knapp, Sean Powell, Alfredo Cruz, Han Cao, Bianca Habermann, Michael Hiller, Elly M. Tanaka, Eugene W Myers
The axolotl genome and the evolution of key tissue formation regulators.
Nature, 554(7690) 50-55 (2018)
Open Access PDF DOI
Salamanders serve as important tetrapod models for developmental, regeneration and evolutionary studies. An extensive molecular toolkit makes the Mexican axolotl (Ambystoma mexicanum) a key representative salamander for molecular investigations. Here we report the sequencing and assembly of the 32-gigabase-pair axolotl genome using an approach that combined long-read sequencing, optical mapping and development of a new genome assembler (MARVEL). We observed a size expansion of introns and intergenic regions, largely attributable to multiplication of long terminal repeat retroelements. We provide evidence that intron size in developmental genes is under constraint and that species-restricted genes may contribute to limb regeneration. The axolotl genome assembly does not contain the essential developmental gene Pax3. However, mutation of the axolotl Pax3 paralogue Pax7 resulted in an axolotl phenotype that was similar to those seen in Pax3-/- and Pax7-/- mutant mice. The axolotl genome provides a rich biological resource for developmental and evolutionary studies.

Matthias Kaiser, Florian Jug, Thomas Julou, Siddharth Deshpande, Thomas Pfohl, Olin Silander#, Gene Myers#, Erik van Nimwegen#
Monitoring single-cell gene regulation under dynamically controllable conditions with integrated microfluidics and software.
Nat Commun, 9(1) Art. No. 212 (2018)
Open Access DOI
Much is still not understood about how gene regulatory interactions control cell fate decisions in single cells, in part due to the difficulty of directly observing gene regulatory processes in vivo. We introduce here a novel integrated setup consisting of a microfluidic chip and accompanying analysis software that enable long-term quantitative tracking of growth and gene expression in single cells. The dual-input Mother Machine (DIMM) chip enables controlled and continuous variation of external conditions, allowing direct observation of gene regulatory responses to changing conditions in single cells. The Mother Machine Analyzer (MoMA) software achieves unprecedented accuracy in segmenting and tracking cells, and streamlines high-throughput curation with a novel leveraged editing procedure. We demonstrate the power of the method by uncovering several novel features of an iconic gene regulatory program: the induction of Escherichia coli's lac operon in response to a switch from glucose to lactose.

Djamal Belazzougui, Fabio Cunial
A Framework for Space-Efficient String Kernels.
Algorithmica, 79(3) 857-883 (2017)
String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows-Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in O(n log sigma) bits of space in addition to the input, where sigma is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log sigma + o(n log sigma) bits of space, and that can be learnt in randomized O(n) time using O(n log sigma) bits of space in addition to the input. Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m + o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.

Jacob Kruse, Carsten Rother, Uwe Schmidt
Learning to Push the Limits of Efficient FFT-based Image Deconvolution.
In: 2017 IEEE International Conference on Computer Vision : ICCV 2017 : proceedings : 22-29 October 2017, Venice, Italy (2017), Piscataway, N.J., IEEE (2017), 4596-4604
This work addresses the task of non-blind image deconvolution. Motivated to keep up with the constant increase in image size, with megapixel images becoming the norm, we aim at pushing the limits of efficient FFT-based techniques. Based on an analysis of traditional and more recent learning-based methods, we generalize existing discriminative approaches by using more powerful regularization, based on convolutional neural networks. Additionally, we propose a simple, yet effective, boundary adjustment method that alleviates the problematic circular convolution assumption, which is necessary for FFT-based deconvolution. We evaluate our approach on two common non-blind deconvolution benchmarks and achieve state-of-the-art results even when including methods which are computationally considerably more expensive.

Markus Rempfler✳︎, Jan-Hendrik Lange✳︎, Florian Jug, Corinna Blasse, Eugene W Myers, Bjoern H. Menze, Bjoern Andres
Efficient Algorithms for Moral Lineage Tracing
In: 2017 IEEE International Conference on Computer Vision : ICCV 2017 : proceedings : 22-29 October 2017, Venice, Italy (2017), Piscataway, N.J., IEEE (2017), 4705-4714
Lineage tracing, the joint segmentation and tracking of living cells as they move and divide in a sequence of light microscopy images, is a challenging task. Jug et al. [21] have proposed a mathematical abstraction of this task, the moral lineage tracing problem (MLTP), whose feasible solutions define both a segmentation of every image and a lineage forest of cells. Their branch-and-cut algorithm, however, is prone to many cuts and slow convergence for large instances. To address this problem, we make three contributions: (i) we devise the first efficient primal feasible local search algorithms for the MLTP, (ii) we improve the branch-and-cut algorithm by separating tighter cutting planes and by incorporating our primal algorithms, (iii) we show in experiments that our algorithms find accurate solutions on the problem instances of Jug et al. and scale to larger instances, leveraging moral lineage tracing to practical significance.

Natalie Dye, Marko Popović, Stephanie Spannl, Raphael Etournay, Dagmar Kainmüller, Suhrid Ghosh, Eugene W Myers, Frank Jülicher#, Suzanne Eaton#
Cell dynamics underlying oriented growth of the Drosophila wing imaginal disc.
Development, 144(23) 4406-4421 (2017)
Quantitative analysis of the dynamic cellular mechanisms shaping the Drosophila wing during its larval growth phase has been limited, impeding our ability to understand how morphogen patterns regulate tissue shape. Such analysis requires explants to be imaged under conditions that maintain both growth and patterning, as well as methods to quantify how much cellular behaviors change tissue shape. Here, we demonstrate a key requirement for the steroid hormone 20-hydroxyecdysone (20E) in the maintenance of numerous patterning systems in vivo and in explant culture. We find that low concentrations of 20E support prolonged proliferation in explanted wing discs in the absence of insulin, incidentally providing novel insight into the hormonal regulation of imaginal growth. We use 20E-containing media to observe growth directly and to apply recently developed methods for quantitatively decomposing tissue shape changes into cellular contributions. We discover that whereas cell divisions drive tissue expansion along one axis, their contribution to expansion along the orthogonal axis is cancelled by cell rearrangements and cell shape changes. This finding raises the possibility that anisotropic mechanical constraints contribute to growth orientation in the wing disc.

Martin Weigert, Loic Royer, Florian Jug, Gene Myers
Isotropic reconstruction of 3D fluorescence microscopy images using convolutional neural networks
In: Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2017 : 20th International Conference, Quebec City, QC, Canada, September 10-14, 2017, Proceedings, Part II (2017) (Eds.) Maxime Descoteaux (Lecture Notes in Computer Science ; 10434), Cham, Springer International Publishing (2017), 126-134
Fluorescence microscopy images usually show severe anisotropy in axial versus lateral resolution. This hampers downstream processing, i.e. the automatic extraction of quantitative biological data. While deconvolution methods and other techniques to address this problem exist, they are either time consuming to apply or limited in their ability to remove anisotropy. We propose a method to recover isotropic resolution from readily acquired anisotropic data. We achieve this using a convolutional neural network that is trained end-to-end from the same anisotropic body of data we later apply the network to. The network effectively learns to restore the full isotropic resolution by restoring the image under a trained, sample specific image prior. We apply our method to 3 synthetic and 3 real datasets and show that our results improve on results from deconvolution and state-of-the-art super-resolution techniques. Finally, we demonstrate that a standard 3D segmentation pipeline performs on the output of our network with comparable accuracy as on the full isotropic data.

Corinna Blasse#, Stephan Saalfeld, Raphael Etournay, Andreas Sagner, Suzanne Eaton, Eugene W Myers#
PreMosa: extracting 2D surfaces from 3D microscopy mosaics.
Bioinformatics, 33(16) 2563-2569 (2017)
A significant focus of biological research is to understand the development, organization and function of tissues. A particularly productive area of study is on single layer epithelial tissues in which the adherence junctions of cells form a 2D manifold that is fluorescently labeled. Given the size of the tissue, a microscope must collect a mosaic of overlapping 3D stacks encompassing the stained surface. Downstream interpretation is greatly simplified by preprocessing such a dataset as follows: (i) extracting and mapping the stained manifold in each stack into a single 2D projection plane, (ii) correcting uneven illumination artifacts, (iii) stitching the mosaic planes into a single, large 2D image and (iv) adjusting the contrast.

Tobias Boothe, Lennart Hilbert, Michael Heide, Lea Berninger, Wieland B. Huttner, Vasily Zaburdaev, Nadine Vastenhouw, Eugene W Myers, David N. Drechsel, Jochen Rink
A tunable refractive index matching medium for live imaging cells, tissues and model organisms.
Elife, 6 Art. No. e27240 (2017)
Open Access DOI
In light microscopy, refractive index mismatches between media and sample cause spherical aberrations that often limit penetration depth and resolution. Optical clearing techniques can alleviate these mismatches, but they are so far limited to fixed samples. We present Iodixanol as a non-toxic medium supplement that allows refractive index matching in live specimens and thus substantially improves image quality in live-imaged primary cell cultures, planarians, zebrafish and human cerebral organoids.

Richard Grunzke✳︎#, Florian Jug✳︎#, Bernd Schuller, René Jäkel, Gene Myers, Wolfgang E Nagel
Seamless HPC Integration of Data-Intensive KNIME Workflows via UNICORE
In: Euro-Par 2016: Parallel Processing Workshops : Euro-Par 2016 International Workshops, Grenoble, France, August 24-26, 2016, Revised Selected Papers (2017) (Eds.) Frédéric Desprez (Lecture Notes in Computer Science ; 10104), Cham, Springer International Publishing (2017), 480-491

Francesca Giordano, Louise Aigrain, Michael A Quail, Paul Coupland, James K Bonfield, Robert M Davies, German Tischler, David K Jackson, Thomas M Keane, Jing Li, Jia-Xing Yue, Gianni Liti, Richard Durbin, Zemin Ning
De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms.
Sci Rep, 7(1) Art. No. 3935 (2017)
Open Access DOI
Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.

Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot
Flexible indexing of repetitive collections
In: Unveiling Dynamics and Complexity : 13th Conference on Computability in Europe, CiE 2017, Turku, Finland, June 12-16, 2017, Proceedings (2017) (Eds.) Jarkko Kari (Lecture Notes in Computer Science ; 10307), Cham, Springer International Publishing (2017), 162-174
Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA.
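
To illustrate why the run-length encoded Burrows-Wheeler transform suits repetitive text, here is a small sketch that builds the BWT by sorting all rotations and then collapses it into runs. This quadratic construction is only for demonstration and is not the data structure described in the paper; the function names and the "$" sentinel are assumptions made for this example.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse consecutive equal characters into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


if __name__ == "__main__":
    transformed = bwt("abababab")
    print(transformed, run_length_encode(transformed))
```

On a repetitive input, the transform groups equal characters together, so the number of runs is much smaller than the text length; that run count is the quantity an RLBWT-based index stores space in proportion to.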

Dmitrij Schlesinger✳︎, Florian Jug✳︎, Gene Myers, Carsten Rother, Dagmar Kainmueller
Crowd Sourcing Image Segmentation with iaSTAPLE
In: 14th. IEEE International Symposium on Biomedical Imaging: From Nano to Macro ; ISBI 2017 ; Proceedings (2017) (Eds.) Garry Egan, Piscataway, N.J., IEEE (2017), 401-405
We propose a novel label fusion technique as well as a crowdsourcing protocol to efficiently obtain accurate epithelial cell segmentations from non-expert crowd workers. Our label fusion technique simultaneously estimates the true segmentation, the performance levels of individual crowd workers, and an image segmentation model in the form of a pairwise Markov random field. We term our approach image-aware STAPLE (iaSTAPLE) since our image segmentation model seamlessly integrates into the well-known and widely used STAPLE approach. In an evaluation on a light microscopy dataset containing more than 5000 membrane labeled epithelial cells of a fly wing, we show that iaSTAPLE outperforms STAPLE in terms of segmentation accuracy as well as in terms of the accuracy of estimated crowd worker performance levels, and is able to correctly segment 99% of all cells when compared to expert segmentations. These results show that iaSTAPLE is a highly useful tool for crowd sourcing image segmentation.

Martin Weigert, Eugene W Myers, Moritz Kreysing
Biobeam - Rigorous wave-optical simulations of light-sheet microscopy
arXiv, Art. No. arXiv:1706.02261 (2017)
Open Access PDF

Jarno Alanko, Fabio Cunial, Djamal Belazzougui, Veli Mäkinen
A framework for space-efficient read clustering in metagenomic samples.
BMC Bioinformatics, 18(Suppl 3) Art. No. 59 (2017)
Open Access PDF DOI
A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed.

Stefan Leger, Steffen Löck, Volker Hietschold, Robert Haase, Hans Joachim Böhme, Nasreddin Abolmaali
Physical correction model for automatic correction of intensity non-uniformity in magnetic resonance imaging.
Physics and Imaging in Radiation Oncology, 4 32-38 (2017)
Open Access DOI
Background and purpose: Magnetic resonance imaging (MRI) plays an important role in the field of MR-guided radiotherapy or personalised radiation oncology. The application of quantitative image analyses like radiomics as well as automated tissue characterisation is frequently disturbed by the effect of intensity non-uniformity. We present a novel fully automated physical correction model (PCM) for the reduction of intensity non-uniformity. Materials and methods: The proposed algorithm is based on a 3D physically motivated correction model, which maximises the image information expressed by the Shannon entropy. The PCM was evaluated using the coefficient of variation (cv) on 176 MRI datasets of the human brain and abdomen acquired on 1.5 Tesla and 3 Tesla MR scanners. The resulting cv was compared to the cv of the original images and to the results of the established N4 algorithm. Results: The PCM algorithm significantly improved the image quality of all considered 1.5 and 3.0 Tesla MR scans compared to the original images (p < .01). Furthermore, the PCM outperformed or competed with the N4 algorithm in terms of image quality. Additionally, the PCM approach preserved the tissue signal of different tissue types due to smooth correction gradients. Conclusion: The proposed PCM algorithm led to a significantly improved image quality compared to the originally acquired images, suggesting that it is applicable to the correction of MRI data. Thus it may help to reduce intensity non-uniformity which is an important step for advanced image analysis.
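
The two quantities this abstract relies on, the coefficient of variation and the Shannon entropy of image intensities, can be sketched as below. This is only a plain illustration of the definitions on a flat list of intensity values; the use of the population standard deviation, the base-2 logarithm and the function names are assumptions, and nothing here reproduces the PCM algorithm itself.

```python
from collections import Counter
from math import log2
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population form assumed)."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return pstdev(values) / m


def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of intensity values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

Within a region of one tissue type, a lower coefficient of variation indicates more uniform intensities, which is how the abstract uses it as an evaluation measure.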

Loic Royer, William Lemon, Raghav K Chhetri, Yinan Wan, Michael Coleman, Eugene W Myers, Patrick Keller
Adaptive light-sheet microscopy for long-term, high-resolution imaging in living organisms.
Nat Biotechnol, 34(12) 1267-1278 (2016)
Optimal image quality in light-sheet microscopy requires a perfect overlap between the illuminating light sheet and the focal plane of the detection objective. However, mismatches between the light-sheet and detection planes are common owing to the spatiotemporally varying optical properties of living specimens. Here we present the AutoPilot framework, an automated method for spatiotemporally adaptive imaging that integrates (i) a multi-view light-sheet microscope capable of digitally translating and rotating light-sheet and detection planes in three dimensions and (ii) a computational method that continuously optimizes spatial resolution across the specimen volume in real time. We demonstrate long-term adaptive imaging of entire developing zebrafish (Danio rerio) and Drosophila melanogaster embryos and perform adaptive whole-brain functional imaging in larval zebrafish. Our method improves spatial resolution and signal strength two to five-fold, recovers cellular and sub-cellular structures in many regions that are not resolved by non-adaptive imaging, adapts to spatiotemporal dynamics of genetically encoded fluorescent markers and robustly optimizes imaging performance during large-scale morphogenetic changes in living organisms.

Corinna Blasse
Towards accurate and efficient cell tracking during fly wing development
Ph.D. Thesis, Technische Universität Dresden, Dresden, Germany (2016)

German Tischler
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
In: String Processing and Information Retrieval : 23rd International Symposium, SPIRE 2016, Beppu, Japan, October 18-20, 2016, Proceedings (2016) Lecture notes in computer science ; 9954, Cham, Springer International Publishing (2016), 178-190

Matthias Kaiser, Florian Jug, Olin Silander, Siddharth Deshpande, Thomas Pfohl, Thomas Julou#, Gene Myers#, Erik van Nimwegen#
Tracking single-cell gene regulation in dynamically controlled environments using an integrated microfluidic and computational setup
bioRxiv, Art. No. https://doi.org/10.1101/076224 (2016)
Open Access DOI

Florian Jug, Evgeny Levinkov, Corinna Blasse, Eugene W Myers, Bjoern Andres
Moral Lineage Tracing
In: Proceedings 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016) (2016), Piscataway, N.J., IEEE (2016), 5926-5935

L Carine Stapel, Benoit Lombardot, Coleman Broaddus, Dagmar Kainmueller, Florian Jug, Eugene W Myers, Nadine Vastenhouw
Automated detection and quantification of single RNAs at cellular resolution in zebrafish embryos.
Development, 143(3) 540-546 (2016)
Open Access PDF DOI
Analysis of differential gene expression is crucial for the study of cell fate and behavior during embryonic development. However, automated methods for the sensitive detection and quantification of RNAs at cellular resolution in embryos are lacking. With the advent of single-molecule fluorescence in situ hybridization (smFISH), gene expression can be analyzed at single-molecule resolution. However, the limited availability of protocols for smFISH in embryos and the lack of efficient image analysis pipelines have hampered quantification at the (sub)cellular level in complex samples such as tissues and embryos. Here, we present a protocol for smFISH on zebrafish embryo sections in combination with an image analysis pipeline for automated transcript detection and cell segmentation. We use this strategy to quantify gene expression differences between different cell types and identify differences in subcellular transcript localization between genes. The combination of our smFISH protocol and custom-made, freely available, analysis pipeline will enable researchers to fully exploit the benefits of quantitative transcript analysis at cellular and subcellular resolution in tissues and embryos.

Michael N Economo, Nathan G Clack, Luke D Lavis, Charles R Gerfen, Karel Svoboda, Eugene W Myers, Jayaram Chandrashekar
A platform for brain-wide imaging and reconstruction of individual neurons.
Elife, 5 Art. No. e10566 (2016)
Open Access PDF DOI
The structure of axonal arbors controls how signals from individual neurons are routed within the mammalian brain. However, the arbors of very few long-range projection neurons have been reconstructed in their entirety, as axons with diameters as small as 100 nm arborize in target regions dispersed over many millimeters of tissue. We introduce a platform for high-resolution, three-dimensional fluorescence imaging of complete tissue volumes that enables the visualization and reconstruction of long-range axonal arbors. This platform relies on a high-speed two-photon microscope integrated with a tissue vibratome and a suite of computational tools for large-scale image data. We demonstrate the power of this approach by reconstructing the axonal arbors of multiple neurons in the motor cortex across a single mouse brain.

David Richmond, Dagmar Kainmueller, Ben Glocker, Carsten Rother, Gene Myers
Uncertainty-Driven Forest Predictors for Vertebra Localization and Segmentation
In: Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2015 : 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part I (2015) Lecture Notes in Computer Science ; 9349, Cham, Springer International Publishing (2015), 653-660

Yinan Wan, Fuhui Long, Lei Qu, Hang Xiao, Michael Hawrylycz, Eugene W Myers, Hanchuan Peng
BlastNeuron for Automated Comparison, Retrieval and Clustering of 3D Neuron Morphologies.
Neuroinformatics, 13(4) 487-499 (2015)
Characterizing the identity and types of neurons in the brain, as well as their associated function, requires a means of quantifying and comparing 3D neuron morphology. Presently, neuron comparison methods are based on statistics from neuronal morphology such as size and number of branches, which are not fully suitable for detecting local similarities and differences in the detailed structure. We developed BlastNeuron to compare neurons in terms of their global appearance, detailed arborization patterns, and topological similarity. BlastNeuron first compares and clusters 3D neuron reconstructions based on global morphology features and moment invariants, independent of their orientations, sizes, level of reconstruction and other variations. Subsequently, BlastNeuron performs local alignment between any pair of retrieved neurons via a tree-topology driven dynamic programming method. A 3D correspondence map can thus be generated at the resolution of single reconstruction nodes. We applied BlastNeuron to three datasets: (1) 10,000+ neuron reconstructions from a public morphology database, (2) 681 newly and manually reconstructed neurons, and (3) neuron reconstructions produced using several independent reconstruction methods. Our approach was able to accurately and efficiently retrieve morphologically and functionally similar neuron structures from a large morphology database, identify the local common structures, and find clusters of neurons that share similarities in both morphology and molecular profiles.

Avinash Patel, Hyun-Ok Kate Lee, Louise Jawerth, Shovamayee Maharana, Marcus Jahnel, Marco Y Hein, Stoyno Stoynov, Julia Mahamid, Shambaditya Saha, Titus Franzmann, Andrei Pozniakovski, Ina Poser, Nicola Maghelli, Loic Royer, Martin Weigert, Eugene W Myers, Stephan W. Grill, David N. Drechsel, Anthony Hyman, Simon Alberti
A Liquid-to-Solid Phase Transition of the ALS Protein FUS Accelerated by Disease Mutation.
Cell, 162(5) 1066-1077 (2015)
Many proteins contain disordered regions of low-sequence complexity, which cause aging-associated diseases because they are prone to aggregate. Here, we study FUS, a prion-like protein containing intrinsically disordered domains associated with the neurodegenerative disease ALS. We show that, in cells, FUS forms liquid compartments at sites of DNA damage and in the cytoplasm upon stress. We confirm this by reconstituting liquid FUS compartments in vitro. Using an in vitro "aging" experiment, we demonstrate that liquid droplets of FUS protein convert with time from a liquid to an aggregated state, and this conversion is accelerated by patient-derived mutations. We conclude that the physiological role of FUS requires forming dynamic liquid-like compartments. We propose that liquid-like compartments carry the trade-off between functionality and risk of aggregation and that aberrant phase transitions within liquid-like compartments lie at the heart of ALS and, presumably, other age-related diseases.

Loic Royer, Martin Weigert, Ulrik Günther, Nicola Maghelli, Florian Jug, Ivo F. Sbalzarini, Eugene W Myers
ClearVolume: open-source live 3D visualization for light-sheet microscopy.
Nat Methods, 12(6) 480-481 (2015)

Carlas S Smith, Stephan Preibisch, Aviva Joseph, Sara Abrahamsson, Bernd Rieger, Eugene Myers, Robert H. Singer, David Grunwald
Nuclear accessibility of β-actin mRNA is measured by 3D single-molecule real-time tracking.
J Cell Biol, 209(4) 609-619 (2015)
Imaging single proteins or RNAs allows direct visualization of the inner workings of the cell. Typically, three-dimensional (3D) images are acquired by sequentially capturing a series of 2D sections. The time required to step through the sample often impedes imaging of large numbers of rapidly moving molecules. Here we applied multifocus microscopy (MFM) to instantaneously capture 3D single-molecule real-time images in live cells, visualizing cell nuclei at 10 volumes per second. We developed image analysis techniques to analyze messenger RNA (mRNA) diffusion in the entire volume of the nucleus. Combining MFM with precise registration between fluorescently labeled mRNA, nuclear pore complexes, and chromatin, we obtained globally optimal image alignment within 80-nm precision using transformation models. We show that β-actin mRNAs freely access the entire nucleus and fewer than 60% of mRNAs are more than 0.5 µm away from a nuclear pore, and we do so for the first time accounting for spatial inhomogeneity of nuclear organization.

Florian Jug, Tobias Pietzsch, Dagmar Kainmueller, Gene Myers
Tracking by Assignment Facilitates Data Curation
In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014 (2015) IMIC Interactive Medical Image Computing Workshop, Amsterdam, Netherlands, Academic Press (2015), 1-12

Hanchuan Peng, Eugene W Myers
Constructing 5D developing gene expression patterns without live animal imaging
Biomed Eng Lett, 4(4) 338-346 (2015)

Eva M Schmid, David Richmond, Daniel A Fletcher
Reconstitution of proteins on electroformed giant unilamellar vesicles.
Methods Cell Biol, 128 319-338 (2015)
In vitro reconstitution of simplified biological systems from molecular parts has proven to be a powerful method for investigating the biochemical and biophysical principles underlying cellular processes. In recent years, there has been a growing interest in reconstitution of protein-membrane interactions to understand the critical role played by membranes in organizing molecular-scale events into micron-scale patterns and protrusions. However, while all reconstitution experiments depend on identifying and isolating an essential set of soluble biomolecules, such as proteins, DNA, and RNA, reconstitution of membrane-based processes involves the additional challenge of forming and working with lipid bilayer membranes with composition, fluidity, and mechanical properties appropriate for the process at hand. Here we discuss a selection of methods for forming synthetic lipid bilayer membranes and present a versatile electroformation protocol that our lab uses for reconstituting proteins on giant unilamellar vesicles. This synthetic membrane-based approach to reconstitution offers the ability to study protein organization and activity at membranes under more cell-like conditions, addressing a central challenge to accomplishing the grand goal of "building the cell."

Raphael Etournay, Marko Popović, Matthias Merkel, Amitabha Nandi, Corinna Blasse, Benoit Aigouy, Holger Brandl, Gene Myers, Guillaume Salbreux, Frank Jülicher, Suzanne Eaton
Interplay of cell dynamics and epithelial tension during morphogenesis of the Drosophila pupal wing.
Elife, 4 Art. No. e07090 (2015)
Open Access PDF DOI
How tissue shape emerges from the collective mechanical properties and behavior of individual cells is not understood. We combine experiment and theory to study this problem in the developing wing epithelium of Drosophila. At pupal stages, the wing-hinge contraction contributes to anisotropic tissue flows that reshape the wing blade. Here, we quantitatively account for this wing-blade shape change on the basis of cell divisions, cell rearrangements and cell shape changes. We show that cells both generate and respond to epithelial stresses during this process, and that the nature of this interplay specifies the pattern of junctional network remodeling that changes wing shape. We show that patterned constraints exerted on the tissue by the extracellular matrix are key to force the tissue into the right shape. We present a continuum mechanical model that quantitatively describes the relationship between epithelial stresses and cell dynamics, and how their interplay reshapes the wing.

Gene Myers
Efficient Alignment Discovery amongst Noisy Long Reads
In: 14th International Workshop, WABI 2014, Wroclaw, Poland, September 8-10, 2014. Proceedings (2014) Lecture Notes in Bioinformatics, Vol. 8701, New York, Springer (2014), 52-67

Florian Jug, Tobias Pietzsch, Dagmar Kainmueller, Jan Funke, Matthias Kaiser, Erik van Nimwegen, Carsten Rother, Gene Myers
Optimal Joint Segmentation and Tracking of Escherichia Coli in the Mother Machine
In: Bayesian and grAphical models for biomedical imaging first international workshop, BAMBI 2014, Cambridge, MA, USA, September 18, 2014 ; revised selected papers (2014) Lecture Notes in Computer Science ; 8677, New York, Springer (2014), 25-36

Dagmar Kainmueller, Florian Jug, Carsten Rother, Eugene W Myers
Active Graph Matching for Automatic Joint Segmentation and Annotation of C. elegans
In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2014 : 17th International Conference, Boston, MA, USA, September 14-18, 2014, Proceedings, Part I (2014) Lecture Notes in Computer Science ; 8673, New York, Springer (2014), 81-88

Matthias Merkel, Andreas Sagner, Franz Gruber, Raphael Etournay, Corinna Blasse, Gene Myers, Suzanne Eaton, Frank Jülicher
The Balance of Prickle/Spiny-Legs Isoforms Controls the Amount of Coupling between Core and Fat PCP Systems.
Curr Biol, 24(18) 2111-2123 (2014)
The conserved Fat and Core planar cell polarity (PCP) pathways work together to specify tissue-wide orientation of hairs and ridges in the Drosophila wing. Their components form intracellularly polarized complexes at adherens junctions that couple the polarity of adjacent cells and form global patterns. How Fat and Core PCP systems interact is not understood. Some studies suggest that Fat PCP directly orients patterns formed by Core PCP components. Others implicate oriented tissue remodeling in specifying Core PCP patterns.

Fernando Amat, William Lemon, Daniel P Mossing, Katie McDole, Yinan Wan, Kristin Branson, Eugene W Myers, Patrick Keller
Fast, accurate reconstruction of cell lineages from large-scale fluorescence microscopy data.
Nat Methods, 11(9) 951-958 (2014)
The comprehensive reconstruction of cell lineages in complex multicellular organisms is a central goal of developmental biology. We present an open-source computational framework for the segmentation and tracking of cell nuclei with high accuracy and speed. We demonstrate its (i) generality by reconstructing cell lineages in four-dimensional, terabyte-sized image data sets of fruit fly, zebrafish and mouse embryos acquired with three types of fluorescence microscopes, (ii) scalability by analyzing advanced stages of development with up to 20,000 cells per time point at 26,000 cells min⁻¹ on a single computer workstation and (iii) ease of use by adjusting only two parameters across all data sets and providing visualization and editing tools for efficient data curation. Our approach achieves on average 97.0% linkage accuracy across all species and imaging modalities. Using our system, we performed the first cell lineage reconstruction of early Drosophila melanogaster nervous system development, revealing neuroblast dynamics throughout an entire embryo.

Ellie S Heckscher, Fuhui Long, Michael J Layden, Chein-Hui Chuang, Laurina Manning, Jourdain Richart, Joseph C Pearson, Stephen T Crews, Hanchuan Peng, Gene Myers, Chris Q Doe
Atlas-builder software and the eNeuro atlas: resources for developmental biology and neuroscience.
Development, 141(12) 2524-2532 (2014)
A major limitation in understanding embryonic development is the lack of cell type-specific markers. Existing gene expression and marker atlases provide valuable tools, but they typically have one or more limitations: a lack of single-cell resolution; an inability to register multiple expression patterns to determine their precise relationship; an inability to be upgraded by users; an inability to compare novel patterns with the database patterns; and a lack of three-dimensional images. Here, we develop new 'atlas-builder' software that overcomes each of these limitations. A newly generated atlas is three-dimensional, allows the precise registration of an infinite number of cell type-specific markers, is searchable and is open-ended. Our software can be used to create an atlas of any tissue in any organism that contains stereotyped cell positions. We used the software to generate an 'eNeuro' atlas of the Drosophila embryonic CNS containing eight transcription factors that mark the major CNS cell types (motor neurons, glia, neurosecretory cells and interneurons). We found neuronal, but not glial, nuclei occupied stereotyped locations. We added 75 new Gal4 markers to the atlas to identify over 50% of all interneurons in the ventral CNS, and these lines allowed functional access to those interneurons for the first time. We expect the atlas-builder software to benefit a large proportion of the developmental biology community, and the eNeuro atlas to serve as a publicly accessible hub for integrating neuronal attributes (cell lineage, gene expression patterns, axon/dendrite projections, neurotransmitters) and linking them to individual neurons.

Stephan Preibisch#, Fernando Amat, Evangelia Stamataki, Mihail Sarov, Robert H. Singer, Gene Myers, Pavel Tomancak#
Efficient Bayesian-based multiview deconvolution.
Nat Methods, 11(6) 645-648 (2014)
Light-sheet fluorescence microscopy is able to image large specimens with high resolution by capturing the samples from multiple angles. Multiview deconvolution can substantially improve the resolution and contrast of the images, but its application has been limited owing to the large size of the data sets. Here we present a Bayesian-based derivation of multiview deconvolution that drastically improves the convergence time, and we provide a fast implementation using graphics hardware.
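
As a rough illustration of what multiview deconvolution computes, the sketch below runs the classical sequential multiview Richardson-Lucy update on a 1D toy signal. It is not the optimized Bayesian derivation or the GPU implementation described in the abstract; the function names, the toy PSFs and the iteration count are assumptions chosen for illustration only.

```python
def convolve(signal, kernel):
    """Same-length 1D convolution with zero padding (odd-length kernel)."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + r - k  # index into the signal for this kernel tap
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def multiview_richardson_lucy(views, psfs, iterations=30, eps=1e-12):
    """Classical sequential multiview Richardson-Lucy (illustrative baseline)."""
    estimate = [1.0] * len(views[0])
    for _ in range(iterations):
        for view, psf in zip(views, psfs):
            blurred = convolve(estimate, psf)
            # Ratio of observed to predicted data; eps guards against division by zero.
            ratio = [v / (b + eps) for v, b in zip(view, blurred)]
            # Correlate the ratio with the PSF (convolution with the flipped PSF).
            correction = convolve(ratio, psf[::-1])
            estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


# Toy data (assumed): one point source seen through two mirrored, asymmetric PSFs.
truth = [0.0] * 11
truth[5] = 1.0
psfs = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
views = [convolve(truth, psf) for psf in psfs]

restored = multiview_richardson_lucy(views, psfs)
peak = max(range(len(restored)), key=restored.__getitem__)
print(peak, round(restored[peak], 3))
```

Each view on its own places the brightest sample one position away from the true source; combining both views in the update pulls the estimate back toward a single sharp peak at the source position.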

Florian Jug*, Tobias Pietzsch*, Stephan Preibisch, Pavel Tomancak
Bioimage Informatics in the context of Drosophila research.
Methods, 68(1) 60-73 (2014)
Modern biological research relies heavily on microscopic imaging. The advanced genetic toolkit of Drosophila makes it possible to label molecular and cellular components with an unprecedented level of specificity, necessitating the application of the most sophisticated imaging technologies. Imaging in Drosophila spans all scales from single molecules to entire populations of adult organisms, from electron microscopy to live imaging of developmental processes. As the imaging approaches become more complex and ambitious, there is an increasing need for quantitative, computer-mediated image processing and analysis to make sense of the imagery. Bioimage Informatics is an emerging research field that covers all aspects of biological image analysis from data handling, through processing, to quantitative measurements, analysis and data presentation. Some of the most advanced, large-scale projects, combining cutting-edge imaging with complex bioimage informatics pipelines, are realized in the Drosophila research community. In this review, we discuss the current research in biological image analysis specifically relevant to the type of systems-level image datasets that are uniquely available for the Drosophila model system. We focus on how state-of-the-art computer vision algorithms are impacting the ability of Drosophila researchers to analyze biological systems in space and time. We pay particular attention to how these algorithmic advances from computer science are made usable to practicing biologists through open source platforms and how biologists can themselves participate in their further development.

Hanchuan Peng, Jianyong Tang, Hang Xiao, Alessandro Bria, Jianlong Zhou, Victoria Butler, Zhi Zhou, Paloma T Gonzalez-Bellido, Seung W Oh, Jichao Chen, Aniruddha Mitra, Richard W Tsien, Hongkui Zeng, Giorgio A Ascoli, Giulio Iannello, Michael Hawrylycz, Eugene Myers, Fuhui Long
Virtual finger boosts three-dimensional imaging and microsurgery as well as terabyte volume image visualization and analysis.
Nat Commun, 5 Art. No. 4342 (2014)
Open Access PDF DOI
Three-dimensional (3D) bioimaging, visualization and data analysis are in strong need of powerful 3D exploration techniques. We develop virtual finger (VF) to generate 3D curves, points and regions-of-interest in the 3D space of a volumetric image with a single finger operation, such as a computer mouse stroke, or click or zoom from the 2D-projection plane of an image as visualized with a computer. VF provides efficient methods for acquisition, visualization and analysis of 3D images for roundworm, fruitfly, dragonfly, mouse, rat and human. Specifically, VF enables instant 3D optical zoom-in imaging, 3D free-form optical microsurgery, and 3D visualization and annotation of terabytes of whole-brain image volumes. VF also leads to orders of magnitude better efficiency of automated 3D reconstruction of neurons and similar biostructures over our previous systems. We use VF to generate from images of 1,107 Drosophila GAL4 lines a projectome of a Drosophila brain.

Jong-Cheol Rah, Erhan Bas, Jennifer Colonell, Yuriy Mishchenko, Bill Karsh, Richard D Fetter, Eugene W Myers, Dmitri B Chklovskii, Karel Svoboda, Timothy D Harris, John T R Isaac
Thalamocortical input onto layer 5 pyramidal neurons measured using quantitative large-scale array tomography.
Front Neural Circuits, 7 Art. No. 177 (2013)
Open Access PDF DOI
The subcellular locations of synapses on pyramidal neurons strongly influence dendritic integration and synaptic plasticity. Despite this, there is little quantitative data on spatial distributions of specific types of synaptic input. Here we use array tomography (AT), a high-resolution optical microscopy method, to examine thalamocortical (TC) input onto layer 5 pyramidal neurons. We first verified the ability of AT to identify synapses using parallel electron microscopic analysis of TC synapses in layer 4. We then use large-scale array tomography (LSAT) to measure TC synapse distribution on L5 pyramidal neurons in a 1.00 × 0.83 × 0.21 mm³ volume of mouse somatosensory cortex. We found that TC synapses primarily target basal dendrites in layer 5, but also make a considerable input to proximal apical dendrites in L4, consistent with previous work. Our analysis further suggests that TC inputs are biased toward certain branches and, within branches, synapses show significant clustering with an excess of TC synapse nearest neighbors within 5-15 μm compared to a random distribution. Thus, we show that AT is a sensitive and quantitative method to map specific types of synaptic input on the dendrites of entire neurons. We anticipate that this technique will be of wide utility for mapping functionally-relevant anatomical connectivity in neural circuits.

Saket Navlakha*#, Parvez Ahammad*#, Eugene W Myers
Unsupervised segmentation of noisy electron microscopy images using salient watersheds and region merging.
BMC Bioinformatics, 14 Art. No. 294 (2013)
Open Access PDF DOI
Segmenting electron microscopy (EM) images of cellular and subcellular processes in the nervous system is a key step in many bioimaging pipelines involving classification and labeling of ultrastructures. However, fully automated techniques to segment images are often susceptible to noise and heterogeneity in EM images (e.g. different histological preparations, different organisms, different brain regions, etc.). Supervised techniques to address this problem are often helpful but require large sets of training data, which are often difficult to obtain in practice, especially across many conditions.

Lorenz Pammer, Daniel T O'Connor, S Andrew Hires, Nathan G Clack, Daniel Huber, Eugene W Myers, Karel Svoboda
The mechanical variables underlying object localization along the axis of the whisker.
J Neurosci, 33(16) 6726-6741 (2013)
Rodents move their whiskers to locate objects in space. Here we used psychophysical methods to show that head-fixed mice can localize objects along the axis of a single whisker, the radial dimension, with one-millimeter precision. High-speed videography allowed us to estimate the forces and bending moments at the base of the whisker, which underlie radial distance measurement. Mice judged radial object location based on multiple touches. Both the number of touches (1-17) and the forces exerted by the pole on the whisker (up to 573 μN; typical peak amplitude, 100 μN) varied greatly across trials. We manipulated the bending moment and lateral force pressing the whisker against the sides of the follicle and the axial force pushing the whisker into the follicle by varying the compliance of the object during behavior. The behavioral responses suggest that mice use multiple variables (bending moment, axial force, lateral force) to extract radial object localization. Characterization of whisker mechanics revealed that whisker bending stiffness decreases gradually with distance from the face over five orders of magnitude. As a result, the relative amplitudes of different stress variables change dramatically with radial object distance. Our data suggest that mice use distance-dependent whisker mechanics to estimate radial object location using an algorithm that does not rely on precise control of whisking, is robust to variability in whisker forces, and is independent of object compliance and object movement. More generally, our data imply that mice can measure the amplitudes of forces in the sensory follicles for tactile sensation.

Hung-Hsiang Yu, Takeshi Awasaki, Michaela Schroeder, Fuhui Long, Jacob S Yang, Yisheng He, Peng Ding, Jui-Chun Kao, Gloria Yueh-Yi Wu, Hanchuan Peng, Gene Myers, Tzumin Lee
Clonal development and organization of the adult Drosophila central brain.
Curr Biol, 23(8) 633-643 (2013)
The insect brain can be divided into neuropils that are formed by neurites of both local and remote origin. The complexity of the interconnections obscures how these neuropils are established and interconnected through development. The Drosophila central brain develops from a fixed number of neuroblasts (NBs) that deposit neurons in regional clusters.

Fernando Amat, Eugene W Myers, Patrick Keller
Fast and robust optical flow for time-lapse microscopy using super-voxels.
Bioinformatics, 29(3) 373-380 (2013)
Optical flow is a key method used for quantitative motion estimation of biological structures in light microscopy. It has also been used as a key module in segmentation and tracking systems and is considered a mature technology in the field of computer vision. However, most of the research focused on 2D natural images, which are small in size and rich in edges and texture information. In contrast, 3D time-lapse recordings of biological specimens comprise up to several terabytes of image data and often exhibit complex object dynamics as well as blurring due to the point-spread-function of the microscope. Thus, new approaches to optical flow are required to improve performance for such data.

Eugene W Myers
What's behind Blast
In: Models and Algorithms for Genome Evolution (MAGE) (2013) Computational Biology 19, New York, Springer (2013), 3-15

Arnim Jenett, Gerald M Rubin, Teri-T B Ngo, David Shepherd, Christine Murphy, Heather Dionne, Barret D Pfeiffer, Amanda Cavallaro, Donald Hall, Jennifer Jeter, Nirmala Iyer, Dona Fetter, Joanna H Hausenfluck, Hanchuan Peng, Eric T Trautman, Robert R Svirskas, Eugene W Myers, Zbigniew R Iwinski, Yoshinori Aso, Gina M DePasquale, Adrianne Enos, Phuson Hulamm, Shing Chun Benny Lam, Hsing-Hsi Li, Todd R Laverty, Fuhui Long, Lei Qu, Sean D Murphy, Konrad Rokicki, Todd Safford, Kshiti Shaw, Julie H Simpson, Allison Sowell, Susana Tae, Yang Yu, Christopher T Zugates
A GAL4-driver line resource for Drosophila neurobiology.
Cell Rep, 2(4) 991-1001 (2012)
Open Access PDF DOI
We established a collection of 7,000 transgenic lines of Drosophila melanogaster. Expression of GAL4 in each line is controlled by a different, defined fragment of genomic DNA that serves as a transcriptional enhancer. We used confocal microscopy of dissected nervous systems to determine the expression patterns driven by each fragment in the adult brain and ventral nerve cord. We present image data on 6,650 lines. Using both manual and machine-assisted annotation, we describe the expression patterns in the most useful lines. We illustrate the utility of these data for identifying novel neuronal cell types, revealing brain asymmetry, and describing the nature and extent of neuronal shape stereotypy. The GAL4 lines allow expression of exogenous genes in distinct, small subsets of the adult nervous system. The set of DNA fragments, each driving a documented expression pattern, will facilitate the generation of additional constructs for manipulating neuronal function.

Gene Myers
Why bioimage informatics matters.
Nat Methods, 9(7) 659-660 (2012)
Driven by the importance of spatial and physical factors in cellular processes and the size and complexity of modern image data, computational analysis of biological imagery has become a vital emerging sub-discipline of bioinformatics and computer vision.

Erhan Bas, Eugene Myers, Mustafa G. Uzunbas, Dimitris Metaxas
Contextual grouping in a concept: a multistage decision strategy for EM segmentation
In: Conference on Bioinformatics and Biomedicine Barcelona, Spain, 2012 (2012), Amsterdam, Netherlands, Academic Press (2012), 1-8

Jinhyun Kim, Ting Zhao, Ronald S Petralia, Yang Yu, Hanchuan Peng, Eugene Myers, Jeffrey C Magee
mGRASP enables mapping mammalian synaptic connectivity with light microscopy.
Nat Methods, 9(1) 96-102 (2012)
The GFP reconstitution across synaptic partners (GRASP) technique, based on functional complementation between two nonfluorescent GFP fragments, can be used to detect the location of synapses quickly, accurately and with high spatial resolution. The method has been previously applied in the nematode and the fruit fly but requires substantial modification for use in the mammalian brain. We developed mammalian GRASP (mGRASP) by optimizing transmembrane split-GFP carriers for mammalian synapses. Using in silico protein design, we engineered chimeric synaptic mGRASP fragments that were efficiently delivered to synaptic locations and reconstituted GFP fluorescence in vivo. Furthermore, by integrating molecular and cellular approaches with a computational strategy for the three-dimensional reconstruction of neurons, we applied mGRASP to both long-range circuits and local microcircuits in the mouse hippocampus and thalamocortical regions, analyzing synaptic distribution in single neurons and in dendritic compartments.

Nathan G Clack, Daniel T O'Connor, Daniel Huber, Leopoldo Petreanu, Andrew Hires, Simon Peron, Karel Svoboda, Eugene W Myers
Automated tracking of whiskers in videos of head fixed rodents.
PLoS Comput Biol, 8(7) Art. No. e1002591 (2012)
Open Access PDF DOI
We have developed software for fully automated tracking of vibrissae (whiskers) in high-speed videos (>500 Hz) of head-fixed, behaving rodents trimmed to a single row of whiskers. Performance was assessed against a manually curated dataset consisting of 1.32 million video frames comprising 4.5 million whisker traces. The current implementation detects whiskers with a recall of 99.998% and identifies individual whiskers with 99.997% accuracy. The average processing rate for these images was 8 Mpx/s/cpu (2.6 GHz Intel Core2, 2 GB RAM). This translates to 35 processed frames per second for a 640 px × 352 px video of 4 whiskers. The speed and accuracy achieved enable quantitative behavioral studies where the analysis of millions of video frames is required. We used the software to analyze the evolving whisking strategies as mice learned a whisker-based detection task over the course of 6 days (8148 trials, 25 million frames) and measure the forces at the sensory follicle that most underlie haptic perception.
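
The frame rate quoted above follows directly from the pixel throughput and the frame size. The short check below reproduces that arithmetic; the function name is a hypothetical helper, and only the figures stated in the abstract are used as inputs.

```python
def frames_per_second(pixel_rate_px_per_s, width_px, height_px):
    """Frames processed per second for a given pixel throughput and frame size."""
    return pixel_rate_px_per_s / (width_px * height_px)


# 8 Mpx/s per CPU and a 640 px x 352 px frame, as stated in the abstract.
fps = frames_per_second(8e6, 640, 352)
print(f"{fps:.1f} frames/s")  # 35.5 frames/s, i.e. about 35 whole frames per second
```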

Min Liu, Hanchuan Peng, Amit K. Roy-Chowdhury, Eugene Myers
3D Neuron Tip Detection in Volumetric Microscopy Images
In: IEEE international Conference on Bioinformatics and Biomedicine (2011), Piscataway, N.J., IEEE (2011), 366-371

Lei Qu, Fuhui Long, Xiao Liu, Stuart Kim, Eugene Myers, Hanchuan Peng
Simultaneous recognition and segmentation of cells: application in C.elegans.
Bioinformatics, 27(20) 2895-2902 (2011)
Automatic recognition of cell identities is critical for quantitative measurement, targeting and manipulation of cells of model animals at single-cell resolution. It has been shown to be a powerful tool for studying gene expression and regulation, cell lineages and cell fates. Existing methods first segment cells, before applying a recognition algorithm in the second step. As a result, the segmentation errors in the first step directly affect and complicate the subsequent cell recognition step. Moreover, in new experimental settings, some of the image features that have been previously relied upon to recognize cells may not be easy to reproduce, due to limitations on the number of color channels available for fluorescent imaging or to the cost of building transgenic animals. An approach that is more accurate and relies on only a single signal channel is clearly desirable.

Jun Xie, Ting Zhao, Tzumin Lee, Eugene Myers, Hanchuan Peng
Anisotropic path searching for automatic neuron reconstruction.
Med Image Anal, 15(5) 680-689 (2011)
Full reconstruction of neuron morphology is of fundamental interest for the analysis and understanding of their functioning. We have developed a novel method capable of automatically tracing neurons in three-dimensional microscopy data. In contrast to template-based methods, the proposed approach makes no assumptions about the shape or appearance of neurite structure. Instead, an efficient seeding approach is applied to capture complex neuronal structures and the tracing problem is solved by computing the optimal reconstruction with a weighted graph. The optimality is determined by the cost function designed for the path between each pair of seeds and by topological constraints defining the component interrelations and completeness. In addition, an automated neuron comparison method is introduced for performance evaluation and structure analysis. The proposed algorithm is computationally efficient and has been validated using different types of microscopy data sets including Drosophila's projection neurons and fly neurons with presynaptic sites. In all cases, the approach yielded promising results.

Hanchuan Peng, Fuhui Long, Ting Zhao, Eugene Myers
Proof-editing is the bottleneck of 3D neuron reconstruction: the problem and solutions.
Neuroinformatics, 9(2-3) 103-105 (2011)

Ting Zhao, Jun Xie#, Fernando Amat#, Nathan G Clack#, Parvez Ahammad#, Hanchuan Peng#, Fuhui Long#, Eugene Myers#
Automated reconstruction of neuronal morphology based on local geometrical and global structural models.
Neuroinformatics, 9(2-3) 247-261 (2011)
Digital reconstruction of neurons from microscope images is an important and challenging problem in neuroscience. In this paper, we propose a model-based method to tackle this problem. We first formulate a model structure, then develop an algorithm for computing it by carefully taking into account morphological characteristics of neurons, as well as the image properties under typical imaging protocols. The method has been tested on the data sets used in the DIADEM competition and produced promising results for four out of the five data sets.

Markus Decker, Steffen Jaensch, Andrei I. Pozniakovsky, Andrea Zinke, Kevin F O'Connell, Wolfgang Zachariae, Eugene Myers, Anthony A. Hyman
Limiting amounts of centrosome material set centrosome size in C. elegans embryos.
Curr Biol, 21(15) 1259-1267 (2011)
How cells set the size of intracellular structures is an important but largely unsolved problem [1]. Early embryonic divisions pose special problems in this regard. Many checkpoints common in somatic cells are missing from these divisions, which are characterized by rapid reductions in cell size and short cell cycles [2]. Embryonic cells must therefore possess simple and robust mechanisms that allow the size of many of their intracellular structures to rapidly scale with cell size.

Hanchuan Peng, Phuong Chung, Fuhui Long, Lei Qu, Arnim Jenett, Andrew M Seeds, Eugene W Myers, Julie H Simpson
BrainAligner: 3D registration atlases of Drosophila brains.
Nat Methods, 8(6) 493-500 (2011)
Analyzing Drosophila melanogaster neural expression patterns in thousands of three-dimensional image stacks of individual brains requires registering them into a canonical framework based on a fiducial reference of neuropil morphology. Given a target brain labeled with predefined landmarks, the BrainAligner program automatically finds the corresponding landmarks in a subject brain and maps it to the coordinate system of the target brain via a deformable warp. Using a neuropil marker (the antibody nc82) as a reference of the brain morphology and a target brain that is itself a statistical average of data for 295 brains, we achieved a registration accuracy of 2 μm on average, permitting assessment of stereotypy, potential connectivity and functional mapping of the adult fruit fly brain. We used BrainAligner to generate an image pattern atlas of 2954 registered brains containing 470 different expression patterns that cover all the major compartments of the fly brain.

Jun Xie, Ting Zhao, Tzumin Lee, Eugene Myers, Hanchuan Peng
Automatic neuron tracing in volumetric microscopy images with anisotropic path searching.
In: Medical image computing and computer-assisted intervention - MICCAI 2010 : 13th international conference, Beijing, China, September 20-24, 2010; proceedings, Part II (2010) (Eds.) Tianzi Jiang, Nassir Navab, Josien P. W. Pluim, Max A. Viergever, Lecture notes in computer science ; 6362, Dordrecht, Springer (2010), 472-479

Steffen Jaensch, Markus Decker, Anthony A. Hyman, Eugene Myers
Automated tracking and analysis of centrosomes in early Caenorhabditis elegans embryos
Bioinformatics, 26(12) 13-20 (2010)
Motivation: The centrosome is a dynamic structure in animal cells that serves as a microtubule organizing center during mitosis and also regulates cell-cycle progression and sets polarity cues. Automated and reliable tracking of centrosomes is essential for genetic screens that study the process of centrosome assembly and maturation in the nematode Caenorhabditis elegans.

Hanchuan Peng, Zongcai Ruan, Fuhui Long, Julie H Simpson, Eugene W Myers
V3D enables real-time 3D visualization and quantitative analysis of large-scale biological image data sets.
Nat Biotechnol, 28(4) 348-353 (2010)
The V3D system provides three-dimensional (3D) visualization of gigabyte-sized microscopy image stacks in real time on current laptops and desktops. V3D streamlines the online analysis, measurement and proofreading of complicated image patterns by combining ergonomic functions for selecting a location in an image directly in 3D space and for displaying biological measurements, such as from fluorescent probes, using the overlaid surface objects. V3D runs on all major computer platforms and can be enhanced by software plug-ins to address specific biological problems. To demonstrate this extensibility, we built a V3D-based application, V3D-Neuron, to reconstruct complex 3D neuronal structures from high-resolution brain images. V3D-Neuron can precisely digitize the morphology of a single neuron in a fruitfly brain in minutes, with about a 17-fold improvement in reliability and tenfold savings in time compared with other neuron reconstruction tools. Using V3D-Neuron, we demonstrate the feasibility of building a 3D digital atlas of neurite tracts in the fruitfly brain.

Daniel T O'Connor, Nathan G Clack, Daniel Huber, Takaki Komiyama, Eugene W Myers, Karel Svoboda
Vibrissa-based object localization in head-fixed mice.
J Neurosci, 30(5) 1947-1967 (2010)
Linking activity in specific cell types with perception, cognition, and action requires quantitative behavioral experiments in genetic model systems such as the mouse. In head-fixed primates, the combination of precise stimulus control, monitoring of motor output, and physiological recordings over large numbers of trials is the foundation on which many conceptually rich and quantitative studies have been built. Choice-based, quantitative behavioral paradigms for head-fixed mice have not been described previously. Here, we report a somatosensory absolute object localization task for head-fixed mice. Mice actively used their mystacial vibrissae (whiskers) to sense the location of a vertical pole presented to one side of the head and reported with licking whether the pole was in a target (go) or a distracter (no-go) location. Mice performed hundreds of trials with high performance (>90% correct) and localized to <0.95 mm (<6 degrees of azimuthal angle). Learning occurred over 1-2 weeks and was observed both within and across sessions. Mice could perform object localization with single whiskers. Silencing barrel cortex abolished performance to chance levels. We measured whisker movement and shape for thousands of trials. Mice moved their whiskers in a highly directed, asymmetric manner, focusing on the target location. Translation of the base of the whiskers along the face contributed substantially to whisker movements. Mice tended to maximize contact with the go (rewarded) stimulus while minimizing contact with the no-go stimulus. We conjecture that this may amplify differences in evoked neural activity between trial types.

Shing Chun Benny Lam, Zongcai Ruan, Ting Zhao, Fuhui Long, Arnim Jenett, Julie Simpson, Eugene W Myers, Hanchuan Peng
Segmentation of center brains and optic lobes in 3D confocal images of adult fruit fly brains.
Methods, 50(2) 63-69 (2010)
Automatic alignment (registration) of 3D images of adult fruit fly brains is often influenced by the significant displacement of the relative locations of the two optic lobes (OLs) and the center brain (CB). In one of our ongoing efforts to produce a better image alignment pipeline of adult fruit fly brains, we consider separating CB and OLs and aligning them independently. This paper reports our automatic method to segregate CB and OLs, in particular under conditions where the signal to noise ratio (SNR) is low, the variation of the image intensity is large, and the relative displacement of OLs and CB is substantial. We design an algorithm to find a minimum-cost 3D surface in a 3D image stack to best separate an OL (of one side, either left or right) from CB. This surface is defined as an aggregation of the respective minimum-cost curves detected in each individual 2D image slice. Each curve is defined by a list of control points that best segregate OL and CB. To obtain the locations of these control points, we derive an energy function that includes an image energy term defined by local pixel intensities and two internal energy terms that constrain the curve's smoothness and length. A gradient descent method is used to optimize this energy function. To improve both the speed and robustness of the method, for each stack, the locations of optimized control points in a slice are taken as the initialization prior for the next slice. We have tested this approach on simulated and real 3D fly brain image stacks and demonstrated that this method can reasonably segregate OLs from CBs despite the aforementioned difficulties.
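
The slice-by-slice scheme described above can be sketched as follows. This is a minimal toy version, not the authors' implementation: it assumes one control point per image row, takes the image energy to be the interpolated pixel intensity (so the curve settles into a dark gap), uses squared first and second differences as the length and smoothness terms, and estimates the gradient numerically. The weights `alpha` and `beta`, the step size and the synthetic "valley" slices are all illustrative assumptions.

```python
def interp(row, x):
    """Linearly interpolated intensity of one image row at fractional column x."""
    x = min(max(x, 0.0), len(row) - 1.0)
    i = min(int(x), len(row) - 2)
    t = x - i
    return (1.0 - t) * row[i] + t * row[i + 1]


def energy(image, xs, alpha, beta):
    """Image energy plus length and smoothness penalties for a curve xs."""
    e_image = sum(interp(row, x) for row, x in zip(image, xs))
    e_length = sum((b - a) ** 2 for a, b in zip(xs, xs[1:]))
    e_smooth = sum((a - 2 * b + c) ** 2 for a, b, c in zip(xs, xs[1:], xs[2:]))
    return e_image + alpha * e_length + beta * e_smooth


def fit_curve(image, xs0, alpha=0.1, beta=0.1, step=0.05, iters=200, h=1e-3):
    """Gradient descent on the control points, using central differences."""
    xs = list(xs0)
    width = len(image[0])
    for _ in range(iters):
        grad = []
        for i in range(len(xs)):
            up = xs[:i] + [xs[i] + h] + xs[i + 1:]
            down = xs[:i] + [xs[i] - h] + xs[i + 1:]
            diff = energy(image, up, alpha, beta) - energy(image, down, alpha, beta)
            grad.append(diff / (2 * h))
        xs = [min(max(x - step * g, 0.0), width - 1.0) for x, g in zip(xs, grad)]
    return xs


def valley_slice(center, rows=5, cols=12):
    """Synthetic slice (assumed test data) whose darkest column is `center`."""
    return [[float((c - center) ** 2) for c in range(cols)] for _ in range(rows)]


# Each slice is initialized with the optimized control points of the previous one.
stack = [valley_slice(6), valley_slice(7)]
xs = [3.0] * 5
surface = []
for image in stack:
    xs = fit_curve(image, xs)
    surface.append(xs)
print([round(x, 1) for x in surface[0]], [round(x, 1) for x in surface[1]])
```

Reusing the previous slice's result as the starting point means the second slice begins about one pixel from its optimum instead of several, which is the speed and robustness benefit the abstract refers to.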

Eldar Giladi, John Healy#, Eugene W Myers#, Chris Hart#, Philipp Kapranov#, Doron Lipson#, Steve Roels#, Edward Thayer#, Stan Letovsky#
Error Tolerant Indexing and Alignment of Short Reads with Covering Template Families
J Comput Biol, 17(10) 1279-1293 (2010)
The rapid adoption of high-throughput next generation sequence data in biological research is presenting a major challenge for sequence alignment tools, specifically the efficient alignment of vast amounts of short reads to large references in the presence of differences arising from sequencing errors and biological sequence variations. To address this challenge, we developed a short read aligner for high-throughput sequencer data that is tolerant of errors or mutations of all types, namely substitutions, deletions, and insertions. The aligner utilizes a multi-stage approach in which template-based indexing is used to identify candidate regions for alignment with dynamic programming. A template is a pair of gapped seeds, with one used with the read and one used with the reference. In this article, we focus on the development of template families that yield error-tolerant indexing up to a given error-budget. A general algorithm for finding those families is presented, and a recursive construction that creates families with higher error tolerance from ones with a lower error tolerance is developed.
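
To make the idea of an error-tolerant seed family concrete, the sketch below builds a much simpler family than the one in the article. It handles substitutions only and applies the same gapped seed to read and reference, whereas the article's templates pair two different seeds so that insertions and deletions are also covered. The construction here is a plain pigeonhole argument (split the window into `errors + 2` blocks and keep every pair of blocks), and all function names and the example sequence are hypothetical.

```python
from itertools import combinations


def covering_family(window, errors, blocks_kept=2):
    """Gapped seed masks such that any `errors` substitutions miss at least one mask."""
    n_blocks = errors + blocks_kept
    assert window % n_blocks == 0, "window must split evenly into blocks"
    size = window // n_blocks
    family = []
    for kept in combinations(range(n_blocks), blocks_kept):
        # '1' marks positions that contribute to the key, '0' marks ignored positions.
        mask = "".join(("1" if b in kept else "0") * size for b in range(n_blocks))
        family.append(mask)
    return family


def key(seq, mask):
    """Characters of seq at the positions the mask cares about."""
    return "".join(c for c, m in zip(seq, mask) if m == "1")


def build_index(reference, family):
    """Map (mask id, key) to every reference position producing that key."""
    w = len(family[0])
    index = {}
    for pos in range(len(reference) - w + 1):
        window = reference[pos:pos + w]
        for t, mask in enumerate(family):
            index.setdefault((t, key(window, mask)), []).append(pos)
    return index


def candidates(read, index, family):
    """Reference positions that share at least one masked key with the read."""
    hits = set()
    for t, mask in enumerate(family):
        hits.update(index.get((t, key(read, mask)), []))
    return sorted(hits)


reference = "ACGTTGCAAGCTTAGGCATCGA"  # made-up sequence
family = covering_family(window=12, errors=1)
index = build_index(reference, family)

exact = reference[5:17]
# Introduce one substitution in the middle of the read.
read = exact[:6] + ("A" if exact[6] != "A" else "C") + exact[7:]
hits = candidates(read, index, family)
print(family, hits)
```

The candidate positions would then be handed to a full alignment step, such as the dynamic programming stage the abstract mentions, to confirm or reject each hit.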

Xiao Liu, Fuhui Long, Hanchuan Peng, Sarah J Aerni, Min Jiang, Adolfo Sánchez-Blanco, John I Murray, Elicia A. Preston, Barbara Mericle, Serafim Batzoglou, Eugene W Myers, Stuart K Kim
Analysis of cell fate from single-cell gene expression profiles in C. elegans.
Cell, 139(3) 623-633 (2009)
The C. elegans cell lineage provides a unique opportunity to look at how cell lineage affects patterns of gene expression. We developed an automatic cell lineage analyzer that converts high-resolution images of worms into a data table showing fluorescence expression with single-cell resolution. We generated expression profiles of 93 genes in 363 specific cells from L1 stage larvae and found that cells with identical fates can be formed by different gene regulatory pathways. Molecular signatures identified repeating cell fate modules within the cell lineage and enabled the generation of a molecular differentiation map that reveals points in the cell lineage when developmental fates of daughter cells begin to diverge. These results demonstrate insights that become possible using computational approaches to analyze quantitative expression from many genes in parallel using a digital gene expression atlas.

Fuhui Long, Hanchuan Peng, Xiao Liu, Stuart K Kim, Eugene Myers
A 3D digital atlas of C. elegans and its application to single-cell analyses.
Nat Methods, 6(9) 667-672 (2009)
We built a digital nuclear atlas of the newly hatched, first larval stage (L1) of the wild-type hermaphrodite of Caenorhabditis elegans at single-cell resolution from confocal image stacks of 15 individual worms. The atlas quantifies the stereotypy of nuclear locations and provides other statistics on the spatial patterns of the 357 nuclei that could be faithfully segmented and annotated out of the 558 present at this developmental stage. We then developed an automated approach to assign cell names to each nucleus in a three-dimensional image of an L1 worm. We achieved 86% accuracy in identifying the 357 nuclei automatically. This computational method will allow high-throughput single-cell analyses of the post-embryonic worm, such as gene expression analysis, or ablation or stimulation of cells under computer control in a high-throughput functional screen.

Hanchuan Peng, Fuhui Long, Eugene W Myers
VANO: a volume-object image annotation system.
Bioinformatics, 25(5) 695-697 (2009)
Volume-object annotation system (VANO) is a cross-platform image annotation system that enables one to conveniently visualize and annotate 3D volume objects including nuclei and cells. An application of VANO typically starts with an initial collection of objects produced by a segmentation computation. The objects can then be labeled, categorized, deleted, added, split, merged and redefined. VANO has been used to build high-resolution digital atlases of the nuclei of Caenorhabditis elegans at the L1 stage and the nuclei of Drosophila melanogaster's ventral nerve cord at the late embryonic stage.

Hanchuan Peng, Fuhui Long, Xiao Liu, Stuart K Kim, Eugene W Myers
Straightening Caenorhabditis elegans images.
Bioinformatics, 24(2) 234-242 (2008)
Caenorhabditis elegans, a roundworm found in soil, is a widely studied model organism with about 1000 cells in the adult. Producing high-resolution fluorescence images of C. elegans to reveal biological insights is becoming routine, motivating the development of advanced computational tools for analyzing the resulting image stacks. For example, worm bodies usually curve significantly in images. Thus one must 'straighten' the worms if they are to be compared under a canonical coordinate system.

Fuhui Long, Hanchuan Peng, Xiao Liu, Stuart K Kim, Gene Myers
Automatic recognition of cells (ARC) for 3D images of C. elegans
In: Proc. 2008 Conf. on Computational Molecular Biology (RECOMB) (Singapore, 2008) (2008) Lecture Notes in Computer Science ; 4955, New York, Springer (2008), 128-139

David J Begun, Alisha K Holloway#, Kristian Stevens#, LaDeana W Hillier#, Yu-Ping Poh#, Matthew W Hahn#, Phillip M Nista#, Corbin D Jones#, Andrew D Kern#, Colin N Dewey#, Lior Pachter#, Eugene Myers, Charles H Langley#
Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans.
PLoS Biol, 5(11) 310-310 (2007)
Open Access PDF DOI
The population genetic perspective is that the processes shaping genomic variation can be revealed only through simultaneous investigation of sequence polymorphism and divergence within and between closely related species. Here we present a population genetic analysis of Drosophila simulans based on whole-genome shotgun sequencing of multiple inbred lines and comparison of the resulting data to genome assemblies of the closely related species, D. melanogaster and D. yakuba. We discovered previously unknown, large-scale fluctuations of polymorphism and divergence along chromosome arms, and significantly less polymorphism and faster divergence on the X chromosome. We generated a comprehensive list of functional elements in the D. simulans genome influenced by adaptive evolution. Finally, we characterized genomic patterns of base composition for coding and noncoding sequence. These results suggest several new hypotheses regarding the genetic and biological mechanisms controlling polymorphism and divergence across the Drosophila genome, and provide a rich resource for the investigation of adaptive evolution and functional variation in D. simulans.

Christopher D Smith, Robert C Edgar, Mark D Yandell, Douglas R Smith, Susan E Celniker, Eugene W Myers, Gary H Karpen
Improved repeat identification and masking in Dipterans.
Gene, 389(1) 1-9 (2007)
Repetitive sequences are a major constituent of many eukaryote genomes and play roles in gene regulation, chromosome inheritance, nuclear architecture, and genome stability. The identification of repetitive elements has traditionally relied on in-depth, manual curation and computational determination of close relatives based on DNA identity. However, the rapid divergence of repetitive sequence has made identification of repeats by DNA identity difficult even in closely related species. Hence, the presence of unidentified repeats in genome sequences affects the quality of gene annotations and annotation-dependent analyses (e.g. microarray analyses). We have developed an enhanced repeat identification pipeline using two approaches. First, the de novo repeat finding program PILER-DF was used to identify interspersed repetitive elements in several recently finished Dipteran genomes. Repeats were classified, when possible, according to their similarity to known elements described in Repbase and GenBank, and also screened against annotated genes as one means of eliminating false positives. Second, we used a new program called RepeatRunner, which integrates results from both RepeatMasker nucleotide searches and protein searches using BLASTX. Using RepeatRunner with PILER-DF predictions, we masked repeats in thirteen Dipteran genomes and conclude that combining PILER-DF and RepeatRunner greatly enhances repeat identification in both well-characterized and un-annotated genomes.

Michael M Mwangi, Shang Wei Wu#, Yanjiao Zhou#, Krzysztof Sieradzki#, Herminia de Lencastre#, Paul Richardson#, David Bruce#, Edward Rubin#, Eugene W Myers#, Eric D Siggia#, Alexander Tomasz#
Tracking the in vivo evolution of multidrug resistance in Staphylococcus aureus by whole-genome sequencing
Proc Natl Acad Sci U.S.A., 104(22) 9451-9456 (2007)

Hanchuan Peng, Fuhui Long, Eugene W Myers
In: Proceedings in 4th IEEE International Symposium on Biomedical Imaging: from Nano to Macro, 2007 : ISBI 2007 ; 12 - 15 April 2007, Arlington, Virginia, USA (2007), Piscataway, N.J., IEEE (2007), 292-295

Paul Medvedev, Konstantinos Georgiou, Gene Myers, Michael Brudno
Computability of Models for Sequence Assembly
In: Proceedings in Algorithms in bioinformatics : 7th international workshop, WABI 2007, Philadelphia, PA, USA (2007) Lecture notes in computer science ; 4645, Berlin;Heidelberg, Springer (2007), 289-301

Fuhui Long, Hanchuan Peng, Eugene W Myers
Automatic Segmentation of Nuclei in 3D Microscopy Images of C. Elegans
In: Proceedings in 4th IEEE International Symposium on Biomedical Imaging: from Nano to Macro (2007), Piscataway, N.J., IEEE (2007), 536-539

Hanchuan Peng, Fuhui Long, Jie Zhou, Garmay Leung, Michael B Eisen, Eugene W Myers
Automatic image analysis for gene expression patterns of fly embryos.
BMC Cell Biol, 8 Suppl 1 7-7 (2007)
Open Access PDF DOI
Staining the mRNA of a gene via in situ hybridization (ISH) during the development of a D. melanogaster embryo delivers the detailed spatio-temporal pattern of expression of the gene. Many biological problems such as the detection of co-expressed genes, co-regulated genes, and transcription factor binding motifs rely heavily on the analyses of these image patterns. The increasing availability of ISH image data motivates the development of automated computational approaches to the analysis of gene expression patterns.

Tien-Ho Lin, Eugene W Myers, Eric P Xing
Interpreting anonymous DNA samples from mass disasters--probabilistic forensic inference using genetic markers.
Bioinformatics, 22(14) 298-306 (2006)
The problem of identifying victims in a mass disaster using DNA fingerprints involves a scale of computation that requires efficient and accurate algorithms. In a typical scenario there are hundreds of samples taken from remains that must be matched to the pedigrees of the alleged victim's surviving relatives. Moreover the samples are often degraded due to heat and exposure. To develop a competent method for this type of forensic inference problem, the complicated quality issues of DNA typing need to be handled appropriately, the matches between every sample and every family must be considered, and the confidence of matches needs to be provided.

Kim R Rasmussen, Jens Stoye, Eugene W Myers
Efficient q-gram filters for finding all epsilon-matches over a given length.
J Comput Biol, 13(2) 296-308 (2006)
Fast and exact comparison of large genomic sequences remains a challenging task in biosequence analysis. We consider the problem of finding all epsilon-matches between two sequences, i.e., all local alignments over a given length with an error rate of at most epsilon. We study this problem theoretically, giving an efficient q-gram filter for solving it. Two applications of the filter are also discussed, in particular genomic sequence assembly and BLAST-like sequence comparison. Our results show that the method is 25 times faster than BLAST, while not being heuristic.
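As a rough illustration of what a q-gram filter does, the sketch below applies the classical q-gram lemma: two strings within edit distance e, the longer having length n, share at least n + 1 - q(e + 1) q-grams. This is a textbook filter and not the filter developed in the paper; the function names are hypothetical.

```python
from collections import Counter


def qgram_counts(s, q):
    """Multiset of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counted with multiplicity."""
    return sum((qgram_counts(a, q) & qgram_counts(b, q)).values())


def qgram_threshold(n, q, e):
    """Lower bound from the q-gram lemma for length n and at most e edits."""
    return n + 1 - q * (e + 1)


def passes_filter(a, b, q, e):
    """True if (a, b) cannot be ruled out as being within e edits.

    A pair that fails is guaranteed to differ by more than e edits;
    a pair that passes still has to be verified by a real alignment.
    """
    n = max(len(a), len(b))
    return shared_qgrams(a, b, q) >= qgram_threshold(n, q, e)
```

For example, `passes_filter("ACGTACGTAC", "ACGTTCGTAC", q=3, e=1)` keeps the pair, while a pair with no common 3-grams is discarded without any alignment.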

Hanchuan Peng, Fuhui Long, Michael B Eisen, Eugene W Myers
Clustering gene expression patterns of fly embryos
In: Proceedings in 3rd IEEE International Symposium on Biomedical Imaging: Macro to Nano (2006), Piscataway, N.J., IEEE (2006), 1144-1147

Eugene W Myers
The fragment assembly string graph.
Bioinformatics, 21 Suppl 2 79-85 (2005)
We present a concept and formalism, the string graph, which represents all that is inferable about a DNA sequence from a collection of shotgun sequencing reads collected from it. We give time and space efficient algorithms for constructing a string graph given the collection of overlaps between the reads and, in particular, present a novel linear expected time algorithm for transitive reduction in this context. The result demonstrates that the decomposition of reads into kmers employed in the de Bruijn graph approach described earlier is not essential, and exposes its close connection to the unitig approach we developed at Celera. This paper is a preliminary piece giving the basic algorithm and results that demonstrate the efficiency and scalability of the method. These ideas are being used to build a next-generation whole genome assembler called BOA (Berkeley Open Assembler) that will easily scale to mammalian genomes.
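To make the transitive reduction step concrete, here is a minimal sketch under simplifying assumptions: the overlap graph is given as a plain dictionary from each read to the set of reads it overlaps on one side, and an edge u -> w is dropped whenever some v gives a two-step path u -> v -> w. This is the naive version, not the linear expected time algorithm of the paper, and it ignores overlap lengths and read orientation. All names are hypothetical.

```python
def reduce_transitive_edges(overlaps):
    """Drop every edge u -> w for which a path u -> v -> w exists.

    overlaps: dict mapping a read id to the set of read ids it overlaps.
    Returns a new dict; the input is left unchanged.
    """
    reduced = {read: set(succ) for read, succ in overlaps.items()}
    for u, succ in overlaps.items():
        for v in succ:
            if v == u:
                continue  # ignore self-overlaps
            for w in overlaps.get(v, ()):
                if w != v:
                    # u -> w is implied by u -> v -> w, so it is redundant
                    reduced[u].discard(w)
    return reduced
```

With reads A, B, C where A overlaps B and C, and B overlaps C, the edge A -> C is removed and the chain A -> B -> C remains.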

Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, Scott B Dewell, Lei Du, Joseph M Fierro, Xavier V Gomes, Brian C Godwin, Wen He, Scott Helgesen, Chun Heen Ho, Gerard P Irzyk, Szilveszter C Jando, Maria L I Alenquer, Thomas P Jarvie, Kshama B Jirage, Jong-Bum Kim, James R Knight, Janna R Lanza, John H Leamon, Steven M Lefkowitz, Ming Lei, Jing Li, Kenton L Lohman, Hong Lu, Vinod B Makhijani, Keith E McDade, Michael P McKenna, Eugene W Myers, Elizabeth Nickerson, John R Nobile, Ramona Plant, Bernard P Puc, Michael T Ronan, George T Roth, Gary J Sarkis, Jan Fredrik Simons, John W Simpson, Maithreyan Srinivasan, Karrie R Tartaro, Alexander Tomasz, Kari A Vogt, Greg A Volkmer, Shally H Wang, Yong Wang, Michael P Weiner, Pengguang Yu, Richard F Begley, Jonathan M Rothberg
Genome sequencing in microfabricated high-density picolitre reactors.
Nature, 437(7057) 376-380 (2005)
The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.

Robert C Edgar, Eugene W Myers
PILER: identification and classification of genomic repeats.
Bioinformatics, 21 Suppl 1 152-158 (2005)
Repeated elements such as satellites and transposons are ubiquitous in eukaryotic genomes. De novo computational identification and classification of such elements is a challenging problem. Therefore, repeat annotation of sequenced genomes has historically largely relied on sequence similarity to hand-curated libraries of known repeat families. We present a new approach to de novo repeat annotation that exploits characteristic patterns of local alignments induced by certain classes of repeats. We describe PILER, a package of efficient search algorithms for identifying such patterns. Novel repeats found using PILER are reported for Homo sapiens, Arabidopsis thaliana and Drosophila melanogaster.

Roded Sharan, Eugene W Myers
A motif-based framework for recognizing sequence families.
Bioinformatics, 21 Suppl 1 387-393 (2005)
Many signals in biological sequences are based on the presence or absence of base signals and their spatial combinations. One of the best known examples of this is the signal identifying a core promoter--the site at which the basal transcription machinery starts the transcription of a gene. Our goal is a fully automatic pattern recognition system for a family of sequences, which simultaneously discovers the base signals, their spatial relationships and a classifier based upon them.

Kim R Rasmussen, Jens Stoye, Eugene W Myers
Efficient q-Gram Filters for Finding All ε-Matches over a Given Length
In: Proceedings in Research in computational molecular biology : 9th annual international conference, RECOMB 2005, Cambridge, MA, USA (2005) Lecture notes in computer science ; 3500, Berlin;Heidelberg, Springer (2005), 189-203

Sorin Istrail, Granger G Sutton, Liliana Florea, Aaron L. Halpern, C M Mobarry, Ross Lippert, Brian Walenz, Hagit Shatkay, I M Dew, Jason R Miller, M J Flanigan, Nathan J Edwards, R A Bolanos, D P Fasulo, Bjarni V Halldorsson, Sridhar Hannenhalli, Russell Turner, Shibu Yooseph, Fu Lu, D R Nusskern, Bixiong Chris Shue, Xiangqun Holly Zheng, Fei Zhong, A L Delcher, Daniel H. Huson, S A Kravitz, Laurent Mouchard, Knut Reinert, K A Remington, Andrew Clark, Michael S Waterman, Evan E Eichler, Mark D Adams, Michael W Hunkapiller, Eugene W Myers, J C Venter
Whole-genome shotgun assembly and comparison of human genome assemblies.
Proc Natl Acad Sci U.S.A., 101(7) 1916-1921 (2004)
We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.

Gad M Landau, Eugene Myers, Michal Ziv-Ukelson
Two Algorithms for LCS Consecutive Suffix Alignment
In: Proceedings in Combinatorial Pattern Matching : 15th Annual Symposium, CPM 2004, Istanbul, Turkey (2004) Lecture Notes in Computer Science ; 3109, Berlin;Heidelberg, Springer (2004), 173-193

Hanchuan Peng, Eugene W Myers
Comparing in situ mRNA expression patterns of drosophila embryos
In: Proceedings of the eighth annual international conference on Research in computational molecular biology (2004), New York, ACM (2004), 157-166

A Bernardo Carvalho, Maria D Vibranovski, Joseph W Carlson, Susan E Celniker, Roger A Hoskins, Gerald M Rubin, Granger G Sutton, Mark D Adams, Eugene W Myers, Andrew Clark
Y chromosome and other heterochromatic sequences of the Drosophila melanogaster genome: how far can we go?
Genetica, 117(2-3) 227-237 (2003)
Whole genome shotgun assemblies have proven remarkably successful in reconstructing the bulk of euchromatic genes, with the only limit appearing to be determined by the sequencing depth. For genes imbedded in heterochromatin, however, the low cloning efficiency of repetitive sequences, combined with the computational challenges, demand that additional clues be used to annotate the sequences. One approach that has proven very successful in identifying protein coding genes in Y-linked heterochromatin of Drosophila melanogaster has been to make a BLASTable database of the small, unmapped contigs and fragments leftover at the end of a shotgun assembly, and to attempt to capture these by blasting with an appropriate query sequence. This approach often yields a staggered alignment of contigs from the unmapped set to the query sequence, as though the disjoint contigs represent small portions of the gene. Further inspection frequently shows that the contigs are broken by very large, heterochromatic introns. Methods of this sort are being expanded to make best use of all available clues to determine which unmapped contigs are associated with genes. These include use of EST libraries, and, in the case of the Y chromosome, testing of male specific genes and reduced shotgun depth of relevant contigs. It appears much more hopeful than anyone would have imagined that whole genome shotgun assemblies can recover the great bulk of even heterochromatic genes.

Mark D Adams, Granger G Sutton, Hamilton O Smith, Eugene W Myers, J C Venter
The independence of our genome assemblies.
Proc Natl Acad Sci U.S.A., 100(6) 3025-3026 (2003)

Gene Myers, Richard Durbin
A table-driven, full-sensitivity similarity search algorithm.
J Comput Biol, 10(2) 103-117 (2003)
Searching a database for a local alignment to a query under a typical scoring scheme, such as PAM120 or BLOSUM62 with affine gap costs, is a computation that has resisted algorithmic improvement due to its basis in dynamic programming and the weak nature of the signals being searched for. In a query preprocessing step, a set of tables can be built that permit one to (a) eliminate a large fraction of the dynamic programming matrix from consideration and (b) to compute several steps of the remainder with a single table lookup. While this result is not an asymptotic improvement over the original Smith-Waterman algorithm, its complexity is characterized in terms of some sparse features of the matrix and it yields the fastest software implementation to date for such searches.

Robert A Holt, G Mani Subramanian, Aaron L. Halpern, Granger G Sutton, Rosane Charlab, D R Nusskern, Patrick Wincker, Andrew Clark, José M C Ribeiro, Ron Wides, Steven L Salzberg, Brendan Loftus, Mark D Yandell, William H Majoros, Douglas B Rusch, Zhongwu Lai, Cheryl L Kraft, Josep F Abril, Veronique Anthouard, Peter Arensburger, Peter W Atkinson, Holly Baden, Veronique de Berardinis, Danita Baldwin, Vladimir Benes, Jim Biedler, Claudia Blass, R A Bolanos, Didier Boscus, Mary Barnstead, Shuang Cai, Angela Center, Kabir Chaturverdi, George K Christophides, Mathew A Chrystal, Michele Clamp, Anibal Cravchik, Val Curwen, Ali Dana, A L Delcher, I M Dew, Cheryl A Evans, M J Flanigan, Anne Grundschober-Freimoser, Lisa Friedli, Zhiping Gu, Ping Guan, Roderic Guigo, Maureen E Hillenmeyer, Susanne L Hladun, James R Hogan, Young S Hong, Jeffrey Hoover, Olivier Jaillon, Zhaoxi Ke, Chinnappa Kodira, Elena Kokoza, Anastasios Koutsos, Ivica Letunic, Alex Levitsky, Yong Liang, Jhy-Jhu Lin, Neil F Lobo, John R Lopez, Joel A Malek, Tina C McIntosh, Stephan Meister, Jason Miller, C M Mobarry, Emmanuel Mongin, Sean D Murphy, David A O'Brochta, Cynthia Pfannkoch, Rong Qi, Megan A Regier, K A Remington, Hongguang Shao, Maria V Sharakhova, Cynthia D Sitter, Jyoti Shetty, Thomas J Smith, Renee Strong, Jingtao Sun, Dana Thomasova, Lucas Q Ton, Pantelis Topalis, Zhijian Tu, Maria F Unger, Brian Walenz, Aihui Wang, Jian Wang, Mei Wang, Xuelan Wang, Kerry J Woodford, Jennifer R Wortman, Martin Wu, Alison Yao, Evgeny M Zdobnov, Hongyu Zhang, Qi Zhao, Senming Zhao, Shiaoping C Zhu, Igor Zhimulev, Mario Coluzzi, Alessandra della Torre, Charles W Roth, Christos Louis, Francis Kalush, Richard J Mural, Eugene W Myers, Mark D Adams, Hamilton O Smith, Samuel Broder, Melissa Gardner, Claire M Fraser, Ewan Birney, Peer Bork, Paul T Brey, J Craig Venter, Jean Weissenbach, Fotis C Kafatos, Frank H Collins, Stephen L Hoffman
The genome sequence of the malaria mosquito Anopheles gambiae.
Science, 298(5591) 129-149 (2002)
Anopheles gambiae is the principal vector of malaria, a disease that afflicts more than 500 million people and causes more than 1 million deaths each year. Tenfold shotgun sequence coverage was obtained from the PEST strain of A. gambiae and assembled into scaffolds that span 278 million base pairs. A total of 91% of the genome was organized in 303 scaffolds; the largest scaffold was 23.1 million base pairs. There was substantial genetic variation within this strain, and the apparent existence of two haplotypes of approximately equal frequency ("dual haplotypes") in a substantial fraction of the genome likely reflects the outbred nature of the PEST strain. The sequence produced a conservative inference of more than 400,000 single-nucleotide polymorphisms that showed a markedly bimodal density distribution. Analysis of the genome sequence revealed strong evidence for about 14,000 protein-encoding transcripts. Prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted. An expressed sequence tag analysis of genes regulated by blood feeding provided insights into the physiological adaptations of a hematophagous insect.

Daniel H. Huson, Knut Reinert, Eugene W Myers
The greedy path-merging algorithm for contig scaffolding
Journal of the ACM (JACM), 49(5) 603-615 (2002)

Jeffrey A Bailey, Zhiping Gu, Royden A Clark, Knut Reinert, Rhea V Samonte, Stuart Schwartz, Mark D Adams, Eugene W Myers, Peter W Li, Evan E Eichler
Recent segmental duplications in the human genome.
Science, 297(5583) 1003-1007 (2002)
Primate-specific segmental duplications are considered important in human disease and evolution. The inability to distinguish between allelic and duplication sequence overlap has hampered their characterization as well as assembly and annotation of our genome. We developed a method whereby each public sequence is analyzed at the clone level for overrepresentation within a whole-genome shotgun sequence. This test has the ability to detect duplications larger than 15 kilobases irrespective of copy number, location, or high sequence similarity. We mapped 169 large regions flanked by highly similar duplications. Twenty-four of these hot spots of genomic instability have been associated with genetic disease. Our analysis indicates a highly nonrandom chromosomal and genic distribution of recent segmental duplications, with a likely role in expanding protein diversity.

Richard J Mural, Mark D Adams, Eugene W Myers, Hamilton O Smith, George L Gabor Miklos, Ron Wides, Aaron L. Halpern, Peter W Li, Granger G Sutton, Joe Nadeau, Steven L Salzberg, Robert A Holt, Chinnappa Kodira, Fu Lu, Lin Chen, Zuoming Deng, Carlos C Evangelista, Weiniu Gan, Thomas J Heiman, Jiayin Li, Zhenya Li, Gennady V Merkulov, Natalia V Milshina, Ashwinikumar K Naik, Rong Qi, Bixiong Chris Shue, Aihui Wang, Jian Wang, Xin Wang, Xianghe Yan, Jane Ye, Shibu Yooseph, Qi Zhao, Liansheng Zheng, Shiaoping C Zhu, Kendra Biddick, R A Bolanos, A L Delcher, I M Dew, D P Fasulo, M J Flanigan, Daniel H. Huson, S A Kravitz, Jason R Miller, C M Mobarry, Knut Reinert, K A Remington, Qinyu Zhang, Xiangqun H Zheng, D R Nusskern, Zhongwu Lai, Yiding Lei, Wenyan Zhong, Alison Yao, Ping Guan, Rui-Ru Ji, Zhiping Gu, Zhen-Yuan Wang, Fei Zhong, Chunlin Xiao, Chia-Chien Chiang, Mark D Yandell, Jennifer R Wortman, Peter G Amanatides, Susanne L Hladun, Eric C Pratts, Jeffery E Johnson, Kristina L Dodson, Kerry J Woodford, Cheryl A Evans, Barry Gropman, Douglas B Rusch, Eli Venter, Mei Wang, Thomas J Smith, Jarrett T Houck, Donald E Tompkins, Charles Haynes, Debbie Jacob, Soo H Chin, David R Allen, Carl E Dahlke, Robert Sanders, Kelvin Li, Xiangjun Liu, Alex Levitsky, William H Majoros, Quan Chen, Ashley C Xia, John R Lopez, Michael T Donnelly, Matthew H Newman, Anna Glodek, Cheryl L Kraft, Marc Nodell, Feroze Ali, Hui-Jin An, Danita Baldwin-Pitts, Karen Y Beeson, Shuang Cai, Mark Carnes, Amy Carver, Parris M Caulk, Angela Center, Yen-Hui Chen, Ming-Lai Cheng, My D Coyne, Michelle Crowder, Steven Danaher, Lionel B Davenport, Raymond Desilets, Susanne M Dietz, Lisa Doup, Patrick Dullaghan, Steven Ferriera, Carl R Fosler, Harold C Gire, Andres Gluecksmann, Jeannine D Gocayne, Jonathan Gray, Brit Hart, Jason Haynes, Jeffrey Hoover, Tim Howland, Chinyere Ibegwam, Mena Jalali, David Johns, Leslie Kline, Daniel S Ma, Steven MacCawley, Anand Magoon, Felecia Mann, David May, Tina C McIntosh, Somil Mehta, Linda Moy, Mee C Moy, Brian J Murphy, Sean D Murphy, Keith A Nelson, Zubeda Nuri, Kimberly A Parker, Alexandre C Prudhomme, Vinita N Puri, Hina Qureshi, John C Raley, Matthew S Reardon, Megan A Regier, Yu-Hui C Rogers, Deanna L Romblad, Jakob Schutz, John L Scott, Richard Scott, Cynthia D Sitter, Michella Smallwood, Andrew Sprague, Erin Stewart, Renee Strong, Ellen Suh, Karena Sylvester, Reginald Thomas, Ni Ni Tint, Christopher Tsonis, Gary Wang, George Wang, Monica S Williams, Sherita M Williams, Sandra M Windsor, Keriellen Wolfe, Mitchell M Wu, Jayshree Zaveri, Kabir Chaturvedi, Andrei E Gabrielian, Zhaoxi Ke, Jingtao Sun, Gangadharan Subramanian, J Craig Venter, Cynthia Pfannkoch, Mary Barnstead, Lisa D Stephenson
A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome.
Science, 296(5573) 1661-1671 (2002)
The high degree of similarity between the mouse and human genomes is demonstrated through analysis of the sequence of mouse chromosome 16 (Mmu 16), which was obtained as part of a whole-genome shotgun assembly of the mouse genome. The mouse genome is about 10% smaller than the human genome, owing to a lower repetitive DNA content. Comparison of the structure and protein-coding potential of Mmu 16 with that of the homologous segments of the human genome identifies regions of conserved synteny with human chromosomes (Hsa) 3, 8, 12, 16, 21, and 22. Gene content and order are highly conserved between Mmu 16 and the syntenic blocks of the human genome. Of the 731 predicted genes on Mmu 16, 509 align with orthologs on the corresponding portions of the human genome, 44 are likely paralogous to these genes, and 164 genes have homologs elsewhere in the human genome; there are 14 genes for which we could find no human counterpart.

Eugene W Myers, Granger G Sutton, Hamilton O Smith, Mark D Adams, J Craig Venter
On the sequencing and assembly of the human genome.
Proc Natl Acad Sci U.S.A., 99(7) 4145-4146 (2002)

Roger A Hoskins, Christopher D Smith, Joseph W Carlson, A Bernardo Carvalho, Aaron L. Halpern, Joshua S Kaminker, Cameron Kennedy, Chris J Mungall, Beth A Sullivan, Granger G Sutton, Jiro C Yasuhara, Barbara T Wakimoto, Eugene W Myers, Susan E Celniker, Gerald M Rubin, Gary H Karpen
Heterochromatic sequences in a Drosophila whole-genome shotgun assembly.
Genome Biol, 3(12) 85-85 (2002)
Open Access PDF
Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly.

Susan E Celniker, David A Wheeler, Brent Kronmiller, Joseph W Carlson, Aaron L. Halpern, Sandeep Patel, Mark Adams, Mark Champe, Shannon P Dugan, Erwin Frise, Ann Hodgson, Reed A George, Roger A Hoskins, Todd R Laverty, Donna M Muzny, Catherine R Nelson, Joanne M Pacleb, Soo Park, Barret D Pfeiffer, Stephen Richards, Erica J Sodergren, Robert R Svirskas, Paul E Tabor, Kenneth Wan, Mark Stapleton, Granger G Sutton, Craig Venter, George Weinstock, Steven E Scherer, Eugene W Myers, Richard A Gibbs, Gerald M Rubin
Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence.
Genome Biol, 3(12) Art. No. research0079.1–0079.14 (2002)
Open Access PDF
The Drosophila melanogaster genome was the first metazoan genome to have been sequenced by the whole-genome shotgun (WGS) method. Two issues relating to this achievement were widely debated in the genomics community: how correct is the sequence with respect to base-pair (bp) accuracy and frequency of assembly errors? And, how difficult is it to bring a WGS sequence to the accepted standard for finished sequence? We are now in a position to answer these questions.

J C Venter, M D Adams, E W Myers, P W Li, Richard J Mural, Granger G Sutton, H O Smith, Mark D Yandell, Cheryl A Evans, Robert A Holt, Jeannine D Gocayne, Peter G Amanatides, R M Ballew, Daniel H. Huson, Jennifer R Wortman, Qinyu Zhang, Chinnappa Kodira, X H Zheng, L Chen, M Skupski, G Subramanian, P D Thomas, J Zhang, George L Gabor Miklos, C Nelson, Samuel Broder, Andrew Clark, Joe Nadeau, V A McKusick, N Zinder, A J Levine, R J Roberts, M Simon, C Slayman, Michael W Hunkapiller, R A Bolanos, A L Delcher, I M Dew, D P Fasulo, M J Flanigan, Liliana Florea, Aaron L. Halpern, Sridhar Hannenhalli, S A Kravitz, S Levy, C M Mobarry, Knut Reinert, K A Remington, J Abu-Threideh, E M Beasley, Kendra Biddick, V Bonazzi, R C Brandon, M Cargill, I Chandramouliswaran, Rosane Charlab, Kabir Chaturvedi, Zuoming Deng, V Di Francesco, P Dunn, K Eilbeck, Carlos C Evangelista, Andrei E Gabrielian, Weiniu Gan, W Ge, F Gong, Z Gu, Ping Guan, T J Heiman, M E Higgins, Rui-Ru Ji, Z Ke, K A Ketchum, Z Lai, Y Lei, Z Li, J Li, Y Liang, X Lin, F Lu, Gennady V Merkulov, Natalia V Milshina, H M Moore, Ashwinikumar K Naik, V A Narayan, B Neelam, D R Nusskern, Douglas B Rusch, Steven L Salzberg, W Shao, Bixiong Chris Shue, J Sun, Z Wang, A Wang, X Wang, J Wang, M Wei, Ron Wides, C Xiao, C Yan, Alison Yao, Jane Ye, M Zhan, W Zhang, H Zhang, Q Zhao, L Zheng, F Zhong, W Zhong, S Zhu, Senming Zhao, D Gilbert, S Baumhueter, G Spier, Crystal N. Carter, Anibal Cravchik, T Woodage, F Ali, Hui-Jin An, A Awe, Danita Baldwin, Holly Baden, Mary Barnstead, I Barrow, Karen Y Beeson, D Busam, Amy Carver, Angela Center, M L Cheng, L Curry, Steven Danaher, Lionel B Davenport, Raymond Desilets, Susanne M Dietz, Kristina L Dodson, Lisa Doup, Steven Ferriera, N Garg, Andres Gluecksmann, Brit Hart, J Haynes, C Haynes, C Heiner, Susanne L Hladun, D Hostin, Jarrett T Houck, Tim Howland, Chinyere Ibegwam, J Johnson, Francis Kalush, Leslie Kline, S Koduru, A Love, F Mann, David May, S McCawley, T McIntosh, I McMullen, M Moy, L Moy, B Murphy, K Nelson, Cynthia Pfannkoch, Eric C Pratts, V Puri, Hina Qureshi, Matthew S Reardon, R Rodriguez, Y H Rogers, Deanna L Romblad, B Ruhfel, R Scott, Cynthia D Sitter, Michella Smallwood, E Stewart, Renee Strong, E Suh, R Thomas, Ni Ni Tint, S Tse, C Vech, G Wang, J Wetter, S Williams, M Williams, Sandra M Windsor, E Winn-Deen, Keriellen Wolfe, Jayshree Zaveri, K Zaveri, Josep F Abril, R Guigó, M J Campbell, K V Sjolander, B Karlak, A Kejariwal, H Mi, B Lazareva, T Hatton, A Narechania, K Diemer, A Muruganujan, N Guo, S Sato, V Bafna, Sorin Istrail, Ross Lippert, R Schwartz, Brian Walenz, Shibu Yooseph, D Allen, A Basu, J Baxendale, L Blick, M Caminha, J Carnes-Stine, Parris M Caulk, Y H Chiang, My D Coyne, Carl E Dahlke, A Mays, M Dombroski, Michael T Donnelly, D Ely, S Esparham, Carl R Fosler, Harold C Gire, S Glanowski, K Glasser, Anna Glodek, M Gorokhov, K Graham, Barry Gropman, M Harris, J Heil, S Henderson, Jeffrey Hoover, D Jennings, C M Jordan, J Jordan, J Kasha, L Kagan, C Kraft, Alex Levitsky, M Lewis, X Liu, J Lopez, D Ma, William H Majoros, J McDaniel, S Murphy, Matthew H Newman, T Nguyen, N Nguyen, Marc Nodell, S Pan, J Peck, M Peterson, W Rowe, Robert Sanders, J Scott, M Simpson, T Smith, Andrew Sprague, T Stockwell, R Turner, E Venter, M Wang, M Wen, D Wu, M Wu, Ashley C Xia, A Zandieh, X Zhu
The sequence of the human genome.
Science, 291(5507) 1304-1351 (2001)
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

Gene Myers
Optimally separating sequences.
Genome Inform, 12 165-174 (2001)
We consider the problem of separating two distinct classes of k similar sequences of length n over an alphabet of size s that have been optimally multi-aligned. An objective function based on minimizing the consensus score of the separated halves is introduced and we present an O(k^3 n) heuristic algorithm and two optimal branch-and-bound algorithms for the problem. The branch-and-bound algorithms involve progressively more powerful lower bound functions for pruning the O(2^k) search tree. The simpler lower bound takes O(n) time to evaluate given O(sn) global data structures and the stronger bound takes O((k+s)n) time by virtue of an efficient solution to the problem of finding the second-maximum envelope of a set of piece-wise affine curves. In a series of empirical trials we establish the degree to which classes can be separated using our metric and the effective pruning efficiency of the two branch-and-bound algorithms.

Daniel H. Huson, Knut Reinert, S A Kravitz, K A Remington, A L Delcher, I M Dew, M J Flanigan, Aaron L. Halpern, Z Lai, C M Mobarry, Granger G Sutton, E W Myers
Design of a compartmentalized shotgun assembler for the human genome.
Bioinformatics, 17 Suppl 1 132-139 (2001)
Two different strategies for determining the human genome are currently being pursued: one is the "clone-by-clone" approach, employed by the publicly funded project, and the other is the "whole genome shotgun assembler" approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by both approaches. In this paper we describe the design, implementation and operation of the "compartmentalized shotgun assembler".

Gene Myers
Comparing sequence scaffolds
In: Proceedings of the fifth annual international conference on Computational biology (2001), New York, ACM (2001), 224-230

Daniel H. Huson, Aaron L. Halpern, Zhongwu Lai, Eugene W Myers
Comparing Assemblies Using Fragments and Mate-Pairs
In: Proceedings in Algorithms in Bioinformatics : First International Workshop, WABI 2001 Århus Denmark (2001) Lecture notes in computer science ; 2149, Berlin;Heidelberg, Springer (2001), 294-306

M D Adams, Susan E Celniker, Robert A Holt, Cheryl A Evans, Jeannine D Gocayne, Peter G Amanatides, Steven E Scherer, P W Li, Roger A Hoskins, R F Galle, Reed A George, S E Lewis, Stephen Richards, Michael Ashburner, S Henderson, Granger G Sutton, Jennifer R Wortman, Mark D Yandell, Qinyu Zhang, L X Chen, R C Brandon, Y H Rogers, R G Blazej, Mark Champe, B D Pfeiffer, Kenneth Wan, C Doyle, E G Baxter, G Helt, C R Nelson, G L Gabor, Josep F Abril, A Agbayani, Hui-Jin An, C Andrews-Pfannkoch, Danita Baldwin, R M Ballew, A Basu, J Baxendale, L Bayraktaroglu, E M Beasley, Karen Y Beeson, P V Benos, Benjamin P Berman, D Bhandari, S Bolshakov, D Borkova, M R Botchan, J Bouck, P Brokstein, P Brottier, K C Burtis, D Busam, H Butler, E Cadieu, Angela Center, I Chandra, J M Cherry, S Cawley, Carl E Dahlke, Lionel B Davenport, P Davies, B de Pablos, A L Delcher, Zuoming Deng, A Mays, I M Dew, Susanne M Dietz, Kristina L Dodson, Lisa Doup, M Downes, S Dugan-Rocha, B C Dunkov, P Dunn, K J Durbin, Carlos C Evangelista, C Ferraz, Steven Ferriera, W Fleischmann, Carl R Fosler, Andrei E Gabrielian, N Garg, W M Gelbart, K Glasser, Anna Glodek, F Gong, J H Gorrell, Z Gu, Ping Guan, M Harris, N L Harris, D Harvey, T J Heiman, J R Hernandez, Jarrett T Houck, D Hostin, K A Houston, Tim Howland, M H Wei, Chinyere Ibegwam, Mena Jalali, Francis Kalush, Gary H Karpen, Z Ke, J A Kennison, K A Ketchum, B E Kimmel, Chinnappa Kodira, C Kraft, S A Kravitz, D Kulp, Z Lai, P Lasko, Y Lei, Alex Levitsky, J Li, Z Li, Y Liang, X Lin, X Liu, B Mattei, T C McIntosh, M P McLeod, D McPherson, Gennady V Merkulov, Natalia V Milshina, C M Mobarry, J Morris, A Moshrefi, S M Mount, M Moy, B Murphy, L Murphy, Donna M Muzny, D L Nelson, D R Nelson, K A Nelson, K Nixon, D R Nusskern, Joanne M Pacleb, M Palazzolo, G S Pittman, S Pan, J Pollard, V Puri, M G Reese, Knut Reinert, K A Remington, R D Saunders, F Scheeler, H Shen, Bixiong Chris Shue, I Sidén-Kiamos, M Simpson, M Skupski, T Smith, E Spier, A C Spradling, Mark Stapleton, Renee Strong, E Sun, Robert R Svirskas, C Tector, R Turner, E Venter, A H Wang, X Wang, Z Y Wang, D A Wassarman, George Weinstock, Jean Weissenbach, S M Williams, T Woodage, K C Worley, D Wu, S Yang, Q A Yao, Jane Ye, R F Yeh, J S Zaveri, M Zhan, G Zhang, Q Zhao, L Zheng, X H Zheng, F N Zhong, W Zhong, X Zhou, S Zhu, X Zhu, H O Smith, Richard A Gibbs, E W Myers, G M Rubin, J C Venter
The genome sequence of Drosophila melanogaster.
Science, 287(5461) 2185-2195 (2000)
The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the approximately 120-megabase euchromatic portion of the Drosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes approximately 13,600 genes, somewhat fewer than the smaller Caenorhabditis elegans genome, but with comparable functional diversity.

E W Myers, Granger G Sutton, A L Delcher, I M Dew, D P Fasulo, M J Flanigan, S A Kravitz, C M Mobarry, Knut Reinert, K A Remington, Eric Anson, R A Bolanos, H H Chou, C M Jordan, Aaron L. Halpern, S Lonardi, E M Beasley, R C Brandon, L Chen, P J Dunn, Z Lai, Y Liang, D R Nusskern, M Zhan, Qinyu Zhang, X Zheng, G M Rubin, M D Adams, J C Venter
A whole-genome assembly of Drosophila.
Science, 287(5461) 2196-2204 (2000)
We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.

Gene Myers
Whole-Genome DNA Sequencing
Comput Sci Eng, 1(3) 33-43 (1999)

Gene Myers
A fast bit-vector algorithm for approximate string matching based on dynamic programming
Journal of the ACM (JACM), 46(3) 395-415 (1999)

Eric Anson, Gene Myers
Algorithms for Whole Genome Shotgun Sequencing
In: Proceedings of the third annual international conference on Computational molecular biology (1999), New York, ACM (1999), 1-9

Gene Myers
A Dataset Generator for Whole Genome Shotgun Sequencing
In: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (1999), Heidelberg, Germany, Intelligent Systems for Molecular Biology (1999), 202-210

S Levy, L Compagnoni, E W Myers, G D Stormo
Xlandscape: the graphical display of word frequencies in sequences.
Bioinformatics, 14(1) 74-80 (1998)
To provide a graphical interface for the generation, display and manipulation of a sequence landscape that will run on all X-windows-based Unix workstations.

Eugene W Myers, Paulo Oliva, Katia S. Guimarães
Reporting Exact and Approximate Regular Expression Matches
In: Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching (1998) Lecture Notes in Computer Science ; 1448, New York, Springer (1998), 91-103

Gad M Landau, Eugene W Myers#, Jeanette P Schmidt#
Incremental String Comparison
SIAM J Comput, 27(2) 557-582 (1998)
The problem of comparing two sequences A and B to determine their longest common subsequence (LCS) or the edit distance between them has been much studied. In this paper we consider the following incremental version of these problems: given an appropriate encoding of a comparison between A and B, can one incrementally compute the answer for A and bB, and the answer for A and Bb with equal efficiency, where b is an additional symbol? Our main result is a theorem exposing a surprising relationship between the dynamic programming solutions for two such "adjacent" problems. Given a threshold k on the number of differences to be permitted in an alignment, the theorem leads directly to an O(k) algorithm for incrementally computing a new solution from an old one, in contrast to the O(k^2) time required to compute a solution from scratch. We further show, with a series of applications, that this algorithm is indeed more powerful than its nonincremental counterpart. We show this by solving the applications with greater asymptotic efficiency than heretofore possible. For example, we obtain O(nk) algorithms for the longest prefix approximate match problem, the approximate overlap problem, and cyclic string comparison.

M F Sagot, E W Myers
Identifying satellites and periodic repetitions in biological sequences.
J Comput Biol, 5(3) 539-553 (1998)
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to epsilon = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2 epsilon from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when epsilon = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as the span of the satellite. The first phase was designed for efficiency and takes only O(n) time where n is the length of the sequence. The second phase was designed for sensitivity and takes time O(n · N(e, k)) in the worst case where k is the length of the repeating unit m, e = [epsilon k] is the number of differences allowed between each repeat unit and the model m, and N(e, k) is the maximum number of words that are not more than e differences from another word of length k. That is, N(e, k) is the maximum size of an e-neighborhood of a string of length k. Experiments reveal the second phase to be considerably faster in practice than the worst-case complexity bound suggests. Finally, the present algorithm is easily adapted to finding tandem repeats in protein sequences, as well as extended to identifying mixed direct-inverse tandem repeats.

J L Weber, E W Myers
Human whole-genome shotgun sequencing.
Genome Res, 7(5) 401-409 (1997)

Eric Anson, E W Myers
ReAligner: a program for refining DNA sequence multi-alignments.
J Comput Biol, 4(3) 369-383 (1997)
We present a round-robin realignment algorithm that improves a potentially crude initial alignment of an assembled collection of DNA sequence fragments, as might, for example, be output by a typical fragment assembly program. The algorithm uses a weighted combination of two scoring schemes to achieve superior multi-alignments, and employs a banded dynamic programming variation to achieve a running time that is linear in the amount of sequence in the data set. We demonstrate that the algorithm improves upon the alignments produced by other assembly programs in a series of empirical experiments on simulated data. Finally, we present a pair of programs embodying the algorithms that are available from the Web site ftp://ftp.cs.arizona.edu/realigner.
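
As a rough illustration of why restricting dynamic programming to a band keeps the work linear in the sequence length for a fixed band width, here is a minimal Python sketch of banded edit distance between two strings. It is not the ReAligner algorithm or its scoring scheme; the function name `banded_edit_distance` and the unit-cost model are assumptions made for this example.

```python
def banded_edit_distance(a, b, band):
    """Unit-cost edit distance computed only within `band` diagonals.

    Hypothetical helper for illustration. Only cells with |i - j| <= band
    are filled, so the work is O(len(a) * band) rather than quadratic.
    The result equals the true edit distance whenever that distance is
    at most `band`; otherwise it is only an upper bound. Returns None
    when the length difference alone already exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: converting the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # delete the first i symbols of a
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev.get(m)


print(banded_edit_distance("kitten", "sitting", 3))
```

Cells outside the band are treated as unreachable, which is the usual trade-off: a narrow band is fast but can miss alignments that wander far from the main diagonal.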

Mudita Jain, E W Myers
Algorithms for computing and integrating physical maps using unique probes.
J Comput Biol, 4(4) 449-466 (1997)
Current physical mapping projects based on STS-probes involve additional clues such as the fact that some probes are anchored to a known map and that others come from the ends of clones. Because of the disparate combinatorial contributions of these varied data items, it is difficult to design a "tailored" algorithm that incorporates them all. Moreover, it is inevitable that new experiments will provide new kinds of data, making obsolete any such algorithm. We show how to convert the physical mapping problem into a 0/1 linear programming (LP) problem. We further show how one can incorporate additional clues as additional constraints in the LP formulation. We give a simple relaxation of the 0/1 LP problem, which solves problems of the same scale as previously reported tailored algorithms, to equal or greater optimization levels. We also present a theorem proving that when the data is 100% accurate, then the relaxed and integer solutions coincide. The LP algorithm suffices to solve problems on the order of 80-100 probes--the typical size of the 2- or 3-connected contigs of Arratia et al. (1991). We give a heuristic algorithm which attempts to order and link the set of LP-solved contigs. Unlike previous work, this algorithm only links and orders contigs when the join is 90% or more likely to be correct. It is our view that there is no value in computing an optimal solution with respect to some criteria over very noisy data as this optimal solution rarely corresponds to the true solution. The paper involves extensive empirical trials over real and simulated data.

Eugene W Myers, Sanford Selznick#, Zheng Zhang#, Webb Miller#
Progressive Multiple Alignment with Constraints
J Comput Biol, 3(4) 563-572 (1996)
A progressive alignment algorithm produces a multi-alignment of a set of sequences by repeatedly aligning pairs of sequences and/or previously generated alignments. We describe a method for guaranteeing that the alignment generated by a progressive alignment strategy satisfies a user-specified collection of constraints about where certain sequence positions should appear relative to others. Given a collection of constraints over sequences whose total length is , our algorithm takes time. An alignment of the -like globin gene clusters of several mammals illustrates the practicality of the method.

Gene Myers, Mudita Jain
Going Against the Grain
In: Proceedings of the Third South American Workshop on String Processing : Recife, Brazil (1996), Recife, Brazil, South American Workshop on String Processing (1996), 203-213

E W Myers
Approximate matching of network expressions with spacers.
J Comput Biol, 3(1) 33-51 (1996)
Two algorithmic results are presented that are pertinent to the matching of patterns typically used by biologists to describe regions of macromolecular sequences that encode a given function. The first result is a threshold-sensitive algorithm for approximately matching both network and regular expressions. Network expressions are regular expressions that can be composed only from union and concatenation operators. Kleene closure (i.e., unbounded repetition) is not permitted. The algorithm is threshold-sensitive in that its performance depends on the threshold, k, of the number of differences allowed in an approximate match. This result generalizes the O(kn) expected-time algorithm of Ukkonen for approximately matching keywords. The second result concerns the problem of matching a pattern that is a network expression whose elements are approximate matches to network or regular expressions interspersed with specifiable distance ranges. For this class of patterns, it is shown how to determine a backtracking procedure whose order of evaluation is optimal in the sense that its expected time is minimal over all such procedures.

Sampath K Kannan, Eugene W Myers
An Algorithm for Locating Non-overlapping Regions of Maximum Alignment Score
SIAM J Comput, 25(3) 648-662 (1996)

Sun Wu, Udi Manber, Eugene W Myers
A Sub-Quadratic Algorithm for Approximate Regular Expression Matching
J of Algorithms, 19(3) 346-360 (1995)

James R Knight, Eugene W Myers
Approximate regular expression pattern matching with concave gap penalties
Algorithmica, 14(1) 85-121 (1995)

Gene Myers
Approximately matching context-free languages.
Inf Process Lett, 54(2) 85-92 (1995)

James R Knight, Eugene W Myers
Super-Pattern Matching
Algorithmica, 13(1) 211-243 (1995)

John D. Kececioglu, Eugene W Myers
Combinatorial algorithms for DNA sequence assembly
Algorithmica, 13(1) 7-51 (1995)

E W Myers
Toward simplifying and accurately formulating fragment assembly.
J Comput Biol, 2(2) 275-290 (1995)
The fragment assembly problem is that of reconstructing a DNA sequence from a collection of randomly sampled fragments. Traditionally, the objective of this problem has been to produce the shortest string that contains all the fragments as substrings, but in the case of repetitive target sequences this objective produces answers that are overcompressed. In this paper, the problem is reformulated as one of finding a maximum-likelihood reconstruction with respect to the two-sided Kolmogorov-Smirnov statistic, and it is argued that this is a better formulation of the problem. Next the fragment assembly problem is recast in graph-theoretic terms as one of finding a noncyclic subgraph with certain properties and the objectives of being shortest or maximally likely are also recast in this framework. Finally, a series of graph reduction transformations are given that dramatically reduce the size of the graph to be explored in practical instances of the problem. This reduction is very important as the underlying problems are NP-hard. In practice, the transformed problems are so small that simple branch-and-bound algorithms successfully solve them, thus permitting auxiliary experimental information to be taken into account in the form of overlap, orientation, and distance constraints.

Gene Myers, Webb Miller
Chaining multiple-alignment fragments in sub-quadratic time
In: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms (1995), New York, ACM (1995), 38-47

Mudita Jain, E W Myers
A note on scoring clones given a probe ordering.
J Comput Biol, 2(1) 33-37 (1995)
We present an efficient algorithm for scoring clones given an ordering of probes under a schema proposed by Alizadeh et al. (1994) in the context of physical mapping with unique probes. The algorithm runs in time linear in the number of blocks of ones in the underlying sparse incidence matrix. A sparse and efficient algorithm for this task is important as it appears to be a central task in most algorithms for physical mapping.

Eugene W Myers
A sublinear algorithm for approximate keyword searching
Algorithmica, 12(4) 345-374 (1994)

Carolyn J Lawrence, S Honda, N W Parrott, T C Flood, L Gu, L Zhang, Mudita Jain, S Larson, E W Myers
The genome reconstruction manager: a software environment for supporting high-throughput DNA sequencing.
Genomics, 23(1) 192-201 (1994)
A new software system designed for use in high-throughput DNA sequencing laboratories is described. The Genome Reconstruction Manager (GRM) was developed from requirements derived from ongoing large-scale DNA sequencing projects. Object-oriented principles were followed in designing the system, and tools supporting object-oriented system development were employed for its implementation. GRM provides several advances in software support for high-throughput DNA sequencing: support for random, directed, and mixed sequencing strategies; a novel system for fragment assembly; a commercial object data-base management system for data storage; a client/server architecture for using network computational servers; and an underlying data model that can evolve to support fully automatic sequence reconstruction. GRM is currently being deployed for production use in high-throughput DNA sequencing projects.

Gerhard Mehldau, Gene Myers
A system for pattern matching applications on biosequences.
Comput Appl Biosci, 9(3) 299-314 (1993)
ANREP is a system for finding matches to patterns composed of (i) spacing constraints called 'spacers', and (ii) approximate matches to 'motifs' that are, recursively, patterns composed of 'atomic' symbols. A user specifies such patterns via a declarative, free-format and strongly typed language called A that is presented here in a tutorial style through a series of progressively more complex examples. The sample patterns are for protein and DNA sequences, the application domain for which ANREP was specifically created. ANREP provides a unified framework for almost all previously proposed biosequence patterns and extends them by providing approximate matching, a feature heretofore unavailable except for the limited case of individual sequences. The performance of ANREP is discussed and an appendix gives a concise specification of syntax and semantics. A portable C software package implementing ANREP is available via anonymous remote file transfer.

Udi Manber, Eugene W Myers
Suffix arrays: A new method for on-line string searches
SIAM J Comput, 22(5) 935-948 (1993)
A new and conceptually simple data structure, called a suffix array, for on-line string searches is introduced in this paper. Constructing and querying suffix arrays is reduced to a sort and search paradigm that employs novel algorithms. The main advantage of suffix arrays over suffix trees is that, in practice, they use three to five times less space. From a complexity standpoint, suffix arrays permit on-line string searches of the type, "Is W a substring of A?" to be answered in time O(P + log N), where P is the length of W and N is the length of A, which is competitive with (and in some cases slightly better than) suffix trees. The only drawback is that in those instances where the underlying alphabet is finite and small, suffix trees can be constructed in O(N) time in the worst case, versus O(N log N) time for suffix arrays. However, we give an augmented algorithm that, regardless of the alphabet size, constructs suffix arrays in O(N) expected time, albeit with lesser space efficiency. We believe that suffix arrays will prove to be better in practice than suffix trees for many applications.
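
The sort-and-search idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not the paper's construction or query algorithm: the array is built by sorting suffixes directly, and the query is a plain binary search costing O(P log N) string comparisons rather than the O(P + log N) bound quoted above. The function names `build_suffix_array` and `contains` are invented for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction for illustration: sorting materialised suffixes is
    far slower than the O(N log N) method described in the abstract.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, suffix_array, pattern):
    """Answer "is `pattern` a substring of `text`?" by binary search.

    Each probe compares the pattern with the first len(pattern) characters
    of one suffix; these truncated prefixes appear in non-decreasing order
    in the suffix array, so binary search finds the first candidate.
    """
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(contains(text, sa, "nan"), contains(text, sa, "nab"))
```

The array stores one integer per text position, which is the source of the space advantage over suffix trees mentioned in the abstract.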

E W Myers, X Huang
An O (N² log N) restriction map comparison and search algorithm.
Bull Math Biol, 54(4) 599-618 (1992)
We present an O(R log P) time, O(M + P^2) space algorithm for searching a restriction map with M sites for the best matches to a shorter map with P sites, where R, the number of matching site pairs, is bounded by MP. As first proposed by Waterman et al. (1984, Nucl. Acids Res. 12, 237-242) the objective function used to score matches is additive in the number of unaligned sites and the discrepancies in the distances between adjacent aligned sites. Our algorithm is basically a sparse dynamic programming computation in which "candidate lists" are used to model the future contribution of all previously computed entries to those yet to be computed. A simple modification to the algorithm computes the distance between two restriction maps with M and N sites, respectively, in O(MN(log M + log N)) time.

Gene Myers
A Four Russians algorithm for regular expression pattern matching
Journal of the ACM (JACM), 39(2) 432-448 (1992)

S F Altschul, W Gish, W Miller, E W Myers, D J Lipman
Basic local alignment search tool.
J Mol Biol, 215(3) 403-410 (1990)
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

Sun Wu, Udi Manber, Gene Myers, Webb Miller
An O(NP) Sequence Comparison Algorithm
Inf Process Lett, 35(6) 317-323 (1990)

Eugene W Myers, Webb Miller
Row replacement algorithms for screen editors
ACM Trans Prog Lang and Systems, 11(1) 33-56 (1989)

E W Myers, W Miller
Approximate matching of regular expressions.
Bull Math Biol, 51(1) 5-37 (1989)
Given a sequence A and regular expression R, the approximate regular expression matching problem is to find a sequence matching R whose optimal alignment with A is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in time O(MN), where M and N are the lengths of A and R. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in addition to integer costs, with no loss of asymptotic efficiency. Second, it requires only O(N) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in molecular biology, and further refine it to search for sub-strings of A that strongly align with a sequence in R, as required for typical data base searches. We also show how to deliver an optimal alignment between A and R in only O(N + log M) space using O(MN log M) time. Finally, an O(MN(M + N) + N^2 log N) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of its length.

E W Myers, W Miller
Optimal alignments in linear space.
Comput Appl Biosci, 4(1) 11-17 (1988)
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the new proposals, both in theory and in practice. The goal of this paper is to give Hirschberg's idea the visibility it deserves by developing a linear-space version of Gotoh's algorithm, which accommodates affine gap penalties. A portable C-software package implementing this algorithm is available on the BIONET free of charge.
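
Hirschberg's divide-and-conquer idea can be illustrated on the simpler longest common subsequence problem. The Python sketch below is an assumption-laden simplification: it uses LCS scoring on strings rather than Gotoh's affine gap penalties, so it is not the algorithm of this paper, and the function names `lcs_last_row` and `hirschberg_lcs` are invented for the example.

```python
def lcs_last_row(a, b):
    """LCS length of `a` against every prefix of `b`, keeping one row at a time."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg_lcs(a, b):
    """Return one longest common subsequence of strings `a` and `b`.

    Working space is linear in the input: split `a` in half, score the
    first half forwards and the second half backwards against `b`, pick
    the split point of `b` where the two scores sum to the maximum, and
    recurse on the two resulting subproblems.
    """
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    left = lcs_last_row(a[:mid], b)
    right = lcs_last_row(a[mid:][::-1], b[::-1])
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg_lcs(a[:mid], b[:split]) + hirschberg_lcs(a[mid:], b[split:])


print(hirschberg_lcs("AGGTAB", "GXTXAYB"))
```

The total time stays within a constant factor of the quadratic table fill, because each level of recursion processes subproblems whose combined area is about half that of the level above.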

J D Hall, E W Myers
A software tool for finding locally optimal alignments in protein and nucleic acid sequences.
Comput Appl Biosci, 4(1) 35-40 (1988)
We describe software for aligning protein or nucleic acid sequences based on the concept of match density. This method is especially useful for locating regions of short similarity between two longer sequences which may be largely dissimilar (e.g. locating active site regions in distantly related proteins). Our software is able to identify biologically interesting similarities between two sub-regions because it allows the user to control the matching parameters and the manner in which local alignments are selected for display. Furthermore, the collection and ranking of alignments for display uses a novel, highly efficient algorithm. We illustrate these features with several examples. In addition, we show that this tool can be used to find a new conserved sequence in several viral DNA polymerases, which, we suggest, occurs at a functionally important enzymatic site.

W Miller, E W Myers
Sequence comparison with concave weighting functions.
Bull Math Biol, 50(2) 97-120 (1988)

Webb Miller, Eugene Myers
A Simple Row-replacement Method
Softw Pract Exp, 18(7) 597-611 (1988)
Updating a video screen involves row replacement, i.e. the task of updating an existing screen row to produce the desired row. In many environments, screen operations require transmitting characters to the terminal by a process that is painfully slow compared to computing speeds. Thus, it is worth while to compute a minimal set of row updating commands, as long as the time to do so does not outweigh the savings in character transmission time. This paper presents a simple and practical algorithm for optimal row replacement and describes experience with its use in a screen editor.

Christopher W. Fraser, Eugene W Myers
An Editor for Revision Control
ACM Trans Prog Lang and Systems, 9(2) 277-295 (1987)

Eugene W Myers
An O(ND) Difference Algorithm and Its Variations
Algorithmica, 1(1) 251-266 (1986)

Webb Miller, Eugene W Myers
Side-effects in Automatic File Updating
Softw Pract Exp, 16(9) 809-820 (1986)

E W Myers, D W Mount
Computer program for the IBM personal computer which searches for approximate matches to short oligonucleotide sequences in long target DNA sequences.
Nucleic Acids Res, 14(1) 501-508 (1986)
Open Access PDF
We describe a program which may be used to find approximate matches to a short predefined DNA sequence in a larger target DNA sequence. The program predicts the usefulness of specific DNA probes and sequencing primers and finds nearly identical sequences that might represent the same regulatory signal. The program is written in the C programming language and will run on virtually any computer system with a C compiler, such as the IBM/PC and other computers running under the MS/DOS and UNIX operating systems. The program has been integrated into an existing software package for the IBM personal computer (see article by Mount and Conrad, this volume). Some examples of its use are given.

Eugene Myers
An O(E log E + I) Expected Time Algorithm for the Planar Segment Intersection Problem
SIAM J Comput, 14(3) 625-637 (1985)
It is an open question in computational geometry as to whether there exists an O(E log E + I) algorithm to determine the I intersections of a collection of E line segments in the plane. An approach utilizing a work list bubble sort and a distribution-based search is presented. The resulting algorithm has O(E log E + I) expected time complexity. In the worst case the algorithm has the same complexity as the algorithm of Bentley and Ottmann [IEEE Trans. Comput., 28 (1979), pp. 643-647]: O(E log E + I log E). The algorithm requires only O(E) space and in contrast to prior work, no restrictions are placed upon the nature of the intersections.

Webb Miller, Eugene Myers
A File Comparison Program
Softw Pract Exp, 15(11) 1026-1040 (1985)
This paper presents a simple method for computing a shortest sequence of insertion and deletion commands that converts one given file to another. The method is particularly efficient when the difference between the two files is small compared to the files' lengths. In experiments performed on typical files, the program often ran four times faster than the UNIX diff command.
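
One well-known way to compute the length of a shortest insert/delete script, and to do so quickly when the two inputs are similar, is a greedy search over diagonals of the edit graph. The Python sketch below follows that greedy scheme as it is commonly presented for the O(ND) difference algorithm; whether it matches the program described in this paper is an assumption, and the function name `shortest_edit_length` is invented for the example.

```python
def shortest_edit_length(a, b):
    """Minimum number of single-element insertions and deletions turning a into b.

    Greedy diagonal search: for each edit count d, record the furthest x
    reached on every diagonal k = x - y, sliding along matching elements
    for free. The work grows with the size of the difference, so similar
    inputs are handled quickly.
    """
    n, m = len(a), len(b)
    furthest = {1: 0}  # furthest[k] = largest x reached on diagonal k
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]      # step down: insert an element of b
            else:
                x = furthest[k - 1] + 1  # step right: delete an element of a
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x += 1                   # follow the diagonal of matches
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m  # not reached: the loop always returns by d = n + m


print(shortest_edit_length("ABCABBA", "CBABAC"))
```

For file comparison the elements would be lines rather than characters; the function accepts any two indexable sequences, for example lists of lines.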

Christopher W. Fraser, Eugene W Myers, Alan L. Wendt
Analyzing and Compressing Assembly Code
In: Proceedings of the SIGPLAN symposium on Compiler construction (1984) ACM SIGPLAN Notices, New York, ACM (1984), 117-121

Eugene W Myers
Efficient applicative data types
In: Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages (1984), New York, ACM (1984), 66-75

Eugene W Myers
An applicative random-access stack
Inf Process Lett, 17(5) 241-248 (1983)

Leon J. Osterweil, Eugene W Myers
BIGMAC II: A FORTRAN language augmentation tool
In: Proceedings of the 5th international conference on Software engineering (1981), Piscataway, N.J., IEEE (1981), 410-421

Eugene W Myers
A precise inter-procedural data flow algorithm
In: Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages (1981), New York, ACM (1981), 219-230

Harold N Gabow, Eugene W Myers
Finding All Spanning Trees of Directed and Undirected Graphs
SIAM J Comput, 7(3) 280-287 (1978)
An algorithm for finding all spanning trees (arborescences) of a directed graph is presented. It uses backtracking and a method for detecting bridges based on depth-first search. The time required is O(V + E + EN) and the space is O(V + E), where V, E, and N represent the number of vertices, edges, and spanning trees, respectively. If the graph is undirected, the time decreases to O(V + E + VN), which is optimal to within a constant factor. The previously best-known algorithm for undirected graphs requires time O(V + E + EN).