Semi-automated Genome Annotation with Segway

Michael Hoffman
Princess Margaret Cancer Centre, Toronto


Functional genomics efforts such as ENCODE and the Roadmap Epigenomics Project have described interactions between the genome and other biomolecules such as histones, transcription factors, and RNA, producing thousands of high-resolution genome-wide datasets.

Converting these data into biological conclusions remains an immense challenge, especially when analyzing multiple datasets simultaneously. To address this challenge, I created the Segway software, which transforms complex genomic datasets into easy-to-interpret labels. In this tutorial, we will discuss how to use Segway to perform unsupervised learning on multiple ENCODE ChIP-seq datasets. We will start by discussing the model behind Segway and when it is the right tool to use. We will then cover where to find data, how to prepare signal data files, how to run Segway, and how to interpret the results. After the tutorial, you should be prepared to use Segway independently.
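The workflow the tutorial covers (preparing signal data files, running Segway, obtaining labels) might look roughly like the following sketch. All file names, track names, and the label count here are illustrative placeholders, not tutorial materials; consult the Segway and Genomedata documentation for current options and accepted input formats.

```shell
# Hypothetical Segway workflow sketch; file and track names are placeholders.

# 1. Load signal tracks into a Genomedata archive for Segway to read.
genomedata-load --sequence hg38.fa \
    --track H3K4me3=h3k4me3.wig \
    --track H3K27ac=h3k27ac.wig \
    data.genomedata

# 2. Train an unsupervised segmentation model with a chosen number of labels.
segway train --num-labels=10 data.genomedata traindir

# 3. Apply the trained model to annotate the genome with those labels.
segway identify data.genomedata traindir identifydir
# The resulting segmentation (a BED file of label assignments) appears
# in identifydir and can be loaded into a genome browser for interpretation.
```

The choice of `--num-labels` trades off resolution against interpretability; a value around ten is a common starting point, after which labels are inspected and assigned biological meaning.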


Michael Hoffman is a principal investigator at the Princess Margaret Cancer Centre, a University of Toronto affiliate, where he researches the application of machine learning techniques to epigenomic data. He previously led the National Institutes of Health ENCODE Project's large-scale integration task group while at the University of Washington. He has a PhD from the University of Cambridge, where he conducted computational genomics studies at the European Bioinformatics Institute. He also has a B.S. in Biochemistry and a B.A. in the Plan II Honors Program at The University of Texas at Austin. He was named a Genome Technology Young Investigator and has received several awards for his academic work, including an NIH K99/R00 Pathway to Independence Award.

When is Reproducibility an Ethical Issue? Genomics, Personalized Medicine, and Human Error

Keith Baggerly
MD Anderson Cancer Center


Modern high-throughput biological assays let us ask detailed questions about how diseases operate, and promise to let us personalize therapy. Careful data processing is essential, because our intuition about what the answers "should" look like is very poor when we have to juggle thousands of things at once. When documentation of such processing is absent, we must apply "forensic bioinformatics" to work from the raw data and reported results to infer what the methods must have been. We will present several case studies where simple errors may have put patients at risk. This work has prompted several journals to revisit the types of information that must accompany publications. We will also discuss steps we take to avoid such errors, and lessons that can be applied to large data sets more broadly. This work has been covered in the scientific press, on the front page of the New York Times, and on 60 Minutes.


Keith Baggerly is Professor of Bioinformatics and Computational Biology at the MD Anderson Cancer Center, where he has worked since 2000. His research involves modeling structure in high-throughput biological data. Dr. Baggerly is best known for his work on "forensic bioinformatics", in which careful reexamination of existing data demonstrates the need for rigorous experimental design, preprocessing, and documentation. This work has been featured in Science, Nature, the New York Times, and on 60 Minutes. In the past few years, his work led to a US Institute of Medicine (IOM) review of the level of evidence that should be required before omics-based tests are used to guide therapy in clinical trials. He has been profiled in the Journal of the National Cancer Institute, and is a Fellow of the American Statistical Association.