CS 234, W15: Computational Methods for the Analysis of Biomolecular Data
News
Overview
An impressive wealth of data has been amassed by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.
Class Meeting
TR, 2:10 p.m. - 3:30 p.m. CHUNG 139
Office hours
Open door policy or by appointment (email me)
Preliminary list of topics
overview on probability and statistics
intro to molecular and computational biology
analysis of 1D sequence data (DNA, RNA, proteins)
  Space-efficient data structures for sequences
  Short read mapping (suffix trees, suffix arrays, BWT)
  Sequence alignment and hidden Markov models (HMM)
analysis of 2D data (gene expression data and graphs)
  clustering algorithms
  classification algorithms
  subspace clustering/bi-clustering
  genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs
Prerequisites
CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some notions of probability and statistics. No biology background is assumed.
Course Format
The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and the background of the students enrolled.
Relation to Other Courses
This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".
References (books)
Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
Dan Gusfield, Algorithms on Strings, Trees and Sequences - Computer Science and Computational Biology, Cambridge University Press, 1997.
Dan E. Krane, Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
Warren J. Ewens, Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
Marketa Zvelebil, Jeremy O. Baum, Understanding Bioinformatics, Garland Science, 2007.
References (papers)
Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
Avak Kahvejian, John Quackenbush, John F Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]
Slides
Slides [PDF Format 2slides/page] (Course Overview)
Slides [PDF Format 2slides/page] (Intro to Mol Biology)
Slides [PDF Format 2slides/page] (Mol Biology Tools)
Slides [PDF Format 2slides/page] (Indexing and Searching)
Slides [PDF Format 2slides/page] (Probability Models and Inference)
Resources
RNAi animation (Nature Genetics)
The inner life of a Cell
DNA Molecular animation
A bioinformatics glossary
What's a Genome (on-line book)
DNA interactive
Experimental Genome Science (on-line course)
Current Topics in Genome Analysis 2014 (on-line course)
Fundamentals of Biology (on-line course)
Pevzner's bioinformatics courses (on-line)
Projects
Project ideas and rules
Sepideh Azarnoosh's CS 234 project webpage
Kazi Islam's CS 234 project webpage
Weihua Pan's CS 234 project webpage
Leo Phong Vu's CS 234 project webpage
Sawyer Masonjones's CS 234 project webpage
Albert Do's CS 234 project webpage
Suhas Sureshchandra's CS 234 project webpage
Ashraful Arefeen's CS 234 project webpage
Xing (Vic) Zhang's CS 234 project webpage
Md. Abid Hasan's CS 234 project webpage
Chetas Manjunath's CS 234 project webpage
Yang Liu's CS 234 project webpage
Abbas Roayaei Ardakany's CS 234 project webpage
Homework
Homework 1 (posted Jan 15, due Jan 29)
Homework 2 (posted Jan 31, due Feb 17)
Homework 2 solution
Homework 3 (posted Feb 18, due Mar 3)
Homework 3 solution
Midterm
Mock midterm exam (posted Feb 3)
Presentation
choose a paper from the Proceedings of RECOMB 2014 or ISMB 2014 and send me its title and the slot number (1-13) in which you want to present (see below)
send me the PowerPoint file the day before the presentation (before 5pm)
give the 16-minute presentation (make sure you time it correctly; I will stop you at 16 minutes)
Calendar of Lectures
Jan 6: Intro, Molecular Biology (1-21)
Jan 8: Molecular Biology (22-43)
Jan 13: Molecular Biology (44-65)
Jan 15: Molecular Biology (66-86) [hw1 posted]
Jan 20: Molecular Biology (87-end), Molecular Biology Tools
Jan 22: Molecular Biology Tools
Jan 27: Molecular Biology Tools
Jan 29: Indexing/Searching (1-28) [hw1 due] [hw2 posted]
Feb 3: Indexing/Searching (29-)
Feb 5: Indexing/Searching (-)
Feb 10: Guest Lecture
Feb 12: Indexing/Searching (-end), Probability Models (1-15)
Feb 17: [hw2 due] [hw3 posted]
Feb 19: Probability Models (16-)
Feb 24: Probability Models, Biological Networks
Feb 26: Biological Networks
First presentation (deadline for the PPT file is Feb 25th, 5PM)
1: Yang Liu (dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes, RECOMB)
Mar 3: MIDTERM (80 minutes, in class, closed books, closed notes) [hw3 due]
Mar 5: More Presentations (deadline for the PPT file is Mar 4th, 5PM)
2: Suhas Sureshchandra (Methods for time series analysis of RNA-seq data with application to human Th17 cell differentiation, ISMB 2014)
3: Xing (Vic) Zhang (Deep learning of the tissue-regulated splicing code, ISMB 2014)
4: Abbas Roayaei (An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes)
5: Chetas Manjunath (RNA-Skim: a rapid method for RNA-Seq quantification at transcript level, ISMB 2014)
Mar 10: More Presentations (deadline for the PPT file is Mar 9th, 5PM)
6: Md. Abid Hasan (CSAX: Characterizing Systematic Anomalies in eXpression Data, RECOMB 2014)
7: Weihua Pan (PASTA: Ultra-Large Multiple Sequence Alignment, RECOMB 2014)
8: Leo Phong Vu (Exact Learning of RNA Energy Parameters from Structure, RECOMB 2014)
9: Sawyer Masonjones (Cross-study validation for the assessment of prediction algorithms, ISMB 2014)
Mar 12: More Presentations (deadline for the PPT file is Mar 11th, 5PM)
10: Kazi Islam (Robust clinical outcome prediction based on Bayesian analysis of transcriptional profiles and prior causal networks, ISMB 2014)
11: Albert Do (Learning Protein-DNA Interaction Landscapes by Integrating Experimental Data through Computational Models, RECOMB 2014)
12: Sepideh Azarnoosh (Functional association networks as priors for gene regulatory network inference, ISMB 2014)
13: Ashraful Arefeen (Large scale analysis of signal reachability, ISMB 2014)
Project Demo (20-25 minute demo, 5-10 minutes for questions, in my office; please bring your laptop)
Monday, March 16th
10:00 NAME
10:30 Leo
11:00 Xing (Vic) Zhang
11:30 Suhas Sureshchandra
Tuesday, March 17th
10:00 Yang Liu
10:30 Weihua
11:00 Md. Abid Hasan
11:30 Sawyer Masonjones
Wednesday, March 18th
9:30 Kazi Islam
10:00 Ashraful Arefeen
10:30 Albert Do
11:00 Chetas Manjunath
11:30 Abbas Roayaei