CS 234, Winter 2008: Computational Methods for the Analysis of Biomolecular Data
A staggering wealth of data has been generated by genome sequencing projects and other efforts to determine the structures and functions of biological systems. This advanced graduate course will focus on a selection of computational problems aimed at automatically analyzing, clustering, and classifying biomolecular data.
Class Meeting
TR 2:10 pm-3:30 pm OLMH 1126
Office hours
F 11am-12noon, Engineering II room 317
Preliminary list of topics
overview on probability and statistics
intro to molecular and computational biology
analysis of 1D sequence data (DNA, RNA, proteins)
combinatorial algorithms and statistical methods for pattern discovery and sequence alignment
sequence alignment and hidden Markov models (HMM)
analysis of 2D data (gene expression data and graphs)
clustering algorithms
classification algorithms
subspace clustering/bi-clustering
genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs
Prerequisites
CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some notions of probability and statistics. No biology background is assumed.
Course Format
The course will include lectures by the instructor, guest lectures, and possibly discussion sessions on special problems. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three or four assignments, mostly theoretical in nature, although some may require programming. The actual format of the course will ultimately depend on the number and the background of the students enrolled.
Relation to Other Courses
This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".
References (books)
Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
Pierre Baldi and Soren Brunak, Bioinformatics: The Machine Learning Approach, MIT Press, 1998.
Joao Setubal and Joao Carlos Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Co., 1997.
Jason Wang, Bruce A. Shapiro, and Dennis Shasha, Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications, Oxford University Press, 1999.
David Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2002.
Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
Warren J. Ewens and Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
References (papers)
Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten, "Finding Patterns in Biological Sequences", unpublished TR, University of Waterloo, 2000 [PDF format]
Alberto Apostolico, Mary Ellen Bock, Stefano Lonardi, Xuyan Xu, "Efficient Detection of Unusual Words", Journal of Computational Biology, vol. 7, no. 1/2, pp. 71-94, 2000 [PDF format]
Gesine Reinert, Sophie Schbath, Michael S. Waterman, "Probabilistic and Statistical Properties of Words: An Overview", Journal of Computational Biology, vol. 7, no. 1/2, 2000 [PDF format]
Todd Moon, "The Expectation-Maximization Algorithm", IEEE Signal Processing Magazine, Nov 1996 [PDF format]
Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and HMM", UC Berkeley, TR-97-021 [PDF format]
C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, J. C. Wootton, "Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment", Science 262, 1993 [PDF format]
Jun S. Liu, Andrew F. Neuwald, Charles E. Lawrence, "Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies", Journal of the American Statistical Association, 90(432), 1995 [PDF format]
Slides
Slides [PDF format, 2 slides/page] (Course Overview)
Slides [PDF format, 2 slides/page] (Intro to Mol Biology)
Slides [PDF format, 2 slides/page] (Some basic probability)
Slides [PDF format, 2 slides/page] (Intro to Pattern Discovery)
Slides [PDF format, 2 slides/page] (Discovery of Rigid Patterns)
Slides [PDF format, 2 slides/page] (HMM)
Slides [PDF format, 2 slides/page] (Microarrays)
Slides [PDF format, 2 slides/page] (Biological networks)
Resources
The inner life of a Cell
DNA Molecular animation
A bioinformatics glossary
What's a Genome (on-line book)
DNA interactive
Primer on Molecular Genetics
PMP Resources
Projects
Yuling Li's CS234 webpage
Jianxia Ning's CS234 webpage
Kai Tien Cheng's CS234 webpage
Cameron Allen's CS234 webpage
Chih-Ming Yen's CS234 webpage
Wenyu Huo's CS234 webpage
Qiang Zhu's CS234 webpage
Wanxing Xu's CS234 webpage
Monik Khare's CS234 webpage
Sahar Nohzadeh-Malakshah's CS234 webpage
Anne Hansen's CS234 webpage
Inci Cetindil's CS234 webpage
Abdullah Mueen's CS234 webpage
Peter Lonjers's CS234 webpage
Homework
Homework 1 (posted Jan 15, due Jan 29)
Homework 2 (posted Jan 29, due Feb 12)
Homework 3 (posted Feb 13, due Feb 26)
Presentation
choose a slot 1-12 below and send me your choice
choose a paper among RECOMB 2007 or ISMB/ECCB 2007 proceedings and send the title to me
send the PowerPoint file to me the day before the presentation (before 5pm)
give the 15-minute presentation (make sure you time it correctly, I will stop you after 15 minutes)
Calendar of Lectures
Jan 8: Intro, Molecular Biology
Jan 10: Molecular Biology
Jan 15: Molecular Biology [hw1 posted]
Jan 17: Molecular Biology
Jan 22: Molecular Biology, Intro to Stat
Jan 24: Intro to Stat, Pattern Discovery
Jan 29: Pattern Discovery [hw1 due, hw2 posted]
Jan 31: Guest lecture by V. Vacic
Feb 5: Pattern Discovery
Feb 7: Pattern Discovery
Feb 12: Pattern Discovery [hw3 posted, hw2 due]
Feb 14: HMMs
Feb 19: MIDTERM (in class, closed books, closed notes)
Feb 21: HMMs
Feb 26: Microarrays [hw3 due]
Feb 28: Microarrays, Networks
Mar 4: Networks and Presentation (deadline for the PPT file is Mar 3, 5PM)
1: Jianxia Ning (An ensemble framework for clustering protein-protein interaction networks, ISMB/ECCB 2007)
2: Wenyu Huo (Support Vector Training of Protein Alignment Models, RECOMB'07)
Mar 6: Presentations. (deadline for the PPT file is Mar 5, 5PM)
3: Abdullah Al Mueen (Clustering Short Gene Expression Data, RECOMB'06)
4: Wanxing Xu (Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences, RECOMB 2007)
5: Qiang Zhu (A statistical method for alignment-free comparison of regulatory sequences, ISMB/ECCB 2007)
6: Yuling Li (A quantitative model for linking two disparate sets of articles in MEDLINE, ISMB 2007)
Mar 11: Presentations. (deadline for the PPT file is Mar 10, 5PM)
7: Chih-Ming Yen (GPDTI: A Genetic Programming Decision Tree Induction method ..., ISMB'07)
8: Sahar Nohzadeh-Malakshah (Meta-analysis of gene expression data: a predictor-based approach, ISMB'07)
9: Monik Khare (Homology search for genes, ISMB'07)
10: Anne Hansen (A Bayesian Model That Links Microarray mRNA Measurements to Mass Spectrometry Protein Measurements, RECOMB 2007)
Mar 13: Presentations. (deadline for the PPT file is Mar 12, 5PM)
11: Kevin Cheng (A graph-based approach to systematically reconstruct human transcriptional regulatory modules)
12: Peter Lonjers (Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data, ISMB'07)
13: Inci Cetindil (Negation of protein-protein interactions: analysis and extraction)
14: Cameron Allen (Murlet: a practical multiple alignment tool for structural RNA sequences)
Project Demo (in my office, please bring your laptop)
Mar 18: 9:30 Wanxing Xu
10:00 Jianxia Ning
10:30 Inci Cetindil
11:00 NAME
11:30 Wenyu Huo
Mar 19: 9:30 Sahar Nohzadeh-Malakshah
10:00 Qiang Zhu
10:30 Chih-Ming Yen
11:00 Anne Hansen
11:30 Monik Khare
Mar 20: 9:30 Abdullah Al Mueen
10:00 Peter Lonjers
10:30 Kai Tien Cheng
11:00 Yuling Li
11:30 Cameron Allen