CS 234: Computational Methods for the Analysis of Biomolecular Data
Overview
An impressive wealth of data has been amassed by genome, metagenome, and epigenetic projects and by other efforts to determine the structure and function of molecular biological systems. This advanced graduate course focuses on a selection of computational problems aimed at automatically analyzing biomolecular data.
Class Meeting
MW, 2:00pm - 3:30pm, Skye 170
Office hours
By appointment via Zoom (email me)
Preliminary list of topics
Intro to molecular and computational biology, including biotech tools
Overview of probability and statistics
Analysis of 1D sequence data (DNA, RNA, proteins)
  Space-efficient data structures for sequences
  Short read mapping (suffix trees, suffix arrays, BWT)
  Sequence alignment and hidden Markov models (HMM)
Analysis of 2D data (gene expression data and graphs)
  Clustering algorithms
  Classification algorithms
  Subspace clustering/bi-clustering
  Genetic networks, co-expression networks, metabolic networks, protein-protein interaction graphs
Prerequisites
CS141 (Algorithms) or CS218 (Design and Analysis of Algorithms) or equivalent knowledge. Some programming experience is expected. Students should have some familiarity with probability and statistics. No biology background is assumed.
Course Format
The course will include lectures by the instructor and presentations by the students. Students are expected to study the material covered in class. In addition to selected chapters from some of the books listed below, there may be handouts of research papers. There will be three assignments, mostly theoretical in nature, although some may require a bit of programming.
Relation to Other Courses
This course is intended to complement "CS238: Algorithms in Computational Molecular Biology" and "CS235: Data Mining Concepts".
References (books)
Richard Durbin, A. Krogh, G. Mitchison, and S. Eddy, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
Warren J. Ewens and Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction, Springer, 2001.
Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
Marketa Zvelebil and Jeremy O. Baum, Understanding Bioinformatics, Garland Science, 2007.
References (papers)
Anders Krogh, "An introduction to hidden Markov models for biological sequences" [PDF format]
Paolo Ferragina, Giovanni Manzini, "Opportunistic Data Structures with Applications", FOCS 2000 [PDF format]
Jeremy Buhler, Uri Keich, Yanni Sun, "Designing Seeds for Similarity Search in Genomic DNA", RECOMB 2003 [PDF format]
Avak Kahvejian, John Quackenbush, John F Thompson, "What would you do if you could sequence everything?", Nature Biotechnology, 2008 [PDF format]
Michael L. Metzker, "Sequencing technologies - the next generation", Nature Reviews Genetics, 2010 [PDF format]
Slides
Slides [PDF Format 2slides/page] (Course Overview)
Slides [PDF Format 2slides/page] (Intro to Mol Biology)
Slides [PDF Format 2slides/page] (Mol Biology Tools)
Slides [PDF Format 2slides/page] (Indexing and Searching)
Slides [PDF Format 2slides/page] (Probability Models and Inference)
Slides [PDF Format 2slides/page] (Bio Networks)
Resources
CS 234 Fold it! group
RNAi animation (Nature Genetics)
DNA Molecular animation
DNA interactive
Genomic Data Science Specialization (Coursera)
Bioconductor for Genomic Data Science (Coursera)
Genome Sequencing (Bioinformatics II) (Coursera)
Introduction to Genomics (NHGRI)
Fundamentals of Biology (on-line course)
Pevzner's bioinformatics courses (Coursera)
Projects
Project ideas and rules
Create your CS 234 webpage on Google
Kuntal Pal's project
Faisal Bin Ashraf's project
Aakash Saha's project
Omer Eren's project
Xianghu Wang's project
Rui Yang's project
Amun Patel's project
Jingong Huang's project
Michael Strobel's project
Ankit Gupta's project
Yuta Nakamura's project
JiaJun Yu's project
Jay Hemnani's project
Xiao Gao's project
Priyanshu Sharma's project
Jay Hemnani's project
Zizhuo Wang's project
Mohammed Armughanuddin's project
Guoyao Hao's project
Rui Ma's project
Baoju Wang's project
Homework
Homework 1 (posted Oct 10, due Oct 19, midnight), LaTeX
Homework 2 (posted Oct 19, due Oct 31, midnight), LaTeX
Homework 2 solution
Homework 3 (posted Oct 31, due Nov 9, midnight), LaTeX
Midterm
Mock midterm exam
Presentation
Choose a paper from the Proceedings of RECOMB 2022 (or earlier RECOMB editions) or ISMB 2022, and use the sign-up sheet to reserve a spot.
You may instead choose a recent computational molecular biology paper from a top journal such as Nature, Science, Cell, Genome Biology, or Genome Research, but it must have a significant computational component, and you will need my OK.
Email me the PowerPoint file the day before the presentation (before 5pm).
Give the 15-minute presentation (make sure you time it correctly; I will stop you at 15 minutes, and we will reserve a minimum of 2 minutes for questions).
Calendar of Lectures
Week 1
Sep 26: Intro, Molecular Biology (1-21)
Sep 28: Molecular Biology (22-59)
Week 2
Oct 3: Molecular Biology (60-94)
Oct 5: Molecular Biology (94-end), Molecular Biology Tools (1-42)
Week 3
Oct 10: Molecular Biology Tools (43-56) [hw1 posted]
Oct 12: Molecular Biology Tools (43-56), Indexing (1-23)
Week 4
Oct 17: Indexing (24-66)
Oct 19: Indexing (67-86) [hw1 due] [hw2 posted]
Week 5
Oct 24: Indexing (87-end)
Oct 26: Probability Models (1-)
Week 6
Oct 31: Probability Models () [hw2 due] [hw3 posted]
Nov 2: Probability Models ()
Week 7
Nov 7: Networks ()
Nov 9: Networks () [hw3 due]
Week 8
Nov 14: Midterm
Nov 16: Presentations (deadline for the PPT file is Nov 15th, 5PM)
Week 9
Nov 21: Presentations (deadline for the PPT file is Nov 20th, 5PM)
Nov 23: Presentations (deadline for the PPT file is Nov 22nd, 5PM)
Week 10
Nov 28: Presentations (deadline for the PPT file is Nov 27th, 5PM)
Nov 30: Presentations (deadline for the PPT file is Nov 29th, 5PM)
Project Demo (20-25 minute demo, 5-10 minutes of questions, via Zoom): here. Please use the Zoom meeting ID 972 8807 8095.