Name: Waleed Amjad
Contact Information: wamja001@ucr.edu
Course: CS 234 (Professor Stefano Lonardi)
Research Interests: Databases, Information Retrieval, Text/Data Mining, and Machine Learning
Project Selected: Metagenomics binning
Language Selected for Implementation: Java and MATLAB
Progress (as of March 10 2016)
Finished implementing the shuffling (mixing) of the n reads
that are passed on to the clustering component.
Completed the implementation of K-means clustering and
Naive Bayes classifier
Experiments are in progress.
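As an illustration of the mixing step, a minimal Java sketch follows. It is not the project code: the class name ReadMixer, the LabeledRead holder, and the seed parameter are hypothetical, and it assumes that each read keeps the index of its source genome so that clustering results can be checked afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: pools reads from several genomes and shuffles them.
final class ReadMixer {

    // A read plus the index of the genome it was sampled from.
    // The index is an assumption, kept only so results can be evaluated later.
    static final class LabeledRead {
        final String sequence;
        final int genomeIndex;

        LabeledRead(String sequence, int genomeIndex) {
            this.sequence = sequence;
            this.genomeIndex = genomeIndex;
        }
    }

    // readsPerGenome.get(g) holds the reads generated from genome g.
    static List<LabeledRead> mix(List<List<String>> readsPerGenome, long seed) {
        List<LabeledRead> pool = new ArrayList<>();
        for (int g = 0; g < readsPerGenome.size(); g++) {
            for (String read : readsPerGenome.get(g)) {
                pool.add(new LabeledRead(read, g));
            }
        }
        // Collections.shuffle applies a random permutation to the list in place.
        Collections.shuffle(pool, new Random(seed));
        return pool;
    }
}
```

Passing a fixed seed makes an experiment repeatable; the clustering component would see only the sequence field.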
Progress (as of February 25 2016)
Finished implementing the generation of n reads of length
l with a 1% sequencing error rate. Currently implementing the shuffling (mixing)
of the n reads that are passed on to the clustering component.
Decided to use K-means clustering.
Collecting experimental data from GenBank to be used
in the project.
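The read generation step could look roughly like the Java sketch below. It is an illustration under stated assumptions, not the project code: the class and method names are hypothetical, read start positions are drawn uniformly from the genome, and the error model is substitution-only (each base is replaced by a different base with probability errorRate, for example 0.01). Insertions and deletions are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a read simulator with substitution errors.
final class ReadSimulator {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Samples n reads of length l from the genome at uniformly random start
    // positions; each base is substituted with probability errorRate.
    static List<String> generateReads(String genome, int n, int l,
                                      double errorRate, Random rng) {
        if (l > genome.length()) {
            throw new IllegalArgumentException("read length exceeds genome length");
        }
        List<String> reads = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            int start = rng.nextInt(genome.length() - l + 1);
            char[] read = genome.substring(start, start + l).toCharArray();
            for (int j = 0; j < l; j++) {
                if (rng.nextDouble() < errorRate) {
                    read[j] = substitute(read[j], rng);
                }
            }
            reads.add(new String(read));
        }
        return reads;
    }

    // Returns a base from {A, C, G, T} that differs from the original one.
    private static char substitute(char original, Random rng) {
        char replacement;
        do {
            replacement = BASES[rng.nextInt(BASES.length)];
        } while (replacement == original);
        return replacement;
    }
}
```

For example, generateReads(genome, 1000, 100, 0.01, new Random(1)) would return 1000 reads of length 100 in which about one base in a hundred differs from the genome.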
Progress (as of February 11 2016)
Selected the first approach described below (in the
update of January 27 2016).
Started implementing the generation of n reads of length
l with a 1% sequencing error rate.
Investigating different clustering algorithms for
high-dimensional data, including K-means.
Also looking at dimensionality reduction using SVD.
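For reference, K-means in its basic form (Lloyd's algorithm) can be sketched in Java as follows. This is a minimal illustration, not the project's implementation: the class name KMeansSketch, the seed parameter, the random choice of initial centroids, and the use of squared Euclidean distance are all assumptions made for the example.

```java
import java.util.Random;

// Hypothetical minimal K-means (Lloyd's algorithm) on dense feature vectors.
final class KMeansSketch {

    // Returns, for each point, the index (0..k-1) of its assigned cluster.
    static int[] cluster(double[][] points, int k, int maxIterations, long seed) {
        int n = points.length;
        int dim = points[0].length;
        Random rng = new Random(seed);

        // Initialise centroids from k distinct point indices (assumes k <= n).
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            int j = c + rng.nextInt(n - c);
            int tmp = order[c]; order[c] = order[j]; order[j] = tmp;
            centroids[c] = points[order[c]].clone();
        }

        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // assignments are stable

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```

The result depends on the initial centroids, so in practice one would likely run it with several seeds and keep the clustering with the smallest total within-cluster distance.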
Progress (as of January 27 2016)
Reading and evaluating approaches, including:
Suggestion provided as part of the project description: use the
distribution of k-mers in each read, typically
4-mers. Represent the count of occurrences of each of the 256 possible 4-mers in
the read as a 256-dimensional vector, then run a clustering algorithm on these
vectors to decide where to assign the reads (e.g., k-means where k=m).
Machine learning for metagenomics: methods and tools (2015)
http://arxiv.org/pdf/1510.06621.pdf
MBBC: an efficient approach for metagenomic binning based
on clustering (2015)
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0473-8
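The k-mer counting suggestion in the first approach above can be sketched in Java as shown below. This is a hedged illustration rather than the project code: the class name KmerProfile, the base-4 encoding (A=0, C=1, G=2, T=3), and the choice to skip any k-mer containing a character other than A, C, G, or T are assumptions. For a given k the vector has 4^k entries, so k = 4 gives 256 entries.

```java
// Hypothetical sketch: turns a read into a vector of k-mer counts.
final class KmerProfile {

    // Maps a nucleotide to a 2-bit code, or -1 for any other character (e.g. N).
    private static int code(char base) {
        switch (base) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default: return -1;
        }
    }

    // Returns a vector of length 4^k; entry i counts the k-mer whose base-4
    // encoding (leftmost base most significant) equals i. Intended for small k.
    static double[] countKmers(String read, int k) {
        int size = 1 << (2 * k);
        int mask = size - 1;
        double[] counts = new double[size];
        int index = 0; // rolling encoding of the last (up to) k valid bases
        int valid = 0; // number of consecutive valid bases seen so far
        for (int i = 0; i < read.length(); i++) {
            int c = code(read.charAt(i));
            if (c < 0) {
                // An ambiguous base invalidates every k-mer that overlaps it.
                valid = 0;
                index = 0;
                continue;
            }
            index = ((index << 2) | c) & mask;
            valid++;
            if (valid >= k) {
                counts[index]++;
            }
        }
        return counts;
    }
}
```

Calling countKmers(read, 4) on every read yields the vectors that a clustering algorithm such as k-means (with k=m) would then group. Dividing each vector by its sum would make reads of different lengths comparable, although that normalization is not part of the original suggestion.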