You must form groups of two students. If you cannot find a partner, please email Shiwen, who can find you a partner.
There are two types of projects: software projects and survey projects.
Your project proposal must be approved before you proceed with your project.
Your project grade depends not only on whether you addressed all items in your proposal, but also on the overall complexity and interestingness of your project.
The project deliverables should be submitted as a hard copy (except the source code for software projects) in class on 12/5/2012 and emailed (including source code) the same day to Shiwen and Vagelis.
Guidelines for Software projects
In the project proposal, you must state which dataset you plan to use, what problem you will solve, and how you will evaluate your solution.
A software project discovers or leverages interesting relationships within a significant amount of data. Ideally, the project leverages what we have learned in class.
A typical project involves:
1. Select one or more datasets, e.g., from http://archive.ics.uci.edu/ml/datasets.html, tweets, http://www.kaggle.com/, http://data.gov, or another source.
2. Define a problem on these data. For example, given a demographics dataset, you may study which attributes (e.g., income, age, zip code, race) are correlated, whether an existing classifier performs well, whether the data need special preprocessing, what the clusters produced by different clustering algorithms mean, whether there are interesting patterns, how to handle missing or dirty data, and so on.
3. Solve the problem. If the problem is sufficiently complex (e.g., it uses multiple datasets, requires tricky preprocessing, or involves crawling the Web to get the data), you may use data mining packages (e.g., WEKA). Otherwise, you should implement the data mining algorithms yourself, in any programming language. State clearly in your report which existing software you use.
4. Evaluate your solution.
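As an illustration only, the steps above might look like the following minimal Python sketch on a tiny demographics-style dataset. The records, the attribute choice (age and income), and the helper names `drop_missing`, `pearson`, `majority_baseline`, and `accuracy` are all hypothetical and are not part of the course material. Dropping incomplete rows is just one simple way to handle missing data; a real project should justify its choice.

```python
import math
from collections import Counter

# Hypothetical toy records: (age, income in $1000s); None marks a missing value.
records = [
    (25, 30.0), (32, 45.0), (47, None), (51, 80.0),
    (None, 52.0), (38, 60.0), (29, 35.0), (60, 95.0),
]

def drop_missing(rows):
    """Preprocessing: keep only rows in which every attribute is present."""
    return [row for row in rows if all(value is not None for value in row)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def majority_baseline(train_labels):
    """Simplest possible classifier: always predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Evaluation: fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

clean = drop_missing(records)
ages = [row[0] for row in clean]
incomes = [row[1] for row in clean]
r = pearson(ages, incomes)
print(f"rows kept: {len(clean)}, correlation(age, income) = {r:.3f}")
```

Any classifier you build should be compared against a trivial baseline such as `majority_baseline` on held-out data; if it does not beat the baseline, that is itself worth discussing in the report.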
Project ideas (assuming you are able to find the right datasets):
Create a Web spam classifier
Find attributes of a user profile in a social network that influence their choice of friends or groups
Find keywords that co-occur in Tweets or that are correlated with various holidays
Find products to recommend for bundling in e-commerce sites like Amazon.com
Tell something useful about a collection of documents (e.g., Web pages, news articles, reviews, blogs). Possible goals include identifying sentiment (is a review positive or negative?), telling wise blogs from foolish ones, and telling real news from publicity releases
Cluster patients by their symptoms
Predict how many tweets a user will submit in a week
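To make one of these ideas concrete, the keyword co-occurrence idea could start from a simple pair count like the sketch below. The tweets are invented, the function name `cooccurrence_counts` is hypothetical, and whitespace tokenization is a simplifying assumption; a real project would need proper tokenization, stop-word removal, and a much larger dataset.

```python
from collections import Counter
from itertools import combinations

# Invented example tweets; a real project would load a collected dataset.
tweets = [
    "happy thanksgiving turkey dinner",
    "turkey dinner with family",
    "black friday shopping deals",
    "thanksgiving dinner family",
]

def cooccurrence_counts(docs):
    """Count, for every unordered word pair, the number of docs containing both words."""
    counts = Counter()
    for doc in docs:
        # Deduplicate and sort so each pair is counted once per doc in a canonical order.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts

pair_counts = cooccurrence_counts(tweets)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Raw counts favor frequent words, so a project along these lines would likely move on to a normalized measure (for example, support and confidence as in association rule mining).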
The deliverables of a software project are:
1. A project report in PDF (the file name should contain the last names of all group members), about 10 pages in any format you like, that includes most of the following, plus other material if needed:
data description
problem definition
data preprocessing
data mining algorithms used and why
evaluation, graphs of experiments, result tables
screenshots if the program has an interesting user interface
discussion on what was hard to achieve, limitations
observations, conclusions
2. A zip file with the source code.
Guidelines for Survey projects
In the project proposal, you must include the topic description and the list of papers you will survey.
First, you need to pick an interesting topic related to Data Mining on which there has been an adequate amount of research. Use Google Scholar to find the most important papers in this area (look for papers with many citations). Also consider commercial systems or products related to your topic.
Select about 5 papers for the survey. The paper selection must be part of the project proposal.
The deliverable is the survey paper in PDF (the file name should contain the last names of all group members), which is 12 pages formatted as described in http://www.acm.org/sigs/publications/proceedings-templates.
The survey must identify the common and the different characteristics across the papers and present them in a coherent and integrated way, rather than as one paper per section.
Examples of survey topics are: