Oct 22, 2012
Assignment for CS 235, FA 2012
Instructor: Vagelis Hristidis
UCR
The assignment is individual, no groups allowed.
For this assignment you are asked to write a mini-paper about data cleaning for data that comes from web forms (many sample web forms are at http://www.formlogix.com/TagPage.aspx?tagId=2&tagName=Registration). In particular, you should propose algorithms to detect and correct erroneous or missing values in the possibly very large table that stores the data users entered in the web form.
You may want to define different types of attributes, and how you would handle each of them, or how multiple attributes can be handled together.
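To make the idea of per-attribute-type handling concrete, here is a minimal Python sketch. It is only an illustration and not a suggested or required solution: the three attribute types (numeric, categorical, email), the range bounds, the median/most-common-value fill rules, and all function names are assumptions made for this example.

```python
import re
from collections import Counter
from statistics import median

# Hypothetical, deliberately loose email pattern; assumed for this example only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean_numeric(values, low, high):
    """Replace missing or out-of-range numbers with the median of the valid ones."""
    valid = [v for v in values if v is not None and low <= v <= high]
    fill = median(valid) if valid else None
    return [v if (v is not None and low <= v <= high) else fill for v in values]

def clean_categorical(values):
    """Normalize case and whitespace, then fill missing entries with the most common value."""
    normalized = [v.strip().lower() if v and v.strip() else None for v in values]
    present = [v for v in normalized if v is not None]
    fill = Counter(present).most_common(1)[0][0] if present else None
    return [v if v is not None else fill for v in normalized]

def flag_invalid_emails(values):
    """Return the row indices whose email value is missing or malformed."""
    return [i for i, v in enumerate(values) if not v or not EMAIL_RE.match(v.strip())]

# Toy columns, made up for illustration.
ages = clean_numeric([25, None, 31, 410, 28], low=0, high=120)
states = clean_categorical(["CA", " ca", None, "NY", ""])
bad_emails = flag_invalid_emails(["a@b.com", "not-an-email", None, "x@y.org "])
print(ages)        # [25, 28, 31, 28, 28]
print(states)      # ['ca', 'ca', 'ca', 'ny', 'ca']
print(bad_emails)  # [1, 2]
```

Each function here treats one column in isolation; handling multiple attributes together (for example, checking that a zip code agrees with a state) would need rules that look at several columns of the same row.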
The paper should have the following sections:
a. abstract (1 paragraph)
b. intro (half page): define problem informally, explain why it is hard, and give overview of your solution.
c. framework and problem definition (half page): here you define the setting formally, e.g., let T be a table with rows r1,…,rn,…, and then the problem formally, e.g., Given T, compute T' such that ….
d. solution:
   a. overview of solution algorithm(s) (1 paragraph). Note that the solution may be based on data mining or other techniques.
   b. pseudocode (up to 20 lines)
   c. example (half page): show two small tables, and how your algorithm works on them
e. complexity analysis (1 paragraph): what is (roughly) the asymptotic complexity of your algorithm? If there is user interaction or there are other parts whose complexity is tricky to estimate, discuss the complexity of the remaining parts.
f. discussion (half page): limitations of your solution, settings in which your approach would perform well or poorly, possible future improvements
g. conclusions (1 paragraph)
The whole paper should be up to 4 pages including any figures or tables or citations (citations are not required), 11pt font size, in pdf.
You will be graded primarily on presentation, and secondarily on novelty/ideas.
Feel free to search for related work, but keep in mind that you will be graded mainly on the presentation of the paper and not on the complexity of your solutions (although a smart and novel idea will be appreciated). A relatively simple idea with good presentation may get full points.
Also, read other research papers, e.g., those published in the ACM SIGMOD conference (http://dl.acm.org/citation.cfm?id=2213836&picked=prox; you need to be on campus to have access to the pdfs), to see how each section should be written.
Tips:
make sure all your citations, tables, and images are referenced in the text.
make sure all symbols are defined.
use examples to explain tricky parts of the paper.
do not try to impress by mentioning complex data mining methods, if their connection and application to the problem are not clearly explained.
first give a one-paragraph overview of your algorithm and then provide details and pseudocode.
Email your pdf to Shiwen and me (in a single email) with email subject “assignment of CS 235”, and turn in a hard copy in class on 11/5/2012.