Data Mining and Data Warehousing

CSC 588 - Data Warehouse and Mining Systems

Semester: Second

This introductory course covers the concepts, algorithms, techniques, and systems of data warehousing and data mining, including (1) what data mining is, (2) getting to know your data: data preprocessing, integration, and transformation, (3) design and implementation of data warehouse and OLAP systems, (4) data cube technology, (5) mining frequent patterns and associations: basic concepts and advanced methods, (6) classification and prediction: basic concepts, and (7) cluster analysis: basic concepts. The course serves both senior-level computer science undergraduates and first-year graduate students interested in the field. It may also attract students from other disciplines who need to understand, develop, and use data warehouse and data mining systems to analyze large amounts of data.

CSC 588: Introduction to Data Warehousing and Data Mining

About the Course

Instructor
Dr. Fawzi Ibrahim

Lectures:
Monday: 8:00-10:30 am
Office hours:
TBA

Teaching Assistants
TBA
Office Location: TBA
Office hours:
TBA

Prerequisites
• Background: "Data Structure and Software Principles" or consent of the instructor (a good knowledge of statistics and machine learning will help you understand the course materials).
• Programming: We will give one programming assignment. You will need to be familiar with at least one programming language, such as C++ or Java. We will not cover programming-specific issues in this course.

Textbook
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.

References
The following texts are recommended but not required. There are numerous other books or online resources on data mining available.
• E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011.
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009.
• T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
• P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2005.
• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed., Morgan Kaufmann, 2005.

Lecture slides contain most of the technical briefings and reference materials. Please study them when preparing for and reviewing each class. Many research papers will also help you understand the course content; check the references of this course for further information.

Source code and implementations of data mining algorithms
• Source code for frequent pattern mining, clustering, time series, and Web mining algorithms implemented by the Chinese Univ. of Hong Kong: http://appsrv.cse.cuhk.edu.hk/~kdd/program.html
• FIMI workshops: datasets and source code for frequent itemset mining implementations: http://fimi.cs.helsinki.fi/
• Frequent itemset mining algorithm implementations by Bart Goethals: http://www.adrem.ua.ac.be/~goethals/software/
• IlliMine: repository of implementations from the UIUC data mining research package: http://illimine.cs.uiuc.edu/
• Weka 3: Data Mining with Open Source Machine Learning Software in Java: http://www.cs.waikato.ac.nz/ml/weka/
• Graph mining algorithm implementations: gSpan and CloseGraph

Course Format, Activities, Evaluation
This course draws its material mainly from the textbook; the course slides are also important references. Students are expected to study the materials and complete all course requirements.

Reading: Before and After Classes
We encourage students to read ahead on the material to be discussed before each lecture. Please check the schedule page to see what each lecture will cover before class begins.

Homework and programming assignments
There will be about three assignments, spaced out over the semester. One of them (or at least part of it) will be a programming assignment.

Examinations
There will be three exams: two midterm exams, each 1.5 hours long, and a final exam, 3 hours long. We will not normally give make-ups for missed exams.

Evaluation
We plan to determine final grades of the course in the following way:
• Assignments: 6% (2 homework assignments, 3% each)
• Quizzes: 6% (3 quizzes, 2% each)
• Lab work: 3% (attendance and lab work)
• Two Midterm exams: 30% (First exam 10%, second exam 20%)
• Final exam: 40%
• Project: 15%, in one of two forms:
  Option 1: survey (10%) plus assignment 3 (5%)
  Option 2: software project or research project (15%)
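
The weights above add up to 100%. As a quick check of the arithmetic, the short Python sketch below computes a weighted course grade from per-component scores. The component names and the example scores are hypothetical and chosen only for illustration; the weights are copied from the list above, with the project counted as 15% under either option. This is not an official grading tool.

```python
# Weights (percent of the final grade) copied from the Evaluation list.
# The dictionary keys are illustrative names, not official terms.
WEIGHTS = {
    "assignments": 6,
    "quizzes": 6,
    "lab": 3,
    "midterm1": 10,
    "midterm2": 20,
    "final": 40,
    "project": 15,  # Option 1 (10% + 5%) or Option 2 (15%)
}

# The components should account for the whole grade.
assert sum(WEIGHTS.values()) == 100


def final_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` maps every component name in WEIGHTS to a fraction in [0, 1].
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the components in WEIGHTS")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student who earns half of the marks in every component.
half_marks = {name: 0.5 for name in WEIGHTS}
print(final_grade(half_marks))
```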

Course project

You can choose one of the following options:
1. Survey: (2-3 students)
Write a focused, comprehensive survey on a specific topic in data warehousing or data mining, for example data warehouse architectures, clustering methods, or frequent itemset techniques. You will need to give a talk by the end of the year (no PowerPoint presentation is required). For this option you are required to complete and submit assignment 3.

2. Data mining software function maker or a full data warehouse: (4-5 students)
Implement one high-performance, fully documented, open source data mining function maker or a full data warehouse application, as discussed in the textbook, in Java or C++ (or any programming language you prefer). It should include a user interface and a visualization package. You will be required to write a report and give a presentation. Students who choose this option are exempt from assignment 3; its mark is added to the project mark.
[Note: copying online open source software is considered plagiarism!]

3. Research Project: (3-4 students)
You can also propose and carry out a research project. In this project you compare two or three algorithms and study their time, accuracy, or space performance. From your results, you may draw a conclusion about which algorithm is best and in which cases. You will be required to write a report and give a presentation. You are exempt from assignment 3; its mark is added to the project mark.
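
As a starting point for such a comparison, the Python sketch below times two ways of counting item support and records their peak memory use. Everything in it is an assumption made for illustration: the two toy functions, the `benchmark` helper, and the three-transaction dataset are hypothetical and are not taken from the textbook or from the course assignments. A real project would substitute the algorithms under study and realistic datasets.

```python
import time
import tracemalloc
from collections import Counter


def support_naive(transactions, items):
    """Count how many transactions contain each item (one scan per item)."""
    return {item: sum(1 for t in transactions if item in t) for item in items}


def support_single_pass(transactions, items):
    """Compute the same counts with a single scan over the transactions."""
    wanted = set(items)
    counts = Counter()
    for t in transactions:
        counts.update(wanted.intersection(t))
    return {item: counts[item] for item in items}


def benchmark(func, *args, repeats=5):
    """Return (result, best wall-clock seconds, peak traced bytes) for func(*args)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    # Measure memory in a separate run so tracing does not distort the timing.
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, best, peak


# Toy dataset, for illustration only.
transactions = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
items = ["a", "b", "c"]

for func in (support_naive, support_single_pass):
    result, seconds, peak = benchmark(func, transactions, items)
    print(func.__name__, result, f"{seconds:.6f} s", f"{peak} bytes")
```

Checking that both functions return the same counts before comparing their cost keeps the comparison fair: a faster algorithm that gives different answers is not a valid competitor.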

Project Schedule

1. One-page proposal (week 3) (1%)
A one-page project proposal, with name, title, abstract, and reference list, should be handed in for comments and feedback.

2. Mid-term review (week 9) (4%)
Progress check for the project. Discuss your problems, progress, and further work with the TA.

3. Final submission (week 16) (10%)
Submit the final report or survey, and give the talks and presentations.

Class Schedule for CSC 588
This page provides our class schedule for previous semesters. This semester the schedule may be modified slightly based on the progress of the class.

Week  Topic                                        Assignment Out           Assignment Due
1     Class Outline / Chapter 1: Introduction
1     Chapter 1: Introduction                      Introduction.ppt
2     Chapter 1: Introduction                      Assign#1
2     Chapter 1: Introduction
3     Chapter 2: Data Preprocessing                Pre-Processing
3     Chapter 2: Data Preprocessing
4     Chapter 2: Data Preprocessing
4     Chapter 2: Data Preprocessing                One page proj. proposal
5     Chapter 2: Data Preprocessing
5     Chapter 2: Data Preprocessing
6     Chapter 2: Data Preprocessing
6     Chapter 3: Data Warehousing and Data Cube                             Assign#1 due
7     Chapter 3: Data Warehousing and Data Cube
7     First Midterm exam                           Assign#2
8     Chapter 3: OLAP
9     Break
10    Chapter 3: OLAP                                                       Mid-term proj. review
10    Chapter 3: OLAP
11    Chapter 3: OLAP
11    Chapter 5: Mining Frequent Patterns
12    Chapter 5: Mining Frequent Patterns          Assign#3                 Assign#2 due
12    Second Midterm exam
13    Chapter 5: Mining Frequent Patterns
13    Chapter 6: Classification: Basic Concepts
14    Chapter 6: Classification: Basic Concepts
14    Chapter 6: Classification: Basic Concepts
15    Chapter 7: Cluster Analysis: Basic Concepts                           Assign#3 due
15    Chapter 7: Cluster Analysis: Basic Concepts
16    Presentations
16    Presentations                                                         Projects due
17    Final Exam