Data Mining and Data Warehousing
CSC 588 - Data Warehouse and Mining Systems
Second
This introductory course covers the concepts, algorithms, techniques, and systems of data warehousing and data mining, including (1) what data mining is, (2) getting to know your data, and data preprocessing, integration, and transformation, (3) design and implementation of data warehouse and OLAP systems, (4) data cube technology, (5) mining frequent patterns and associations: basic concepts and advanced methods, (6) classification and prediction: basic concepts, and (7) cluster analysis: basic concepts. The course serves both senior-level computer science undergraduates and first-year graduate students interested in the field. It may also attract students from other disciplines who need to understand, develop, and use data warehouse and data mining systems to analyze large amounts of data.
CSC 588: Introduction to Data Warehousing and Data Mining
About the Course
Instructor
Dr. Fawzi Ibrahim
Lectures:
Monday: 8:00-10:30 am
Office hours:
TBA
Teaching Assistants
TBA
Office Location: TBA
Office hours:
TBA
Prerequisites
• Background: "Data Structure and Software Principles" or consent of instructor (a good knowledge of statistics and machine learning will help you understand the course materials).
• Programming: We will give one programming assignment. You will need to be familiar with at least one programming language, such as C++ or Java. We will not cover programming-specific issues in this course.
Textbook
• Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.
References
The following texts are recommended but not required. There are numerous other books or online resources on data mining available.
• E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011.
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009.
• T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
• P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison Wesley, 2005.
• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed., Morgan Kaufmann, 2005.
The lecture slides contain most of the technical briefing and reference materials. Please study them when preparing for class and when reviewing afterwards. Many research papers will also help you understand the course contents. Please check the references of this course for further information.
Source code and implementations of data mining algorithms
• Source code for Frequent Pattern Mining, Clustering, Time Series and Web Mining algorithms implemented by Chinese Univ. of Hong Kong: http://appsrv.cse.cuhk.edu.hk/~kdd/program.html
• FIMI workshops: Datasets and source code for frequent itemset mining implementations: http://fimi.cs.helsinki.fi/
• Frequent itemset mining algorithm implementations by Bart Goethals: http://www.adrem.ua.ac.be/~goethals/software/
• Repository of implementations of UIUC data mining research package: IlliMine: http://illimine.cs.uiuc.edu/
• Weka: Weka 3 - Data Mining with Open Source Machine Learning Software in Java: http://www.cs.waikato.ac.nz/ml/weka/
• Graph mining algorithm implementations: gSpan and CloseGraph
Course Format, Activities, Evaluation
This course will draw its materials mainly from the textbook; the course slides are also important references. Students are expected to study the materials and complete all the course requirements.
Reading: Before and After Classes
We encourage students to read the materials to be discussed before each lecture. Please check the schedule page to see what will be covered in each lecture before the class begins.
Homework and programming assignments
There will be about three assignments, spaced out over the course of the semester. One of these assignments (or at least part of it) will be a programming assignment.
Examinations
There will be three exams: two midterm exams, each 1.5 hours in length, and a final exam, 3 hours in length. We will not normally give make-ups for missed exams.
Evaluation
We plan to determine final grades of the course in the following way:
• Assignments: 6% (2 homework assignments, 3% each)
• Quizzes: 6% (3 quizzes, 2% each)
• Lab work: 3% (attendance and lab work)
• Two Midterm exams: 30% (First exam 10%, second exam 20%)
• Final exam: 40%
• Project, Option 1: survey (10%) + assignment 3 (5%)
• Project, Option 2: software project or research project (15%)
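As an illustration of how the weights above combine into a final grade, here is a minimal sketch. The weights are copied from the list above; the component names, the function name, and the assumption that each component is scored as a fraction between 0 and 1 are hypothetical and only serve the example.

```python
# Weights (percent of the final grade) copied from the evaluation list.
# The dictionary keys are illustrative names, not official ones.
WEIGHTS = {
    "assignments": 6,   # 2 homework assignments, 3% each
    "quizzes": 6,       # 3 quizzes, 2% each
    "lab_work": 3,
    "midterm_1": 10,
    "midterm_2": 20,
    "final_exam": 40,
    "project": 15,      # either project option totals 15%
}

# Sanity check: the listed components should account for the whole grade.
assert sum(WEIGHTS.values()) == 100


def final_grade(scores):
    """Return the final grade in percent.

    `scores` maps each component name to the fraction earned,
    assumed to lie between 0 and 1.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, a student who earns half of the available marks in every component would receive `final_grade({name: 0.5 for name in WEIGHTS})`, that is, half of the total.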
Course project
You can choose one of the following options:
1. Survey: (2-3 students)
Write a comprehensive survey on a focused topic in data warehousing or data mining, for example, a survey on data warehouse architectures, clustering methods, or frequent itemset mining techniques. You will need to give a talk by the end of the year (no PowerPoint presentation is required). For this option you will be required to complete and submit assignment 3.
2. Data mining software function maker or a full data warehouse: (4-5 students)
Implement one high-performance, fully documented, open source data mining function maker or a full data warehouse application, as discussed in the textbook, in Java or C++ (or any programming language that you prefer). This should include a user interface and a visualization package. You will be required to write a report and give a presentation. Students who choose this option are exempted from assignment 3, and its mark is added to the project mark.
[Note: copying online open source software is considered plagiarism!]
3. Research Project: (3-4 students)
You can also propose and carry out a research project. In this project you compare two or three algorithms and study their time, accuracy, or space performance. From your results you may draw a conclusion about which algorithm is best and in which cases. You will be required to write a report and give a presentation. You will be exempted from assignment 3, and its mark is added to the project mark.
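A comparison of this kind can be sketched as a small measurement harness. The example below is only an illustration under stated assumptions: the two pair-counting functions are toy stand-ins for the algorithms a project would actually compare, and all function names are hypothetical. It measures wall-clock time with `time.perf_counter` and peak Python memory with `tracemalloc` from the standard library.

```python
import time
import tracemalloc
from collections import Counter
from itertools import combinations


def count_pairs_nested(transactions):
    """Toy algorithm A: count co-occurring item pairs with nested loops."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                pair = (items[i], items[j])
                counts[pair] = counts.get(pair, 0) + 1
    return counts


def count_pairs_counter(transactions):
    """Toy algorithm B: the same counts via combinations and Counter."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(set(t)), 2))
    return dict(counts)


def benchmark(fn, data, repeats=3):
    """Return (best wall-clock seconds, peak traced bytes) over `repeats` runs.

    Note: tracemalloc adds overhead, so the times are only comparable
    between runs measured the same way.
    """
    best = float("inf")
    peak = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        fn(data)
        elapsed = time.perf_counter() - start
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best = min(best, elapsed)
        peak = max(peak, run_peak)
    return best, peak


if __name__ == "__main__":
    # Small hypothetical transaction database, repeated to give measurable work.
    data = [["a", "b", "c"], ["a", "b"], ["b", "c"]] * 1000
    for fn in (count_pairs_nested, count_pairs_counter):
        seconds, peak_bytes = benchmark(fn, data)
        print(f"{fn.__name__}: {seconds:.4f} s, peak {peak_bytes} bytes")
```

Before comparing performance, it is worth checking that the algorithms under comparison produce the same output on the same input; accuracy comparisons would additionally need labeled data and a suitable metric, which this sketch does not cover.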
Project Schedule
1. One page proposal (week 3) (1%)
A one-page project proposal, with name, title, abstract, and reference list, should be handed in for comments and feedback.
2. Mid-term review (week 9) (4%)
A check on the progress of the project. Discuss your problems, progress, and further work with the TA.
3. Final submission (week 16) (10%)
Submit the final report or survey, and give the talks and presentations.
Class Schedule for CSC 588
This page provides our class schedule for previous semesters. This semester the schedule may be modified slightly based on the progress of the class.
Week# | Topic | Assignment Out | Assignment Due
1 | Class Outline / Chapter 1: Introduction | |
1 | Chapter 1: Introduction (Introduction.ppt) | |
2 | Chapter 1: Introduction | Assign#1 |
2 | Chapter 1: Introduction | |
3 | Chapter 2: Data Preprocessing (Pre-Processing) | |
3 | Chapter 2: Data Preprocessing | |
4 | Chapter 2: Data Preprocessing | |
4 | Chapter 2: Data Preprocessing | | One page proj. proposal
5 | Chapter 2: Data Preprocessing | |
5 | Chapter 2: Data Preprocessing | |
6 | Chapter 2: Data Preprocessing | |
6 | Chapter 3: Data Warehousing and Data Cube | | Assign#1 due
7 | Chapter 3: Data Warehousing and Data Cube | |
7 | First Midterm exam | Assign#2 |
8 | Chapter 3: OLAP | |
9 | Break | |
10 | Chapter 3: OLAP | | Mid term proj. review
10 | Chapter 3: OLAP | |
11 | Chapter 3: OLAP | |
11 | Chapter 5: Mining Frequent Patterns | |
12 | Chapter 5: Mining Frequent Patterns | Assign#3 | Assign#2 due
12 | Second Midterm exam | |
13 | Chapter 5: Mining Frequent Patterns | |
13 | Chapter 6: Classification: Basic Concepts | |
14 | Chapter 6: Classification: Basic Concepts | |
14 | Chapter 6: Classification: Basic Concepts | |
15 | Chapter 7: Cluster Analysis: Basic Concepts | | Assign#3 due
15 | Chapter 7: Cluster Analysis: Basic Concepts | |
16 | Presentations | |
16 | Presentations | | Projects due date
17 | Final Exam | |