Data Mining Note 1 - Introduction

I took Dr. Ernest Fokoue’s Data Mining course (STAT 747) during my Master's studies at RIT and learned many fascinating modern statistical machine learning techniques. In this series I want to share the essence of the course, along with what I learned on my own and my reflections on it.

This course covers topics such as clustering, classification and regression trees, multiple linear regression under various conditions, logistic regression, PCA and kernel PCA, model-based clustering via mixtures of Gaussians, spectral clustering, text mining, neural networks, support vector machines, multidimensional scaling, variable selection, model selection, k-means clustering, k-nearest neighbors classifiers, statistical tools for modern machine learning and data mining, naïve Bayes classifiers, variance reduction methods (bagging), and ensemble methods for predictive optimality.

This post lays out the roadmap for the series, and later posts follow its order. Each post covers one essential data mining technique, followed by related examples and exercises based on that method.

  • Supervised Learning
    Classification
    Regression
  • Unsupervised Learning
    Clustering Analysis
    Factor Analysis
    Topic Modeling
    Recommender System
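
As a rough illustration of the two branches of the roadmap, the sketch below applies one supervised and one unsupervised method to the same data. It assumes R's built-in iris data set, lda() from MASS for classification, and kmeans() from base R for clustering. The 100/50 train/test split, the seed, and the variable names are arbitrary choices made for illustration only.

```r
library(MASS)  # provides lda()

set.seed(1)

# Supervised learning: the labels (Species) are used to fit the model.
train_idx <- sample(nrow(iris), 100)
fit <- lda(Species ~ ., data = iris[train_idx, ])
pred <- predict(fit, iris[-train_idx, ])$class
test_accuracy <- mean(pred == iris$Species[-train_idx])

# Unsupervised learning: the labels are withheld and k-means groups
# the rows using the four numeric features alone.
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# The labels are used only afterwards, to inspect the clusters.
cluster_vs_species <- table(km$cluster, iris$Species)
print(test_accuracy)
print(cluster_vs_species)
```

The classifier is judged on held-out labels, while the clustering has no labels to be judged against, so the cross-tabulation is only a sanity check and not a training signal.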

Applications in Statistical Machine Learning

  • Handwritten Digit Recognition (MNIST)
  • Text Mining
  • Credit Scoring
  • Disease Diagnostics
  • Audio Processing
  • Speaker Recognition & Speaker Identification

Computing Tools in R

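# CRAN Task Views are curated collections of packages, not packages,
# so they are installed through ctv rather than loaded with library().
# Note: this downloads every package in each view.
library(ctv)
install.views(c("MachineLearning", "HighPerformanceComputing",
                "TimeSeries", "Bayesian", "Robust"))

# Individual packages
library(biglm)         # linear models for data too large for memory
library(foreach)       # looping construct with parallel backends
library(glmnet)        # lasso and elastic-net regularized GLMs
library(kernlab)       # kernel methods (SVM, kernel PCA, spectral clustering)
library(randomForest)  # random forests
library(ada)           # boosting
library(audio)         # audio interface
library(rpart)         # classification and regression trees
library(e1071)         # SVM, naive Bayes
library(MASS)          # LDA, QDA and other classical tools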

Important Aspects of Machine Learning

Machines inherently designed to handle p larger than n problems
  • Classification and Regression Trees
  • Support Vector Machines
  • Relevance Vector Machines (n < 500)
  • Gaussian Process Learning Machines (n < 500)
  • k-Nearest Neighbors Learning Machines (Watch for the curse of dimensionality)
  • Kernel Machines in general

Machines that can handle p larger than n problems if regularized with suitable constraints
  • Multiple Linear Regression Models
  • Generalized Linear Models
  • Discriminant Analysis

Ensemble Learning Machines
  • Random Subspace Learning Ensembles (Random Forest)
  • Boosting and its extensions


Note: Parts marked in red in this note series are still open questions for me. I will update them and add explanations as soon as I figure them out.
