I took Dr. Ernest Fokoue’s course Data Mining (STAT 747) during my Master’s study at RIT and gained a tremendous fascination with modern Statistical Machine Learning techniques. I want to share the essence of this course along with my own self-learning and reflections on it.
This course covers topics such as clustering, classification and regression trees, multiple linear regression under various conditions, logistic regression, PCA and kernel PCA, model-based clustering via mixtures of Gaussians, spectral clustering, text mining, neural networks, support vector machines, multidimensional scaling, variable selection, model selection, k-means clustering, k-nearest neighbors classifiers, statistical tools for modern machine learning and data mining, naïve Bayes classifiers, variance reduction methods (bagging), and ensemble methods for predictive optimality.
In this post I will lay out the roadmap of this note series and follow that order. Each post covers one essential data mining technique, and later posts will include related examples and exercises based on these methods.
- Supervised Learning
  - Classification
  - Regression
- Unsupervised Learning
  - Clustering Analysis
  - Factor Analysis
  - Topic Modeling
  - Recommender Systems
- Applications in Statistical Machine Learning
  - Handwritten Digit Recognition (MNIST)
  - Text Mining
  - Credit Scoring
  - Disease Diagnostics
  - Audio Processing
  - Speaker Recognition & Speaker Identification
Computing Tools in R
```r
library(ctv)   # CRAN Task Views: curated lists of R packages by topic
```
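As a quick, hedged illustration (assuming the ctv package is installed from CRAN), the snippet below installs the packages curated under the CRAN "MachineLearning" task view, which covers most of the methods listed in the roadmap above.

```r
library(ctv)                        # tools for working with CRAN task views
install.views("MachineLearning")    # install the packages in the "MachineLearning" task view
```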
Important Aspects of Machine Learning
Machines inherently designed to handle p larger than n problems (see the sketch after this list)
- Classification and Regression Trees
- Support Vector Machines
- Relevance Vector Machines (n < 500)
- Gaussian Process Learning Machines (n < 500)
- k-Nearest Neighbors Learning Machines (Watch for the curse of dimensionality)
- Kernel Machines in general
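A minimal sketch, assuming the e1071 and class packages are installed and using simulated data (the variable names and settings here are my own, not the course's): both an SVM and a k-NN classifier can be fit directly when p = 200 exceeds n = 50.

```r
set.seed(747)
n <- 50; p <- 200                              # more predictors than observations
X <- matrix(rnorm(n * p), nrow = n)
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n, sd = 0.5) > 0, "A", "B"))
dat <- data.frame(y = y, X)

library(e1071)                                 # support vector machines
svm.fit <- svm(y ~ ., data = dat, kernel = "radial")
mean(fitted(svm.fit) == y)                     # training accuracy of the SVM

library(class)                                 # k-nearest neighbors
knn.pred <- knn(train = X, test = X, cl = y, k = 5)
mean(knn.pred == y)                            # training accuracy of k-NN
```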
Machines that can handle p larger than n problems when regularized with suitable constraints (see the sketch after this list)
- Multiple Linear Regression Models
- Generalized Linear Models
- Discriminant Analysis
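A minimal sketch, assuming the glmnet package is installed (glmnet is one common choice for penalized regression, used here for illustration): with a ridge or lasso penalty, a linear model remains estimable even when p = 200 > n = 50.

```r
set.seed(747)
n <- 50; p <- 200
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(rep(2, 5), rep(0, p - 5))            # only 5 predictors are truly active
y <- drop(X %*% beta) + rnorm(n)

library(glmnet)
ridge.cv <- cv.glmnet(X, y, alpha = 0)         # alpha = 0 gives the ridge penalty
lasso.cv <- cv.glmnet(X, y, alpha = 1)         # alpha = 1 gives the lasso penalty
c(ridge = ridge.cv$lambda.min, lasso = lasso.cv$lambda.min)
```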
Ensemble Learning Machines (see the sketch after this list)
- Random Subspace Learning Ensembles (Random Forest)
- Boosting and its extensions
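A minimal sketch, assuming the randomForest and gbm packages are installed (these are illustrative implementations, not necessarily the ones used in the course): a random forest as a random subspace ensemble, and a small boosted model on a two-class subset of iris.

```r
library(randomForest)
set.seed(747)
rf.fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf.fit$err.rate[500, "OOB"]                    # out-of-bag misclassification rate

library(gbm)
iris2 <- subset(iris, Species != "setosa")     # gbm's Bernoulli loss needs a 0/1 response
iris2$is.virginica <- as.numeric(iris2$Species == "virginica")
gbm.fit <- gbm(is.virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = iris2, distribution = "bernoulli",
               n.trees = 200, interaction.depth = 2, shrinkage = 0.05)
summary(gbm.fit, plotit = FALSE)               # relative influence of each predictor
```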
Note: The parts marked in red in this note series remain questionable; I will update them and add explanations as soon as I figure them out.