Machine Learning & Predictive Analytics

Biostatistics Machine Learning

An Overview of Machine Learning & Predictive Analytics

Hai Nguyen
April 14, 2021

What is Machine Learning?

  1. Machine learning searches for meaningful transformations of data, from input data to output data.

  2. Transformations represent, or encode, the data (e.g., RGB and HSV are two ways to encode a color pixel).

  3. Learning is an automatic search for better representations of the data.

  4. The search proceeds through a predefined space of possibilities, guided by a feedback signal.

  5. “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E”

    - Tom Mitchell, Machine Learning, McGraw Hill, 1997

Experience E, Task T, Performance P
1. Chess: T: playing chess, P: % of games won, E: playing practice games against itself.
2. Driving: T: driving a vehicle, P: average distance traveled before an error, E: sequence of images and steering commands recorded during manual driving.
3. Handwriting Recognition: T: recognizing and classifying handwritten words in images, P: % of correctly classified words, E: DB of handwritten words with given classifications.

Learning Types

Supervised

  1. The majority of practical machine learning uses supervised learning.
  2. Supervised learning is where you have input variables (\(x\)) and an output variable (\(y\)) and you use an algorithm to learn the mapping function from the input to the output.
    \[ y = f(x) \]
  3. The goal is to approximate the mapping function so well that, given new input data \(x\), you can predict the output variables \(y\) for that data.
  4. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.
  5. We know the correct answers: the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

Supervised Learning Examples:
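For instance, here is a minimal sketch in base R of the two classic supervised tasks, regression and classification; the datasets and models are illustrative choices:

# Regression: learn f mapping weight -> miles per gallon.
fit_reg <- lm(mpg ~ wt, data = mtcars)
predict(fit_reg, newdata = data.frame(wt = 3))   # predict y for a new x

# Classification: learn f mapping petal measurements -> species.
library(class)                                   # knn() lives in 'class'
set.seed(42)
train_idx <- sample(nrow(iris), 100)
pred <- knn(train = iris[train_idx, 3:4],
            test  = iris[-train_idx, 3:4],
            cl    = iris$Species[train_idx], k = 5)
mean(pred == iris$Species[-train_idx])           # accuracy on held-out data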

Unsupervised Learning

  1. Unsupervised learning is where you only have input data \(x\) and no corresponding output variables.
  2. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
  3. It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.
  4. Unsupervised Learning problems can be further grouped into Clustering and Association Problems.
  5. Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
  6. Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy \(A\) also tend to buy \(B\).

Unsupervised Learning Examples:
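For instance, a minimal clustering sketch in base R; the dataset and number of clusters are illustrative assumptions:

# k-means clustering on iris: the species labels are ignored,
# we only look for structure in the inputs.
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(km$cluster)                 # sizes of the discovered groups
table(km$cluster, iris$Species)   # compare clusters with the (unused) labels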

Semi-supervised Learning

  1. Semi-supervised learning is halfway between supervised and unsupervised learning.
  2. Traditional classification methods use labeled data to build classifiers.
  3. The labeled training sets used as input in supervised learning are well defined and reliable.
  4. However, labeled data are limited, expensive, and time-consuming to generate.
  5. On the other hand, unlabeled data are cheap and readily available in large volumes.
  6. Hence, semi-supervised learning learns from a combination of a small amount of labeled data and a large amount of unlabeled data to increase the accuracy of our classifiers, as sketched below.
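One common recipe is self-training. The sketch below is a hedged illustration under assumed settings (the dataset, the logistic-regression model, and the 0.95 confidence cutoff are all illustrative), not the only way to do semi-supervised learning:

# Self-training: fit on a few labeled points, then add
# high-confidence predictions on unlabeled points as pseudo-labels.
set.seed(1)
d <- iris[iris$Species != "setosa", ]              # binary problem
d$y <- as.integer(d$Species == "versicolor")
labeled   <- sample(nrow(d), 10)                   # only 10 labeled examples
unlabeled <- setdiff(seq_len(nrow(d)), labeled)

fit <- glm(y ~ Petal.Length + Petal.Width,
           data = d[labeled, ], family = binomial)
p <- predict(fit, d[unlabeled, ], type = "response")
conf_idx <- which(p > 0.95 | p < 0.05)             # confident predictions
pseudo   <- unlabeled[conf_idx]
d$y[pseudo] <- as.integer(p[conf_idx] > 0.5)       # pseudo-label them
fit2 <- glm(y ~ Petal.Length + Petal.Width,        # retrain on the union
            data = d[c(labeled, pseudo), ], family = binomial)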

Active Learning

  1. A special case of semi-supervised learning in which the algorithm interactively queries an oracle (e.g., a human annotator) to label the most informative unlabeled examples.

Reinforcement Learning

  1. An agent learns by interacting with an environment: it takes actions, receives rewards or penalties as feedback, and updates its behavior to maximize cumulative reward.

Transfer Learning

  1. A machine learning technique in which a model trained on one task is repurposed for a second, related task.
  2. An optimization that allows rapid progress or improved performance when modeling the second task.

Universal Workflow of ML

  1. Define the problem
  2. Assemble dataset
  3. Choose a metric to quantify project outcome
  4. Decide on how to calculate the metric
  5. Prepare dataset
  6. Define standard baseline
  7. Develop model that beats baseline
  8. The ideal model is at the border between underfitting and overfitting; to know where that border is, you must cross it, i.e., deliberately develop a model that overfits first
  9. Regularize model and tune hyperparameters

ML Terminologies

  1. Dataset

    • Training: learn the parameters
    • Validation: select the hyperparameters
    • Test: estimate the generalization error of the final model
  2. Batch: the set of examples used in one iteration of model training.

  3. Mini-batch: a small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference.

  4. Epoch: a full training pass over the entire dataset such that each example has been seen once.

  5. Iteration: a single update of a model's weights during training; the arithmetic relating these terms is sketched below.
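To make the bookkeeping concrete, here is a small worked example; the numbers are illustrative assumptions:

n_examples <- 10000                          # training-set size
batch_size <- 100                            # mini-batch size
iters_per_epoch <- n_examples / batch_size   # 100 weight updates per epoch
n_epochs <- 20
iters_per_epoch * n_epochs                   # 2000 iterations in total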

Data Assumptions

  1. Training and test data are from the same probability distribution.
  2. Training and test data are independent and identically distributed (i.i.d.).

Overfitting and Underfitting

  1. Overfitting: the model fits the training data too well, i.e., it also learns patterns in the noise (see the simulation sketch after this list)

    1. Detect:

      • Low training error, high generalization error.
    2. Remedies:

      • Reduce model capacity by removing features and/or parameters.
      • Get more training data.
      • Improve training data quality by reducing noise.
  2. Underfitting: the model is too simple to detect the patterns in the data

    1. Detect

      • High training error.
    2. Remedies:

      • Increase model capacity by adding more parameters and/or features.
      • Reduce model constraints.
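Both failure modes can be seen side by side in a small simulation. This is a hedged sketch: the data-generating process and the polynomial degrees are illustrative assumptions:

# Fit polynomials of increasing degree and compare train/test error.
set.seed(7)
x <- runif(60, 0, 1)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)
train <- 1:40; test <- 41:60

for (deg in c(1, 3, 15)) {
  fit <- lm(y ~ poly(x, deg), subset = train)
  mse <- function(idx) mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n",
              deg, mse(train), mse(test)))
}
# Degree 1 underfits (high training error); degree 15 overfits
# (low training error, high test error); degree 3 is near the border.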

Parametric & Nonparametric Models

\[ y = f(x) \]

Estimate the unknown function \(f\) with \(\hat{f}\). Parametric methods assume a functional form for \(f\) (e.g., linear) and reduce the problem to estimating a fixed, finite set of parameters; nonparametric methods make no explicit assumption about the form of \(f\).
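As a sketch of the contrast, compare a parametric and a nonparametric estimate of the same \(f\) on a built-in dataset; the dataset choice is an illustrative assumption:

fit_param    <- lm(dist ~ speed, data = cars)     # parametric: assumes f is linear
fit_nonparam <- loess(dist ~ speed, data = cars)  # nonparametric: assumes only smoothness
new <- data.frame(speed = 15)
predict(fit_param, new)                           # estimate of f(15) under each model
predict(fit_nonparam, new)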

Regression Analysis

  1. OLS

    • MSE
    • Computational Complexity of matrix inversion
    • Complete training set
  2. Batch Gradient Descent (sketched in code after this list)

    • Cost function (MSE)
    • Learning rate hyperparameter
    • Partial derivative
    • Complete training set
  3. Stochastic Gradient Descent

  4. Mini-batch Gradient Descent
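As an illustration of item 2, here is a hedged base-R sketch of batch gradient descent for linear regression; the learning rate, iteration count, and simulated data are illustrative assumptions:

# Batch gradient descent for linear regression on simulated data.
set.seed(123)
n <- 100
X <- cbind(1, runif(n))                  # design matrix with intercept column
y <- X %*% c(4, 3) + rnorm(n)            # true theta = (4, 3) plus noise

eta   <- 0.1                             # learning rate hyperparameter
theta <- c(0, 0)                         # initial parameter guess
for (i in 1:1000) {
  grad  <- (2 / n) * t(X) %*% (X %*% theta - y)   # partial derivatives of MSE
  theta <- theta - eta * as.vector(grad)          # update uses the complete training set
}
theta                                    # converges near c(4, 3)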

Linear Regression with OLS

\[ y = \theta^T X \]

The cost function minimization has a closed-form solution called the Normal Equation: \[ \hat{\theta} = (X^T X)^{-1} X^T y \]
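The Normal Equation can be computed directly in base R and checked against lm(); the simulated data are an illustrative assumption, and note that in practice lm() solves the least-squares problem via a QR decomposition because inverting \(X^T X\) can be numerically unstable:

set.seed(123)
n <- 100
X <- cbind(1, runif(n))                          # intercept + one feature
y <- X %*% c(4, 3) + rnorm(n)                    # true theta = (4, 3)
theta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X^T X)^{-1} X^T y
theta_hat                                        # close to c(4, 3)
coef(lm(y ~ X[, 2]))                             # lm() gives the same estimates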

Use R or Python for machine learning?

There is something of a rivalry between the two most commonly used data science languages, R and Python. Of course, no machine learning task can be done in only one of the two languages.

R:

Python:

Google Trends shows the search interest in each language over the past 5 years, relative to the highest point on the chart for the given region.

Machine Learning with mlr Package in R

R users have the mlr package, which is similar to Python's Scikit-Learn. The package unifies ML functions from many other packages, so most ML tasks can be performed through a single interface. mlr has a large bouquet of algorithms, categorized into regression, classification, clustering, survival, multilabel classification, and cost-sensitive classification:

library(mlr)
listLearners("classif")[c("class","package")]
                            class                   package
1                     classif.ada                 ada,rpart
2              classif.adaboostm1                     RWeka
3             classif.bartMachine               bartMachine
4                classif.binomial                     stats
5                classif.boosting              adabag,rpart
6                     classif.bst                 bst,rpart
7                     classif.C50                       C50
8                 classif.cforest                     party
9              classif.clusterSVM        SwarmSVM,LiblineaR
10                  classif.ctree                     party
11               classif.cvglmnet                    glmnet
12                 classif.dbnDNN                   deepnet
13                  classif.dcSVM            SwarmSVM,e1071
14                  classif.earth               earth,stats
15                 classif.evtree                    evtree
16             classif.extraTrees                extraTrees
17             classif.fdausc.glm                   fda.usc
18          classif.fdausc.kernel                   fda.usc
19             classif.fdausc.knn                   fda.usc
20              classif.fdausc.np                   fda.usc
21                classif.FDboost            FDboost,mboost
22            classif.featureless                       mlr
23                   classif.fgam                    refund
24                    classif.fnn                       FNN
25               classif.gamboost                    mboost
26               classif.gaterSVM                  SwarmSVM
27                classif.gausspr                   kernlab
28                    classif.gbm                       gbm
29                  classif.geoDA               DiscriMiner
30               classif.glmboost                    mboost
31                 classif.glmnet                    glmnet
32       classif.h2o.deeplearning                       h2o
33                classif.h2o.gbm                       h2o
34                classif.h2o.glm                       h2o
35       classif.h2o.randomForest                       h2o
36                    classif.IBk                     RWeka
37                    classif.J48                     RWeka
38                   classif.JRip                     RWeka
39                   classif.kknn                      kknn
40                    classif.knn                     class
41                   classif.ksvm                   kernlab
42                    classif.lda                      MASS
43       classif.LiblineaRL1L2SVC                 LiblineaR
44      classif.LiblineaRL1LogReg                 LiblineaR
45       classif.LiblineaRL2L1SVC                 LiblineaR
46      classif.LiblineaRL2LogReg                 LiblineaR
47         classif.LiblineaRL2SVC                 LiblineaR
48 classif.LiblineaRMultiClassSVC                 LiblineaR
49                  classif.linDA               DiscriMiner
50                 classif.logreg                     stats
51                  classif.lssvm                   kernlab
52                   classif.lvq1                     class
53                    classif.mda                       mda
54                    classif.mlp                     RSNNS
55               classif.multinom                      nnet
56             classif.naiveBayes                     e1071
57              classif.neuralnet                 neuralnet
58                   classif.nnet                      nnet
59                classif.nnTrain                   deepnet
60            classif.nodeHarvest               nodeHarvest
61                   classif.OneR                     RWeka
62                   classif.pamr                      pamr
63                   classif.PART                     RWeka
64              classif.penalized                 penalized
65                    classif.plr                   stepPlr
66             classif.plsdaCaret                 caret,pls
67                 classif.probit                     stats
68                    classif.qda                      MASS
69                  classif.quaDA               DiscriMiner
70           classif.randomForest              randomForest
71        classif.randomForestSRC           randomForestSRC
72                 classif.ranger                    ranger
73                    classif.rda                      klaR
74                 classif.rFerns                    rFerns
75                   classif.rknn                      rknn
76         classif.rotationForest            rotationForest
77                  classif.rpart                     rpart
78                    classif.RRF                       RRF
79                  classif.rrlda                     rrlda
80                 classif.saeDNN                   deepnet
81                    classif.sda                       sda
82              classif.sparseLDA sparseLDA,MASS,elasticnet
83                    classif.svm                     e1071
84                classif.xgboost                   xgboost

Source: Analytics Vidhya

The entire structure of this package relies on this premise:

\[\text{Create a Task. Make a Learner. Train Them.}\]
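A minimal end-to-end sketch of that premise; the dataset and learner choice are illustrative:

library(mlr)
task <- makeClassifTask(data = iris, target = "Species")  # 1. create a task
lrn  <- makeLearner("classif.rpart")                      # 2. make a learner
mod  <- train(lrn, task)                                  # 3. train it
pred <- predict(mod, task = task)
performance(pred, measures = acc)                         # accuracy on the task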

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R (1st ed.). New York, NY: Springer.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, MA: The MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York, NY: Springer.

Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (1st ed.). Sebastopol, CA: O'Reilly Media.

Rhys, H. (2020). Machine Learning with R, the tidyverse, and mlr (1st ed.). Manning Publications.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2021, April 14). HaiBiostat: Machine Learning & Predictive Analytics. Retrieved from https://hai-mn.github.io/posts/2021-04-14-machine-learning-&-predictive-analytics/

BibTeX citation

@misc{nguyen2021machine,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Machine Learning & Predictive Analytics},
  url = {https://hai-mn.github.io/posts/2021-04-14-machine-learning-&-predictive-analytics/},
  year = {2021}
}