Machine Learning & Predictive Analytics

Biostatistics Machine Learning

An Overview of Machine Learning & Predictive Analytics

Hai Nguyen
April 14, 2021

What is Machine Learning?

  1. Machine learning searches for meaningful transformations of data, from input data to output data.

  2. Transformations represent, or encode, the data (e.g., RGB and HSV are two ways to encode a color pixel).

  3. Learning is an automatic search for better representations of the data.

  4. The search proceeds through a predefined space of possibilities, guided by a feedback signal.

  5. “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E”

    - Tom Mitchell, Machine Learning, McGraw Hill, 1997

Experience E, Task T, Performance P
1. Chess: T: playing chess, P: % of games won, E: playing practice games against itself.
2. Driving: T: driving a vehicle, P: average distance traveled before an error, E: sequence of images and steering commands recorded during manual driving.
3. Handwriting Recognition: T: recognizing and classifying handwritten words in images, P: % of correctly classified words, E: DB of handwritten words with given classifications.

Learning Types

Supervised

  1. The majority of practical machine learning uses supervised learning.
  2. Supervised learning is where you have input variables (\(x\)) and an output variable (\(y\)) and you use an algorithm to learn the mapping function from the input to the output.
    \[ y = f(x) \]
  3. The goal is to approximate the mapping function so well that, given new input data \(x\), you can predict the output variables \(y\) for that data.
  4. It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.
  5. We know the correct answers: the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.

Supervised Learning Examples:
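For instance, here is a minimal sketch in base R of the two classic supervised tasks, regression and classification; the datasets and models are illustrative choices:

# Regression: learn f mapping weight -> miles per gallon.
fit_reg <- lm(mpg ~ wt, data = mtcars)
predict(fit_reg, newdata = data.frame(wt = 3))   # predict y for a new x

# Classification: learn f mapping petal measurements -> species.
library(class)                                   # knn() lives in 'class'
set.seed(42)
train_idx <- sample(nrow(iris), 100)
pred <- knn(train = iris[train_idx, 3:4],
            test  = iris[-train_idx, 3:4],
            cl    = iris$Species[train_idx], k = 5)
mean(pred == iris$Species[-train_idx])           # accuracy on held-out data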

Unsupervised Learning

  1. Unsupervised learning is where you only have input data \(x\) and no corresponding output variables.
  2. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
  3. It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left to their own devices to discover and present the interesting structure in the data.
  4. Unsupervised Learning problems can be further grouped into Clustering and Association Problems.
  5. Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
  6. Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy \(A\) also tend to buy \(B\).

Unsupervised Learning Examples:
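For instance, a minimal clustering sketch in base R; the dataset and number of clusters are illustrative assumptions:

# k-means clustering on iris: the species labels are ignored,
# we only look for structure in the inputs.
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(km$cluster)                 # sizes of the discovered groups
table(km$cluster, iris$Species)   # compare clusters with the (unused) labels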

Semi-supervised Learning

  1. Semi-supervised learning is halfway between supervised and unsupervised learning.
  2. Traditional classification methods use labeled data to build classifiers.
  3. The labeled training sets used as input in supervised learning are well defined and reliable.
  4. However, labeled data are limited, expensive, and time-consuming to generate.
  5. On the other hand, unlabeled data are cheap and readily available in large volumes.
  6. Hence, semi-supervised learning learns from a combination of a small amount of labeled data and a large amount of unlabeled data to increase the accuracy of our classifiers, as sketched below.
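One common recipe is self-training. The sketch below is a hedged illustration under assumed settings (the dataset, the logistic-regression model, and the 0.95 confidence cutoff are all illustrative), not the only way to do semi-supervised learning:

# Self-training: fit on a few labeled points, then add
# high-confidence predictions on unlabeled points as pseudo-labels.
set.seed(1)
d <- iris[iris$Species != "setosa", ]              # binary problem
d$y <- as.integer(d$Species == "versicolor")
labeled   <- sample(nrow(d), 10)                   # only 10 labeled examples
unlabeled <- setdiff(seq_len(nrow(d)), labeled)

fit <- glm(y ~ Petal.Length + Petal.Width,
           data = d[labeled, ], family = binomial)
p <- predict(fit, d[unlabeled, ], type = "response")
conf_idx <- which(p > 0.95 | p < 0.05)             # confident predictions
pseudo   <- unlabeled[conf_idx]
d$y[pseudo] <- as.integer(p[conf_idx] > 0.5)       # pseudo-label them
fit2 <- glm(y ~ Petal.Length + Petal.Width,        # retrain on the union
            data = d[c(labeled, pseudo), ], family = binomial)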

Active Learning

  1. A special case of semi-supervised learning in which the algorithm interactively queries an oracle (e.g., a human annotator) to label the most informative unlabeled examples.

Reinforcement Learning

  1. An agent learns by interacting with an environment: it takes actions, receives rewards or penalties as feedback, and updates its behavior to maximize cumulative reward.

Transfer Learning

  1. A machine learning technique in which a model trained on one task is repurposed for a second, related task.
  2. An optimization that allows rapid progress or improved performance when modeling the second task.

Universal Workflow of ML

  1. Define the problem
  2. Assemble dataset
  3. Choose a metric to quantify project outcome
  4. Decide on how to calculate the metric
  5. Prepare dataset
  6. Define standard baseline
  7. Develop model that beats baseline
  8. The ideal model is at the border between underfitting and overfitting; to know where that border is, you must cross it, i.e., deliberately develop a model that overfits first
  9. Regularize model and tune hyperparameters

ML Terminologies

  1. Dataset

    • Training: learn the parameters
    • Validation: select the hyperparameters
    • Test: estimate the generalization error of the final model
  2. Batch: the set of examples used in one iteration of model training.

  3. Mini-batch: a small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference.

  4. Epoch: a full training pass over the entire dataset such that each example has been seen once.

  5. Iteration: a single update of a model's weights during training; the arithmetic relating these terms is sketched below.
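To make the bookkeeping concrete, here is a small worked example; the numbers are illustrative assumptions:

n_examples <- 10000                          # training-set size
batch_size <- 100                            # mini-batch size
iters_per_epoch <- n_examples / batch_size   # 100 weight updates per epoch
n_epochs <- 20
iters_per_epoch * n_epochs                   # 2000 iterations in total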

Data Assumptions

  1. Training and test data are from the same probability distribution.
  2. Training and test data are independent and identically distributed (i.i.d.).

Overfitting and Underfitting

  1. Overfitting: the model fits the training data too well, i.e., it also learns patterns in the noise (see the simulation sketch after this list)

    1. Detect:

      • Low training error, high generalization error.
    2. Remedies:

      • Reduce model capacity by removing features and/or parameters.
      • Get more training data.
      • Improve training data quality by reducing noise.
  2. Underfitting: the model is too simple to detect the patterns in the data

    1. Detect

      • High training error.
    2. Remedies:

      • Increase model capacity by adding more parameters and/or features.
      • Reduce model constraints.
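Both failure modes can be seen side by side in a small simulation. This is a hedged sketch: the data-generating process and the polynomial degrees are illustrative assumptions:

# Fit polynomials of increasing degree and compare train/test error.
set.seed(7)
x <- runif(60, 0, 1)
y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)
train <- 1:40; test <- 41:60

for (deg in c(1, 3, 15)) {
  fit <- lm(y ~ poly(x, deg), subset = train)
  mse <- function(idx) mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2)
  cat(sprintf("degree %2d: train MSE %.3f, test MSE %.3f\n",
              deg, mse(train), mse(test)))
}
# Degree 1 underfits (high training error); degree 15 overfits
# (low training error, high test error); degree 3 is near the border.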

Parametric & Nonparametric Models

\[ y = f(x) \]

Estimate the unknown function \(f\) with \(\hat{f}\). Parametric methods assume a functional form for \(f\) (e.g., linear) and reduce the problem to estimating a fixed, finite set of parameters; nonparametric methods make no explicit assumption about the form of \(f\).
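As a sketch of the contrast, compare a parametric and a nonparametric estimate of the same \(f\) on a built-in dataset; the dataset choice is an illustrative assumption:

fit_param    <- lm(dist ~ speed, data = cars)     # parametric: assumes f is linear
fit_nonparam <- loess(dist ~ speed, data = cars)  # nonparametric: assumes only smoothness
new <- data.frame(speed = 15)
predict(fit_param, new)                           # estimate of f(15) under each model
predict(fit_nonparam, new)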

Regression Analysis

  1. OLS

    • MSE
    • Computational Complexity of matrix inversion
    • Complete training set
  2. Batch Gradient Descent (sketched in code after this list)

    • Cost function (MSE)
    • Learning rate hyperparameter
    • Partial derivative
    • Complete training set
  3. Stochastic Gradient Descent

  4. Mini-batch Gradient Descent
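As an illustration of item 2, here is a hedged base-R sketch of batch gradient descent for linear regression; the learning rate, iteration count, and simulated data are illustrative assumptions:

# Batch gradient descent for linear regression on simulated data.
set.seed(123)
n <- 100
X <- cbind(1, runif(n))                  # design matrix with intercept column
y <- X %*% c(4, 3) + rnorm(n)            # true theta = (4, 3) plus noise

eta   <- 0.1                             # learning rate hyperparameter
theta <- c(0, 0)                         # initial parameter guess
for (i in 1:1000) {
  grad  <- (2 / n) * t(X) %*% (X %*% theta - y)   # partial derivatives of MSE
  theta <- theta - eta * as.vector(grad)          # update uses the complete training set
}
theta                                    # converges near c(4, 3)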

Linear Regression with OLS

\[ y = \theta^T X \]

The cost function minimization has a closed-form solution called the Normal Equation: \[ \hat{\theta} = (X^T X)^{-1} X^T y \]
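The Normal Equation can be computed directly in base R and checked against lm(); the simulated data are an illustrative assumption, and note that in practice lm() solves the least-squares problem via a QR decomposition because inverting \(X^T X\) can be numerically unstable:

set.seed(123)
n <- 100
X <- cbind(1, runif(n))                          # intercept + one feature
y <- X %*% c(4, 3) + rnorm(n)                    # true theta = (4, 3)
theta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # (X^T X)^{-1} X^T y
theta_hat                                        # close to c(4, 3)
coef(lm(y ~ X[, 2]))                             # lm() gives the same estimates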

Use R or Python for machine learning?

There is something of a rivalry between the two most commonly used data science languages, R and Python. Of course, no machine learning task can be done in only one of the two languages.

R:

Python:

Google Trends shows the search interest in each language over the past 5 years, relative to the highest point on the chart for the given region.

Machine Learning with mlr Package in R

R users have the mlr package, which is similar to Python's Scikit-Learn. The package unifies ML functions from many other packages, so most ML tasks can be performed through a single interface. mlr has a large bouquet of algorithms, categorized into regression, classification, clustering, survival, multilabel classification, and cost-sensitive classification:

library(mlr)
listLearners("classif")[c("class","package")]
                            class                   package
1                     classif.ada                 ada,rpart
2              classif.adaboostm1                     RWeka
3             classif.bartMachine               bartMachine
4                classif.binomial                     stats
5                classif.boosting              adabag,rpart
6                     classif.bst                 bst,rpart
7                     classif.C50                       C50
8                 classif.cforest                     party
9              classif.clusterSVM        SwarmSVM,LiblineaR
10                  classif.ctree                     party
11               classif.cvglmnet                    glmnet
12                 classif.dbnDNN                   deepnet
13                  classif.dcSVM            SwarmSVM,e1071
14                  classif.earth               earth,stats
15                 classif.evtree                    evtree
16             classif.extraTrees                extraTrees
17             classif.fdausc.glm                   fda.usc
18          classif.fdausc.kernel                   fda.usc
19             classif.fdausc.knn                   fda.usc
20              classif.fdausc.np                   fda.usc
21                classif.FDboost            FDboost,mboost
22            classif.featureless                       mlr
23                   classif.fgam                    refund
24                    classif.fnn                       FNN
25               classif.gamboost                    mboost
26               classif.gaterSVM                  SwarmSVM
27                classif.gausspr                   kernlab
28                    classif.gbm                       gbm
29                  classif.geoDA               DiscriMiner
30               classif.glmboost                    mboost
31                 classif.glmnet                    glmnet
32       classif.h2o.deeplearning                       h2o
33                classif.h2o.gbm                       h2o
34                classif.h2o.glm                       h2o
35       classif.h2o.randomForest                       h2o
36                    classif.IBk                     RWeka
37                    classif.J48                     RWeka
38                   classif.JRip                     RWeka
39                   classif.kknn                      kknn
40                    classif.knn                     class
41                   classif.ksvm                   kernlab
42                    classif.lda                      MASS
43       classif.LiblineaRL1L2SVC                 LiblineaR
44      classif.LiblineaRL1LogReg                 LiblineaR
45       classif.LiblineaRL2L1SVC                 LiblineaR
46      classif.LiblineaRL2LogReg                 LiblineaR
47         classif.LiblineaRL2SVC                 LiblineaR
48 classif.LiblineaRMultiClassSVC                 LiblineaR
49                  classif.linDA               DiscriMiner
50                 classif.logreg                     stats
51                  classif.lssvm                   kernlab
52                   classif.lvq1                     class
53                    classif.mda                       mda
54                    classif.mlp                     RSNNS
55               classif.multinom                      nnet
56             classif.naiveBayes                     e1071
57              classif.neuralnet                 neuralnet
58                   classif.nnet                      nnet
59                classif.nnTrain                   deepnet
60            classif.nodeHarvest               nodeHarvest
61                   classif.OneR                     RWeka
62                   classif.pamr                      pamr
63                   classif.PART                     RWeka
64              classif.penalized                 penalized
65                    classif.plr                   stepPlr
66             classif.plsdaCaret                 caret,pls
67                 classif.probit                     stats
68                    classif.qda                      MASS
69                  classif.quaDA               DiscriMiner
70           classif.randomForest              randomForest
71        classif.randomForestSRC           randomForestSRC
72                 classif.ranger                    ranger
73                    classif.rda                      klaR
74                 classif.rFerns                    rFerns
75                   classif.rknn                      rknn
76         classif.rotationForest            rotationForest
77                  classif.rpart                     rpart
78                    classif.RRF                       RRF
79                  classif.rrlda                     rrlda
80                 classif.saeDNN                   deepnet
81                    classif.sda                       sda
82              classif.sparseLDA sparseLDA,MASS,elasticnet
83                    classif.svm                     e1071
84                classif.xgboost                   xgboost

Source: Analytics Vidhya

The entire structure of this package relies on this premise:

\[\text{Create a Task. Make a Learner. Train Them.}\]
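A minimal end-to-end sketch of that premise; the dataset and learner choice are illustrative:

library(mlr)
task <- makeClassifTask(data = iris, target = "Species")  # 1. create a task
lrn  <- makeLearner("classif.rpart")                      # 2. make a learner
mod  <- train(lrn, task)                                  # 3. train it
pred <- predict(mod, task = task)
performance(pred, measures = acc)                         # accuracy on the task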

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R (1st ed.). New York, NY: Springer.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, MA: The MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York, NY: Springer.

Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (1st ed.). Sebastopol, CA: O'Reilly Media.

Rhys, H. (2020). Machine Learning with R, the tidyverse, and mlr (1st ed.). Manning Publications.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2021, April 14). HaiBiostat: Machine Learning & Predictive Analytics. Retrieved from https://hai-mn.github.io/posts/2021-04-14-machine-learning-&-predictive-analytics/

BibTeX citation

@misc{nguyen2021machine,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Machine Learning & Predictive Analytics},
  url = {https://hai-mn.github.io/posts/2021-04-14-machine-learning-&-predictive-analytics/},
  year = {2021}
}