An Overview of Machine Learning & Predictive Analytics
Machine learning looks for meaningful transformations from input data to output data.
Transformations represent or encode the data (for example, RGB or HSV for a color pixel).
Learning is an automatic search for better data representations.
The search runs through a predefined space of possibilities, guided by a feedback signal.
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E”
-Tom Mitchell, Machine Learning, McGraw Hill, 1997
Experience E, Task T, Performance P
1. Chess: T: playing chess, P: % of games won, E: playing practice games against itself.
2. Driving: T: driving a vehicle, P: average distance before error, E: sequence of images and steering commands recorded during manual driving.
3. Handwriting Recognition: T: recognizing and classifying handwritten words in images, P: % of correctly classified words, E: DB of handwritten words with given classifications.
Supervised Learning Examples:
Unsupervised learning: Clustering and Association Problems.
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy \(A\) also tend to buy \(B\).
Unsupervised Learning Examples:
Clustering
Visualization and Dimensionality Reduction
Association Rule
Neural Networks
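To make the association example concrete ("people that buy \(A\) also tend to buy \(B\)"), here is a minimal sketch that scores one rule with two standard measures, support and confidence. The basket data and the function names are made up for illustration and are not from the original post.

```python
# Toy market-basket data (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support(bread, butter) = {rule_support:.2f}")
print(f"confidence(bread -> butter) = {rule_confidence:.2f}")
```

In this toy data, three of the five baskets contain both items and four contain bread, so the rule "bread -> butter" has support 0.60 and confidence 0.75.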
Active learning (sometimes called “query learning” or “optimal experimental design” in the statistics literature) is a subfield of machine learning and, more generally, artificial intelligence.
The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns—to be “curious,” if you will—it will perform better with less training.
Active learning is a special case of semi-supervised learning.
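One simple way a learner can "choose the data from which it learns" is uncertainty sampling: ask for the label of the example the current model is least sure about. The sketch below assumes a binary classifier that outputs a probability for each unlabeled example. The probabilities and the function name are hypothetical, and this is only one of several possible query strategies.

```python
def least_confident_index(probabilities):
    """Return the index of the example whose predicted P(class = 1)
    is closest to 0.5, i.e. where a binary classifier is least sure."""
    return min(range(len(probabilities)),
               key=lambda i: abs(probabilities[i] - 0.5))


# Hypothetical predicted probabilities for five unlabeled examples.
pool_probabilities = [0.95, 0.10, 0.55, 0.80, 0.30]

query = least_confident_index(pool_probabilities)
print(f"Ask the oracle to label example {query}")
```

Here the example with probability 0.55 (index 2) is queried, because a label for it is expected to be the most informative for the current model.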
Reinforcement learning is learning what to do, that is, how to map situations to actions.
The goal is to maximize a numerical reward signal.
The learner is not told which action to take, but instead must discover which action will yield the maximum reward.
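The idea of discovering the best action purely from reward can be sketched with a multi-armed bandit and an epsilon-greedy rule. This is an illustrative toy, not something from the original post: the reward probabilities, `epsilon`, and the function name are all assumptions.

```python
import random


def run_epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn action values for a Bernoulli multi-armed bandit by trial and error."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    estimates = [0.0] * n_actions   # running mean reward per action
    counts = [0] * n_actions        # how often each action was tried
    total_reward = 0

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: pick a random action
        else:
            # exploit: pick the action with the highest estimated reward
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # incremental update of the running mean for the chosen action
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, counts, total_reward


# The learner never sees these probabilities; it only observes rewards.
estimates, counts, total_reward = run_epsilon_greedy([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated values:", [round(v, 2) for v in estimates])
print("best action found:", best_action)
```

The occasional random action is what lets the learner find out that an arm it has not tried much pays better than its current favorite.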
Dataset
Batch – the set of examples used in one iteration of model training.
Mini-batch – a small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference.
Epoch – a full training pass over the entire data set such that each example has been seen once.
Iteration – a single update of a model’s weights during training.
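How these terms fit together can be shown with a small counting sketch. It assumes a toy data set of ten examples and a mini-batch size of four; both numbers and the helper name `minibatches` are chosen only for illustration.

```python
import math
import random


def minibatches(examples, batch_size, rng):
    """Yield shuffled mini-batches that together cover `examples` exactly once."""
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


dataset = list(range(10))   # ten toy "examples"
batch_size = 4
n_epochs = 3
rng = random.Random(0)

iterations = 0
for epoch in range(n_epochs):
    seen = []
    for batch in minibatches(dataset, batch_size, rng):
        iterations += 1          # one weight update would happen here
        seen.extend(batch)
    assert sorted(seen) == dataset   # each example is seen once per epoch

iterations_per_epoch = math.ceil(len(dataset) / batch_size)
print(f"{iterations_per_epoch} iterations per epoch, {iterations} in total")
```

With ten examples and mini-batches of four, each epoch takes three iterations (batches of 4, 4, and 2), so three epochs give nine weight updates.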
Overfitting – the model fits the training data very well, to the point that it also picks up patterns in the noise.
Detect:
Remedies:
Underfitting – the model is too simple to detect the patterns in the data.
Detect:
Remedies:
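A common way to tell the two failure modes apart is to compare the error on the training data with the error on held-out data. The sketch below uses k-nearest-neighbour regression on a noisy sine curve. The model and the data are assumptions picked for illustration, not part of the original post.

```python
import math
import random


def knn_predict(x, train, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(data, train, k):
    """Mean squared error of k-NN predictions on `data`."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)


rng = random.Random(0)


def noisy_sine(x):
    return math.sin(x) + rng.gauss(0, 0.3)


# Assumed toy data: a noisy sine curve, split into training and held-out sets.
train = [(i * 0.2, noisy_sine(i * 0.2)) for i in range(30)]
held_out = [(i * 0.2 + 0.1, noisy_sine(i * 0.2 + 0.1)) for i in range(30)]

for k in (1, 5, 30):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"held-out MSE={mse(held_out, train, k):.3f}")
```

With k = 1 the model memorizes the training set (training error is zero) but does worse on held-out points, which is the signature of overfitting. With k = 30 every prediction is the overall mean, so the error is high on both sets, which is the signature of underfitting.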
\[ y = f(x) \]
Estimate the unknown function \(f\) as \(\hat{f}\)
Parametric Models:
Nonparametric Models:
OLS
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
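The three gradient descent variants listed above differ only in how many examples are used to compute each update. A minimal sketch for a one-feature linear model with mean squared error follows. The learning rate, epoch count, and toy data are assumptions chosen so that the example converges; they are not tuned recommendations.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.1, epochs=2000, seed=0):
    """Fit y = theta0 + theta1 * x by minimizing mean squared error.

    batch_size = len(xs) -> batch gradient descent
    batch_size = 1       -> stochastic gradient descent
    anything in between  -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            g0 = g1 = 0.0
            for i in batch:
                error = theta0 + theta1 * xs[i] - ys[i]
                g0 += 2 * error
                g1 += 2 * error * xs[i]
            theta0 -= lr * g0 / len(batch)
            theta1 -= lr * g1 / len(batch)
    return theta0, theta1


# Noise-free toy data generated from y = 1 + 2x (an assumption for illustration).
xs = [i / 10 for i in range(11)]
ys = [1 + 2 * x for x in xs]

for name, size in [("batch", len(xs)), ("mini-batch", 4), ("stochastic", 1)]:
    theta0, theta1 = gradient_descent(xs, ys, batch_size=size)
    print(f"{name:>10}: theta0={theta0:.3f}, theta1={theta1:.3f}")
```

On this noise-free data all three variants approach the same parameters. On noisy data the stochastic and mini-batch versions would keep fluctuating around the minimum unless the learning rate is reduced over time.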
\[ y = \theta^T X \]
Minimizing the cost function has a closed-form solution called the Normal Equation: \[ \hat{\theta} = (X^T X)^{-1} X^T y \]
Advantage – the cost of evaluating the equation grows linearly with the size of the training set, so it can handle large training sets efficiently.
Disadvantage –
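As a worked instance of the Normal Equation, the sketch below solves it by hand for a model with an intercept and a single feature, where \(X^T X\) is only a 2x2 matrix. The data and the function name are made up for illustration. For more features one would normally use a linear algebra library instead.

```python
def normal_equation_simple(xs, ys):
    """Solve theta = (X^T X)^{-1} X^T y for the model y = theta0 + theta1 * x.

    X has a column of ones (intercept) and a column with the x values,
    so X^T X is a 2x2 matrix that can be inverted by hand.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # X^T X = [[n, sum_x], [sum_x, sum_xx]],  X^T y = [sum_y, sum_xy]
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("X^T X is singular; the x values must not all be equal")

    # Multiply the explicit inverse of the 2x2 matrix by X^T y.
    theta0 = (sum_xx * sum_y - sum_x * sum_xy) / det
    theta1 = (n * sum_xy - sum_x * sum_y) / det
    return theta0, theta1


# Toy data lying exactly on y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = normal_equation_simple(xs, ys)
print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")
```

For this data the solution is recovered in one step, with no learning rate or iterations: theta0 = 1.000 and theta1 = 2.000.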
There is something of a rivalry between the two most commonly used data science languages: R and Python. Of course, there are no machine learning tasks which are only possible to apply in one language or the other.
R: the tidyverse, caret, and mlr packages (mlr stands for machine learning in R). While quite similar in purpose and functionality to caret, the mlr package provides an interface for a large number of machine learning algorithms, and allows you to perform extremely complicated machine learning tasks with very little coding.
Python: the scikit-learn package, which has a plethora of machine learning algorithms built into it.
Google Trends demonstrates the search interest relative to the highest point on the chart for the given region and over the past 5 years.
mlr Package in R
R users have the mlr package, which is similar to scikit-learn in Python. The package brings together the ML functions from other packages, so most ML tasks can be performed through it. The mlr package has several algorithms in its bouquet. These algorithms are categorized into regression, classification, clustering, survival, multiclassification and cost-sensitive classification:
# Load mlr, then list each classification learner with the package(s) it wraps
library(mlr)
listLearners("classif")[c("class", "package")]
class package
1 classif.ada ada,rpart
2 classif.adaboostm1 RWeka
3 classif.bartMachine bartMachine
4 classif.binomial stats
5 classif.boosting adabag,rpart
6 classif.bst bst,rpart
7 classif.C50 C50
8 classif.cforest party
9 classif.clusterSVM SwarmSVM,LiblineaR
10 classif.ctree party
11 classif.cvglmnet glmnet
12 classif.dbnDNN deepnet
13 classif.dcSVM SwarmSVM,e1071
14 classif.earth earth,stats
15 classif.evtree evtree
16 classif.extraTrees extraTrees
17 classif.fdausc.glm fda.usc
18 classif.fdausc.kernel fda.usc
19 classif.fdausc.knn fda.usc
20 classif.fdausc.np fda.usc
21 classif.FDboost FDboost,mboost
22 classif.featureless mlr
23 classif.fgam refund
24 classif.fnn FNN
25 classif.gamboost mboost
26 classif.gaterSVM SwarmSVM
27 classif.gausspr kernlab
28 classif.gbm gbm
29 classif.geoDA DiscriMiner
30 classif.glmboost mboost
31 classif.glmnet glmnet
32 classif.h2o.deeplearning h2o
33 classif.h2o.gbm h2o
34 classif.h2o.glm h2o
35 classif.h2o.randomForest h2o
36 classif.IBk RWeka
37 classif.J48 RWeka
38 classif.JRip RWeka
39 classif.kknn kknn
40 classif.knn class
41 classif.ksvm kernlab
42 classif.lda MASS
43 classif.LiblineaRL1L2SVC LiblineaR
44 classif.LiblineaRL1LogReg LiblineaR
45 classif.LiblineaRL2L1SVC LiblineaR
46 classif.LiblineaRL2LogReg LiblineaR
47 classif.LiblineaRL2SVC LiblineaR
48 classif.LiblineaRMultiClassSVC LiblineaR
49 classif.linDA DiscriMiner
50 classif.logreg stats
51 classif.lssvm kernlab
52 classif.lvq1 class
53 classif.mda mda
54 classif.mlp RSNNS
55 classif.multinom nnet
56 classif.naiveBayes e1071
57 classif.neuralnet neuralnet
58 classif.nnet nnet
59 classif.nnTrain deepnet
60 classif.nodeHarvest nodeHarvest
61 classif.OneR RWeka
62 classif.pamr pamr
63 classif.PART RWeka
64 classif.penalized penalized
65 classif.plr stepPlr
66 classif.plsdaCaret caret,pls
67 classif.probit stats
68 classif.qda MASS
69 classif.quaDA DiscriMiner
70 classif.randomForest randomForest
71 classif.randomForestSRC randomForestSRC
72 classif.ranger ranger
73 classif.rda klaR
74 classif.rFerns rFerns
75 classif.rknn rknn
76 classif.rotationForest rotationForest
77 classif.rpart rpart
78 classif.RRF RRF
79 classif.rrlda rrlda
80 classif.saeDNN deepnet
81 classif.sda sda
82 classif.sparseLDA sparseLDA,MASS,elasticnet
83 classif.svm e1071
84 classif.xgboost xgboost
The entire structure of this package relies on this premise:
\[\text{Create a Task. Make a Learner. Train Them.}\]
Create a Task (makeClassifTask).
Make a Learner (makeLearner), which learns from the task (or data).
Train them (train).
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nguyen (2021, April 14). HaiBiostat: Machine Learning & Predictive Analytics. Retrieved from https://hai-mn.github.io/posts/2021-04-14-machine-learning-&-predictive-analytics/
BibTeX citation
@misc{nguyen2021machine, author = {Nguyen, Hai}, title = {HaiBiostat: Machine Learning & Predictive Analytics}, url = {https://hai-mn.github.io/posts/2021-04-14-machine-learning-&-predictive-analytics/}, year = {2021} }