
Machine Learning Supervised Algorithms

This project focuses on assessing the most common supervised machine learning algorithms using Python and an open dataset from Kaggle. The code and dataset can be found in the links below.

(use cross_validation = 0)

The dataset is extensive, containing information for 70k patients with 12 features each, plus a target output class indicating whether the patient was diagnosed with cardiovascular disease or not.

The first step before applying any algorithm is to understand and scrub the input data. Therefore, let's plot the features to get familiar with the dataset under analysis. However, when the number of features is high, visualizing and understanding the data might be hard. In this case, a histogram per feature is used.
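As an illustration, a per-feature histogram split by output class could be generated as in the sketch below, assuming the Kaggle CSV is saved locally as cardio_train.csv with a ';' separator and the original column names (these details are assumptions; adjust to your copy of the dataset).

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (file name and separator are assumptions; adjust to your copy)
df = pd.read_csv("cardio_train.csv", sep=";")

features = [c for c in df.columns if c not in ("id", "cardio")]
fig, axes = plt.subplots(4, 3, figsize=(12, 12))

for ax, feature in zip(axes.ravel(), features):
    # One histogram per feature, overlapping the two output classes
    ax.hist(df.loc[df["cardio"] == 0, feature], bins=30, alpha=0.5, label="cardio = 0")
    ax.hist(df.loc[df["cardio"] == 1, feature], bins=30, alpha=0.5, label="cardio = 1")
    ax.set_title(feature)

axes[0, 0].legend()
plt.tight_layout()
plt.show()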

HISTOGRAM FOR ORIGINAL DATA
The histogram is a good tool to get a preliminary overview of the dataset. For example, at a quick glance, lower age seems to lead to output = 0. In the same way, cholesterol = 0 seems to lead to output = 0 too. In contrast, features such as smoke or alco look uncorrelated to the output class. However, as a general comment, the dataset does not seem to show a strong correlation between the output class and the input features, so building a model close to a perfect score does not look achievable for the dataset under analysis.

In addition to the potentially low correlation, a considerable number of outliers can be observed, especially in the features age, height, weight, ap_lo and ap_hi. To confirm the outliers in these features, the boxplot is depicted below.
In light of the irregularities in the dataset, data scrubbing is needed to clean up and correct inaccurate or wrong data. The next list summarizes the main actions performed to scrub the input dataset; a minimal code sketch follows the list.

- There is a feature called ID which does not carry any meaningful information; it is just an index of the samples and so it should be removed.

- In general, the dataset contains outliers which should be removed; the boxplot above clearly shows them.

- There are numerical features which are in fact categorical since they are not continuous. For example, gender is 1 or 2 for male or female, and it cannot be 1.5. The best approach for categorical features, even when they are encoded as numbers, is to apply one-hot encoding.

- There are some linked features which must meet certain rules. For example, the diastolic blood pressure should not be higher than the systolic blood pressure.

- Also, although not identifiable in the histogram, it is required to check for empty and duplicate rows.
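A minimal pandas sketch of these scrubbing steps is shown below. The outlier thresholds are illustrative assumptions, not the exact values used in the project, and the column names are taken from the Kaggle file.

import pandas as pd

def scrub(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Drop the meaningless ID column and any duplicate or empty rows
    df = df.drop(columns=["id"]).drop_duplicates().dropna()

    # Enforce the rule: diastolic (ap_lo) must not exceed systolic (ap_hi)
    df = df[df["ap_lo"] <= df["ap_hi"]]

    # Remove outliers with illustrative range filters (thresholds are assumptions)
    df = df[df["height"].between(140, 210)]
    df = df[df["weight"].between(40, 180)]
    df = df[df["ap_hi"].between(80, 220)]
    df = df[df["ap_lo"].between(40, 140)]

    # One-hot encode the categorical-but-numeric features
    df = pd.get_dummies(df, columns=["gender", "cholesterol", "gluc"])
    return df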

HISTOGRAM FOR SCRUBBED DATA
The scrubbed histogram looks better without the outliers and other inaccurate or wrong samples. The application of one-hot encoding can be verified since there is no "Gender" feature any more; now there are "Gender_1" and "Gender_2" features which can take the values 1 or 0. The same applies to "Gluc" and "Cholesterol".

The next step is to create the training and testing sets. The split uses 20% of the original dataset as the testing set and the remaining 80% to train the model. One important point here is the ratio between output class = 0 and class = 1 in each set: if the training or testing set has few samples of one particular class, the results might lead to wrong conclusions. To that end, it is first required to assess the class distribution in the scrubbed data to check whether the dataset is imbalanced.

The distribution for the current dataset is good since there are 24091 samples for class = 0 vs 23458 samples for class = 1. Therefore, let's proceed to split the training and testing sets, shuffling the original dataset and aiming for the same class distribution with a stratified split, which leads to 6023 samples for class = 0 vs 5865 for class = 1 in the testing set. Below there is a distribution plot for both sets.

Note the stratified operation splits the original dataset keeping the class distribution constant in both the training and testing sets. Thus, in the current exercise both training and testing sets have 50.7% of data with output class = 0 (and so both also have 49.3% of data with output class = 1), which is the same as the original dataset.
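A minimal sketch of this stratified 80/20 split with scikit-learn, reusing the scrub() helper from the earlier sketch (file name and column names remain assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split

# scrub() is the helper from the earlier scrubbing sketch
df_scrubbed = scrub(pd.read_csv("cardio_train.csv", sep=";"))

X = df_scrubbed.drop(columns=["cardio"])
y = df_scrubbed["cardio"]

# 80/20 split, shuffled, keeping the class ratio constant in both sets via stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)

print(y_train.value_counts(normalize=True))  # ~50.7% / 49.3% in both sets
print(y_test.value_counts(normalize=True))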
Once the data is ready for the algorithms, let's introduce the main supervised algorithms in machine learning. Note that the next summary only tries to provide an overview of each one.

1. K NEAREST NEIGHBORS
This algorithm is one of the simplest in machine learning. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset, called the nearest neighbors.

Considering a single nearest neighbor, the prediction on the training set is perfect. In contrast, the testing set accuracy with a single neighbor is lower, indicating the model is too complex and leads to overfitting. However, when more neighbors are considered, the model becomes simpler and the training accuracy drops, but the score for predicting new data improves.

This algorithm can be used for both regression and classification, taking the mean value (regression) or the majority class (classification) of the n nearest neighbors.

The parameter to tune the algorithm is the number of neighbors, considering that:
Fewer neighbors --> More complex model --> Higher likelihood to overfit

One of the strengths of k-NN is that the model is very easy to understand, and often gives reasonable performance without a lot of adjustments. Building the nearest neighbors model is usually very fast, but when your training set is very large (either in number of features or in number of samples) prediction can be slow.
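A minimal sketch of the effect of n_neighbors, on synthetic data rather than the project dataset (scores are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # k=1 scores ~1.0 on the training set (overfitting); larger k smooths the model
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))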

2. LINEAR MODELS
For datasets with many features, linear models can be very powerful. 

There are linear models for regression exercises and other linear models for classification exercises; both are explained below.

2.1. LINEAR REGRESSION ORDINARY LEAST SQUARES
Linear regression is the simplest and most classic linear method for regression. It finds the parameters that minimize the mean squared error between the predictions and the true regression targets. The mean squared error is the mean of the squared differences between the predictions and the true values.

Linear regression has no hyperparameters, which is a benefit, but it also has no way to control model complexity. With higher-dimensional datasets, linear models become more powerful. If controlling model complexity is required, Ridge or Lasso regression can be used instead.

2.2. RIDGE REGRESSION (L2 REGULARIZATION)
Ridge regression is also a linear model for regression, in which the coefficients are chosen not only to predict well on the training data but also to have magnitudes as small as possible. This means each feature should have as little effect on the outcome as possible. This constraint is an example of regularization, which means restricting a model to avoid overfitting. Ridge regression uses L2 regularization.

The Ridge model makes a trade-off between the simplicity of the model (near-zero coefficients) and its performance on the training set. That trade-off is controlled by the parameter alpha, which works as follows.

Higher alpha --> Higher regularization --> Smaller coefficients --> Less likely to overfit
Lower alpha --> Lower regularization --> Bigger coefficients --> More likely to overfit

Therefore, if alpha is very low, meaning the regularization is negligible, Ridge and Linear regression offer the same performance.
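A minimal sketch on synthetic regression data, showing that a higher alpha shrinks the coefficient magnitudes and that a very small alpha approaches plain linear regression (values are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
for alpha in (0.001, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Higher alpha -> stronger L2 penalty -> smaller coefficient norm
    print(alpha, np.linalg.norm(ridge.coef_), "vs OLS", np.linalg.norm(ols.coef_))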

2.3. LASSO REGRESSION (L1 REGULARIZATION)
Lasso regression also restricts coefficients to be close to zero, but using L1 regularization, which causes some coefficients to be exactly zero. This makes the model easier to interpret and can reveal the most important features of the model. Thus, if you have many features and expect only a few of them to be important, Lasso might be a good choice.

Like Ridge, Lasso also has a regularization parameter, alpha, that controls how strongly coefficients are pushed toward zero, in the same way as in Ridge regression. Similarly, if alpha is very low, both Ridge and Lasso behave like Linear regression due to the lack of regularization.

In general, L1 is more aggressive than L2, so Ridge is usually preferred before trying Lasso.
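A minimal sketch of the sparsity effect of L1: as alpha grows, Lasso drives more coefficients exactly to zero (synthetic data, illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 5 of the 50 features are actually informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

for alpha in (0.01, 1.0, 10.0):
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    # Higher alpha -> more coefficients forced exactly to zero
    print(alpha, "non-zero coefficients:", np.sum(lasso.coef_ != 0))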

2.4. LOGISTIC REGRESSION AND LINEAR SVC (SUPPORT VECTOR CLASSIFIER)
Linear models are also extensively used for classification, with a formula very similar to the one for linear regression, but instead of returning the weighted sum of the features, the predicted value is compared to a certain threshold in a binary classification exercise.

Linear SVC and Logistic Regression are the most common linear models for classification exercises. Both work by defining a boundary between classes. By default, both models apply L2 regularization, in the same way Ridge does for regression, but their implementations allow selecting either L1 or L2 regularization. Both algorithms have a control parameter to define the model complexity and the likelihood to overfit: the parameter C, which defines the strength of the regularization as follows.

Higher C --> Less regularization --> Fit the training set as well as possible --> More likely to overfit
Lower C --> More regularization --> Coefficient vector pushed closer to zero --> Less likely to overfit

As in the regression case, linear models for classification might seem very restrictive in low-dimensional spaces, only allowing decision boundaries that are straight lines or planes. However, in high dimensions, linear models for classification become very powerful, and avoiding overfitting becomes increasingly important as more features are considered.

Note these algorithms can also be used for multiclass classification using the one-vs-rest method. The methodology is to create one binary boundary per class and then, for each test point, select the class whose classifier gives the highest score.
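A minimal sketch of both classifiers and the effect of C, on synthetic data (illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for C in (0.01, 1.0, 100.0):
    # Both use L2 regularization by default; higher C means weaker regularization
    logreg = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    svc = LinearSVC(C=C, max_iter=50_000).fit(X_train, y_train)
    print(C, logreg.score(X_test, y_test), svc.score(X_test, y_test))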

3. NAIVE BAYES
Naive Bayes classifiers are similar to the linear models discussed in the previous section. They tend to be faster in training, but their generalization performance is often slightly worse than that of Logistic Regression and Linear SVC.

The operating principle of this algorithm is to learn parameters by looking at each feature individually and collecting simple per-class statistics from each feature. To make a prediction, a data point is compared to the statistics of each class, and the best matching class is predicted. The three main Naive Bayes algorithms are as follows, with a short sketch after the list:

- GaussianNB can be applied to any continuous data and stores the average value as well as the standard deviation of each feature for each class. It is mostly used on very high-dimensional data.
- BernoulliNB assumes binary data and counts how often every feature of each class is non-zero. It is mostly used in text data classification.
- MultinomialNB assumes count data (that is, each feature represents an integer count of something, like how often a word appears in a sentence) and considers the average value of each feature for each class. It is also mostly used in text data classification.
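A minimal GaussianNB sketch on synthetic data (illustrative), showing the per-class statistics the model stores:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
# The model stores a per-class mean (and variance) for each feature
print("per-class feature means shape:", gnb.theta_.shape)   # (n_classes, n_features)
print("test accuracy:", gnb.score(X_test, y_test))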

4. DECISION TREE
It is an algorithm based on building a decision tree. To build the tree, the algorithm searches over all possible if/else questions on the data, called tests, and finds the one that is most informative about the target variable. Then, the data is split into two different nodes, or leaves, depending on the test result. Another question is then applied to each node to split it again into two more nodes. This recursive process yields a binary tree of decisions, with each node containing a test. The recursive partitioning of the data is repeated until a stopping criterion is met or all leaves are pure. A leaf of the tree whose data points all share the same target value is called pure.

Typically, building a tree until all leaves are pure leads to models that are very complex and highly overfit to the training data. The presence of pure leaves means the tree is 100% accurate on the training set; each data point in the training set is in a leaf that has the correct majority class. Unpruned trees are therefore prone to overfitting and do not generalize well to new data.

There are two common strategies to prevent overfitting: stopping the creation of the tree early (also called pre-pruning) or building the tree but then removing or collapsing nodes that contain little information (also called post-pruning or just pruning). Possible criteria for pre-pruning include limiting the maximum depth of the tree, limiting the maximum number of leaves, or requiring a minimum number of points in a node to keep splitting it. 
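A minimal sketch of pre-pruning a decision tree through max_depth, on synthetic data (illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for depth in (None, 4):
    # depth=None grows the tree until all leaves are pure (training score ~1.0, overfit);
    # a small max_depth pre-prunes the tree and usually generalizes better
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))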

5. RANDOM FOREST
The main drawback of decision trees is that they tend to overfit the training data. Random forests are one way to address this problem. A random forest is essentially a collection of decision trees, where each tree is slightly different from the others.

The idea behind random forests is that each tree might do a relatively good job of predicting but will likely be overfit on part of the data. If many trees are built, all of which work well and overfit in different ways, the amount of overfitting can be reduced by averaging their results. 

To implement this strategy, many decision trees need to be built. Each tree should do an acceptable job of predicting the target and should also be different from the other trees. Random forests get their name from injecting randomness into the tree-building process to ensure each tree is different. There are two ways in which the trees in a random forest are randomized: by selecting a random subset of the data points used to build each tree, and by selecting a random subset of the features in each split test.
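A minimal sketch exposing both sources of randomness as parameters, on synthetic data (illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees to average
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a random bootstrap sample of the data
    random_state=0,
).fit(X_train, y_train)

print(forest.score(X_train, y_train), forest.score(X_test, y_test))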

6. GRADIENT BOOSTING
The gradient boosted regression tree is another ensemble method that combines multiple decision trees to create a more powerful model. In contrast to the random forest approach, gradient boosting works by building trees in a serial manner, where each tree tries to correct the mistakes of the previous one. By default, there is no randomization in gradient boosted regression trees; instead, strong pre-pruning is used. Gradient boosted trees often use very shallow trees, of depth one to five, which makes the model smaller in terms of memory and makes predictions faster.

Apart from the pre-pruning and the number of trees in the ensemble, another important parameter of gradient boosting is the learning_rate, which controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models. Adding more trees to the ensemble, which can be accomplished by increasing n_estimators, also increases the model complexity, as the model has more chances to correct mistakes on the training set.
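A minimal sketch of these parameters on synthetic data (illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for n_estimators, learning_rate in ((100, 0.1), (500, 0.5)):
    gb = GradientBoostingClassifier(
        n_estimators=n_estimators,    # number of shallow trees built sequentially
        learning_rate=learning_rate,  # how strongly each tree corrects the previous ones
        max_depth=3,                  # strong pre-pruning: very shallow trees
        random_state=0,
    ).fit(X_train, y_train)
    print(n_estimators, learning_rate,
          gb.score(X_train, y_train), gb.score(X_test, y_test))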

7. KERNELIZED SUPPORT VECTOR MACHINE
The main idea here is that adding nonlinear features to the representation of the data can make linear models much more powerful. The kernel trick allows computing the distance between data points in the expanded feature representation directly, without ever actually computing the expansion. Accurately separating classes might not be possible with a simple line in the original space depending on the dataset, but hyperplanes in a higher-dimensional space can help optimize the classification.

In practice, the algorithm focuses on the subset of training points lying on the border between classes to define the decision boundary. These training points are called support vectors and give the support vector machine its name. To make a prediction for a new point, the distance to each of the support vectors is measured. A classification decision is made based on the distances to the support vectors and the importance of each support vector learned during training.
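A minimal sketch of an RBF-kernel SVC on synthetic data (illustrative); note the scaling step, which matters for SVMs as discussed in the results below:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# RBF kernel: higher C and gamma -> more complex boundary -> more likely to overfit
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)

print(svm.score(X_train, y_train), svm.score(X_test, y_test))
# The fitted support vectors are the training points that define the boundary
print("support vectors per class:", svm.named_steps["svc"].n_support_)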

8. NEURAL NETWORK - MULTILAYER PERCEPTRONS (MLP)
A family of algorithms known as neural networks has recently seen a revival under the name "deep learning". In this project the deep details of neural networks will not be discussed, and a relatively simple method, the multilayer perceptron (MLP), will be used.

Similar to neurons in the human brain, artificial neural networks are formed by interconnected neurons, also called nodes, which interact with each other through axons, called edges. In a neural network, the nodes are stacked up in layers and generally start with a broad base. The first layer consists of raw data such as numeric values, text, images, or sound, which are divided into nodes. Each node then sends information to the next layer of nodes through the network’s edges. Each edge has a numeric weight that can be altered and formulated based on experience. If the sum of the connected edges satisfies a set threshold, known as the activation function, it will activate a neuron at the next layer. However, if the sum of the connected edges does not meet the set threshold, the activation will not be triggered.

A typical neural network can be divided into input, hidden, and output layers. Data is first received by the input layer, where broad features are detected. The hidden layers then analyze and process the data. The final result is shown as the output layer.
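A minimal MLP sketch with scikit-learn's MLPClassifier on synthetic data (illustrative); hidden_layer_sizes sets the layer structure and alpha the L2 regularization:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# One hidden layer of 100 nodes vs two layers of 50; alpha adds L2 regularization
for hidden in ((100,), (50, 50)):
    mlp = make_pipeline(
        StandardScaler(),  # MLP results are highly sensitive to feature scaling
        MLPClassifier(hidden_layer_sizes=hidden, alpha=1e-4,
                      max_iter=1000, random_state=0),
    ).fit(X_train, y_train)
    print(hidden, mlp.score(X_test, y_test))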
SIMULATION RESULTS

Once the algorithms have been introduced, let's run all of them with different parametrizations to understand their operation and check the results. The results of the simulations are captured in an Excel file to visually compare modeling and prediction time, and training and testing scores, across all algorithms and parametrizations. A screenshot of the output summary is shown below.

The scoring metric, taking into account that the dataset is not imbalanced, is accuracy, meaning the percentage of correct classifications over all classified samples. The ideal value is 1, and as there are two classes (1 and 0), the accuracy of a random classifier would be around 0.5. Therefore, our score will fall between 0.5 and 1, the higher the better.

Note that cross-validation with several splits might be applied to reach more robust results, but the objective of this project is not to find the optimal model, but to introduce and understand the most common supervised machine learning algorithms.
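For reference, this is roughly how cross-validated scores could be obtained with scikit-learn (illustrative sketch on synthetic data; the project itself relies on a single stratified split):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)

# 5-fold stratified cross-validation: five train/test splits, five accuracy scores
scores = cross_val_score(DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=5)
print(scores, scores.mean())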
The first comment is that the maximum score on the testing set is not great (around 0.73), as anticipated after assessing the dataset histogram.

The algorithms with the highest test scores are MLP, GradientBoosting, RandomForest and DecisionTree. The differences among them with the simulated settings are very minor, but the execution time of DecisionTree is meaningfully lower than the rest. Therefore, for this particular dataset, DecisionTree looks like a good trade-off between accuracy, memory usage and computational time. If computational time and memory usage are not a concern, a deeper analysis of the parametrization for GradientBoosting, RandomForest or MLP might lead to a slight increase in accuracy.

Let's also assess each algorithm's performance in the simulations performed in this project on the particular dataset under analysis.

KNN overfits with only one neighbor (train 0.99 vs test 0.63) and the model relaxes when k increases (train 0.7373 vs test 0.7121), but in exchange for a higher computational time.
Lower k --> model more complex --> more likely to overfit, but faster execution
Higher k --> smoother model --> less likely to overfit, but slower execution
Data needs to be scaled to optimize the results
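As an illustrative sketch of that scaling note, a pipeline keeps the scaler fitted on the training data only (MinMaxScaler is an assumption; any scaler works similarly):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The scaler is fit on the training set only and then applied to both sets
knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))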

LinearSVC is sensitive to data scaling and does not work well with unscaled data.
Higher C --> lower regularization --> very complex model --> more likely to overfit and execution time increases
Lower C --> higher regularization --> simpler model --> less likely to overfit and execution time is very fast
Results are highly sensitive to scaling, with convergence issues if the data is not properly scaled

Logistic Regression shows a very fast and acceptable performance for both high and low regularization, with no significant variation with the regularization parameter for this particular dataset.
Higher C --> lower regularization --> more likely to overfit
Lower C --> higher regularization --> less likely to overfit
Results are affected by scaling, improving when the data is scaled

NaiveBayes shows a very fast and acceptable performance, although worse than the linear models.
Data does not need to be scaled

DecisionTree overfits when max_depth is high, meaning the tree is very close to having all leaves pure (containing samples of a single class) and so it is too adjusted to the training data trends while offering mediocre testing performance. When max_depth is reduced, the algorithm stops overfitting and provides very fast and good performance.
Higher max depth --> More complex model --> More likely to overfit
Execution time is very fast, especially for predicting time
Data does not need to be scaled

RandomForest also overfits when using few trees (low n_estimators), soft randomization among trees (high max_features) or a high max_depth, as explained for DecisionTree. The testing performance increases when n_estimators increases and max_features and max_depth decrease. Execution time increases compared to DecisionTree, but is still acceptable.
Low number of estimators with high depth and many features --> very complex model --> overfitting
Reducing max features or depth and increasing the number of estimators --> simpler model --> test score improves
Data does not need to be scaled

GradientBoosting also overfits when using many trees (high n_estimators) since the algorithm learns too much about the training data. Overfitting is reduced when n_estimators and learning_rate decrease, since each tree corrects less of the previous trees' mistakes and there are fewer trees, so the model is less focused on the training data and the testing set accuracy improves. Execution time increases compared to DecisionTree, but is still acceptable.
High number of estimators with high learning rate --> very complex model --> overfitting
Reducing the number of estimators or the learning rate --> simpler model --> less likely to overfit
Data does not need to be scaled

MLP execution time is sensitive to the parametrization. Increasing the number of hidden layers strongly affects execution time. For example, 100 hidden nodes split into two layers of 50 each is 10x slower than a single layer of 100 nodes. The parameter alpha also affects the model complexity and, in turn, the likelihood to overfit.
More hidden layers and low alpha --> very complex model --> overfitting and very slow execution
Results are highly sensitive to scaling

SVM runs very slowly on this particular dataset since the computational workload (samples for around 70k patients) is beyond the practical limit for this algorithm. Execution time is even higher for unscaled data. The results achieved on this dataset after the long execution look inferior to the other methods.
Higher C and gamma --> lower regularization --> more likely to overfit
Lower C and gamma --> higher regularization --> less likely to overfit
Execution time is very high and even higher for unscaled data
Algorithm not suited for large datasets
Results are highly sensitive to scaling
In the current project, the original dataset was scrubbed and visualized, and then the most common supervised machine learning algorithms were introduced and applied, interpreting the results. Future work should focus on finding the optimal parameter and model combination that maximizes the testing set score.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero