Ensemble Techniques in Machine Learning
Ensemble techniques are
powerful methods in machine learning that combine multiple models to produce a
single, superior predictive model. By leveraging the strengths of different
models, ensembles can achieve higher accuracy, better generalization, and
increased robustness compared to individual models. These methods are
particularly useful in dealing with complex datasets and improving the
performance of machine learning models. Ensemble techniques work by training
multiple models and then combining their predictions in various ways. This
approach helps to reduce the risk of overfitting and increases the stability
and reliability of the predictions. Below are some common ensemble techniques
along with brief descriptions of how they work.
Bagging: Uses bootstrapped
datasets to train multiple models and averages their predictions. Random Forest
is a well-known example.
AdaBoost: Builds a sequence of
models, each correcting the errors of the previous ones. The final model is a
weighted sum of these models.
Gradient Boosting: Iteratively builds
models that correct the errors of the previous models by optimizing a loss
function.
XGBoost: An efficient
implementation of Gradient Boosting with regularization to prevent overfitting
and handle missing data.
Stacking: Combines multiple models
via a meta-model trained on the outputs of the base models to improve
performance.
Voting: Aggregates predictions
from multiple models by averaging or majority voting to produce the final
prediction.
Blending: Uses a holdout dataset
to train a meta-model on the predictions of base models, similar to stacking.
In the remainder of this
article, we will delve deeper into these techniques, explaining their
underlying methods. Additionally, we will demonstrate their effectiveness on a
binary classification problem using the well-known Breast Cancer dataset from
the UCI Machine Learning Repository.
Bagging in Machine Learning
Bagging, which stands
for Bootstrap Aggregating, is an ensemble learning method designed to improve
the accuracy and robustness of machine learning models. This technique involves
training multiple models on different subsets of the training data, which are
generated by random sampling with replacement. The predictions from these
models are then combined, typically by averaging (for regression) or voting
(for classification), to produce a final prediction.
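Before turning to scikit-learn's ready-made BaggingClassifier below, the following minimal sketch illustrates the mechanism itself: bootstrap samples are drawn with replacement, one tree is trained per sample, and the predictions are combined by majority vote. The number of models (10) and the random seeds are arbitrary choices for illustration.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative manual bagging: bootstrap sampling with replacement + majority vote
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rng = np.random.default_rng(42)
n_models = 10
votes = np.zeros((X_test.shape[0], n_models))
for m in range(n_models):
    # Draw a bootstrap sample the same size as the training set, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=m).fit(X_train[idx], y_train[idx])
    votes[:, m] = tree.predict(X_test)

# Majority vote across the trees (labels are 0/1 in this dataset)
y_pred_manual = (votes.mean(axis=1) >= 0.5).astype(int)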
Key Benefits of Bagging:
1. Reduces Variance: By
averaging multiple models, bagging helps to smooth out the predictions,
reducing the impact of individual model errors and thereby decreasing variance.
2. Increases Stability:
It makes the model less sensitive to the noise in the training data, leading to
more reliable predictions.
3. Improves Accuracy:
The combination of multiple models often results in better overall performance
compared to any single model.
Example: Applying Bagging on the Breast Cancer Dataset
To illustrate the power
of bagging, let's use the Breast Cancer dataset, which is a well-known dataset
for classification tasks. We'll compare the performance of a single Decision
Tree model with a Bagging Classifier that uses Decision Trees as its base models.
Here is the Python code
to perform this comparison:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Train a single Decision Tree model
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
# Train a Bagging Classifier with Decision Trees
# Note: 'estimator' was named 'base_estimator' in scikit-learn versions before 1.2
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
accuracy_dt, accuracy_bagging
Results:
- Single Decision Tree Model Accuracy: 94.15%
- Bagging Classifier Accuracy: 95.91%
As the results indicate,
the Bagging Classifier achieved a higher accuracy compared to the single
Decision Tree model. This demonstrates the effectiveness of bagging in
enhancing the performance of machine learning models by combining the strengths
of multiple models and mitigating their individual weaknesses.
Bagging is a powerful
technique that can be applied to various machine learning tasks to achieve
better and more stable results. By leveraging the diversity and collective
wisdom of multiple models, bagging typically yields final predictions that are
more accurate and reliable.
Random Forest is an
ensemble learning method that falls under the category of Bagging (Bootstrap
Aggregating) ensembles. It constructs multiple decision trees during training
and combines their predictions to improve accuracy and control overfitting. Each
tree in the forest is trained on a random subset of the data with replacement
(bootstrap sampling) and considers a random subset of features when splitting
nodes. The final prediction is made by aggregating the predictions of all the
trees, typically using majority voting for classification tasks and averaging
for regression tasks. This method leverages the strengths of bagging to create
a robust and reliable predictive model.
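Although the comparison tables later in this article include Random Forest, the code above only uses a generic BaggingClassifier. For completeness, here is a minimal sketch of training a Random Forest on the same Breast Cancer split; it assumes the X_train, X_test, y_train, y_test variables from the bagging example above, and the choice of 100 trees is arbitrary.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Random Forest: bagged decision trees that also consider a random subset of
# features at each split (assumes the train/test split defined earlier)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)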
Boosting in Machine Learning
Boosting is an ensemble
technique that combines multiple weak learners to form a strong learner. The
primary idea is to train models sequentially, each trying to correct the errors
of its predecessor. This iterative process focuses on the difficult cases that
previous models failed to predict correctly.
Key Boosting Techniques:
1. AdaBoost (Adaptive Boosting):
- Mechanism: AdaBoost
assigns weights to each instance in the dataset. Initially, all instances have
equal weights. In each iteration, the model focuses more on the instances that
were misclassified by the previous model, adjusting the weights accordingly.
- Prediction: The final
prediction is a weighted vote of the predictions from all the models.
2. Gradient Boosting Machines (GBM):
- Mechanism: GBM builds models sequentially, where each new model tries to minimize the residual errors made by the previous models (a minimal residual-fitting sketch follows this list).
- Optimization: It uses gradient descent, a method that finds the minimum of a function by iteratively moving in the direction of steepest descent, to optimize the loss function, which measures how well the model's predictions match the actual outcomes.
3. XGBoost (Extreme Gradient Boosting):
- Mechanism: XGBoost is an optimized implementation of gradient boosting designed for speed and performance.
- Regularization: It includes regularization techniques that penalize model complexity to prevent overfitting (where the model performs well on training data but poorly on unseen data), making it more robust and scalable to larger datasets and more complex models.
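To make the residual-fitting idea behind GBM concrete, here is a minimal from-scratch sketch for a toy regression problem with squared-error loss, where the negative gradient is simply the residual. This is only an illustration of the mechanism, not the exact algorithm used by GradientBoostingClassifier (which works with log-loss gradients for classification); the toy data, learning rate, tree depth, and number of rounds are arbitrary choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only)
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full(y_toy.shape, y_toy.mean())  # start from a constant prediction
trees = []
for _ in range(50):
    residuals = y_toy - prediction               # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_toy, residuals)                   # each new tree models what is still unexplained
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

print("Training MSE after boosting:", np.mean((y_toy - prediction) ** 2))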
Base Classifier
In these boosting
techniques, the base classifier is typically a decision tree. Specifically,
shallow decision trees, also known as decision stumps (trees with a depth of
one) or trees with limited depth, are used. These weak learners are essential
for the boosting process to be effective. While decision trees are the most
commonly used base classifiers, other classifiers such as support vector
machines, linear models, and neural networks can also be employed, but their
use is less common and often more complex to implement.
Comparison of Base Classifier and Boosting Techniques
1. Unrestricted Decision Tree Example
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Train a Decision Tree model without depth limit
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
accuracy_dt
2. Decision Stump Example (Base Classifier for Boosting)
# Train a Decision Tree model with max_depth=1
dt_stump = DecisionTreeClassifier(max_depth=1, random_state=42)
dt_stump.fit(X_train, y_train)
y_pred_dt_stump = dt_stump.predict(X_test)
accuracy_dt_stump = accuracy_score(y_test, y_pred_dt_stump)
accuracy_dt_stump
3. AdaBoost Example
from sklearn.ensemble import AdaBoostClassifier
# Train an AdaBoost Classifier with Decision Stumps
# Note: 'estimator' was named 'base_estimator' in scikit-learn versions before 1.2
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
accuracy_ada = accuracy_score(y_test, y_pred_ada)
accuracy_ada
4. Gradient Boosting Machines (GBM) Example
from sklearn.ensemble import GradientBoostingClassifier
# Train a Gradient Boosting Classifier with shallow Decision Trees
gbm = GradientBoostingClassifier(n_estimators=50, max_depth=3,
random_state=42)
gbm.fit(X_train, y_train)
y_pred_gbm = gbm.predict(X_test)
accuracy_gbm = accuracy_score(y_test, y_pred_gbm)
accuracy_gbm
5. XGBoost Example
import xgboost as xgb
# Train an XGBoost Classifier with shallow Decision Trees
xgb_model = xgb.XGBClassifier(n_estimators=50, max_depth=3, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
accuracy_xgb
Results:
- Unrestricted Decision Tree Accuracy: 94.15%
- Base Classifier (Decision Stump) Accuracy: 61.11%
- AdaBoost Classifier Accuracy: 97.66%
- Gradient Boosting Classifier Accuracy: 95.91%
- XGBoost Classifier Accuracy: 95.91%
These results demonstrate that the boosting techniques (AdaBoost, Gradient
Boosting, and XGBoost) achieve higher accuracy than the single unrestricted
Decision Tree, and far higher accuracy than the individual decision stump they
build on. AdaBoost performed best in this experiment, showing its effectiveness
in enhancing model performance. This highlights the power of boosting methods
in handling complex datasets and improving predictive accuracy.
Stacking (Stacked Generalization) in Machine Learning
Stacking, also known as
stacked generalization, is an ensemble technique that combines multiple machine
learning models to create a stronger predictive model. It works by training
several base models (also called level-0 models) and then combining their
predictions using a meta-model (also called level-1 model). The meta-model
learns how to best combine the base models' predictions to improve overall
performance.
Key Points:
1. Base Models
(Level-0): These are the initial models that make predictions on the
dataset. They can be any machine learning algorithms, such as decision trees,
logistic regression, or neural networks. Multiple base models are used to
capture different patterns in the data.
2. Meta-Model
(Level-1): This model takes the predictions of the base models as input and
learns how to combine them to produce the final prediction. The meta-model is
usually a simple model like linear regression or logistic regression, but more
complex models can also be used.
Implementation on the Breast Cancer Dataset
Dataset Information
The Breast Cancer
dataset is a binary classification problem with features that are more complex
and less likely to be perfectly fit by a single decision tree, making it
suitable for demonstrating the power of stacking.
1. Base Models and Unrestricted
Decision Tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Train a Decision Tree model without depth limit
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
# Train a Support Vector Machine model
svm = SVC(probability=True, random_state=42)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
# Train a K-Neighbors Classifier model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
# Train a Gaussian Naive Bayes model
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
accuracy_nb = accuracy_score(y_test, y_pred_nb)
accuracy_dt, accuracy_svm, accuracy_knn, accuracy_nb
2. Stacking Example
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
# Define base models
base_models = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier()),
    ('nb', GaussianNB())
]
# Define meta-model
meta_model = LogisticRegression()
# Train a Stacking Classifier
stacking = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)
stacking.fit(X_train, y_train)
y_pred_stacking = stacking.predict(X_test)
accuracy_stacking = accuracy_score(y_test, y_pred_stacking)
accuracy_stacking
Results:
- Decision Tree Accuracy: 94.15%
- Support Vector Machine Accuracy: 93.57%
- K-Neighbors Classifier Accuracy: 95.91%
- Gaussian Naive Bayes Accuracy: 94.15%
- Stacking Model Accuracy: 97.08%
These results
demonstrate that the stacking model achieved a higher accuracy compared to each
individual base model on the Breast Cancer dataset. This highlights the
effectiveness of stacking in combining the strengths of multiple base models to
improve predictive performance.
Voting Ensembles in Machine Learning
Voting ensembles combine
the predictions of multiple models and make a final prediction based on a
majority vote (for classification) or average (for regression). There are two
main types of voting:
1. Hard Voting
(Majority Voting): Each model in the ensemble makes a prediction (vote),
and the final prediction is the one that gets the majority of the votes.
2. Soft Voting
(Weighted Voting): Each model in the ensemble outputs a probability for
each class, and the final prediction is made by averaging these probabilities
(optionally weighted by model performance).
We provide an example of
using both hard and soft voting strategies with the following classifiers as
base models: Decision Tree, Support Vector Machine, K-Nearest Neighbors, and
Gaussian Naive Bayes.
1. Base Models and Unrestricted Decision Tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Train a Decision Tree model without depth limit
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
# Train a Support Vector Machine model
svm = SVC(probability=True, random_state=42)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
# Train a K-Neighbors Classifier model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
# Train a Gaussian Naive Bayes model
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
accuracy_nb = accuracy_score(y_test, y_pred_nb)
accuracy_dt, accuracy_svm, accuracy_knn, accuracy_nb
2. Voting Ensemble Example
from sklearn.ensemble import VotingClassifier
# Define base models
base_models = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier()),
    ('nb', GaussianNB())
]
# Train a Voting Classifier (Hard Voting)
voting_hard = VotingClassifier(estimators=base_models, voting='hard')
voting_hard.fit(X_train, y_train)
y_pred_voting_hard = voting_hard.predict(X_test)
accuracy_voting_hard = accuracy_score(y_test, y_pred_voting_hard)
# Train a Voting Classifier (Soft Voting)
voting_soft = VotingClassifier(estimators=base_models, voting='soft')
voting_soft.fit(X_train, y_train)
y_pred_voting_soft = voting_soft.predict(X_test)
accuracy_voting_soft = accuracy_score(y_test, y_pred_voting_soft)
accuracy_voting_hard, accuracy_voting_soft
Results:
- Decision Tree Accuracy: 94.15%
- Support Vector Machine Accuracy: 93.57%
- K-Neighbors Classifier Accuracy: 95.91%
- Gaussian Naive Bayes Accuracy: 94.15%
- Voting Classifier (Hard Voting) Accuracy: 98.25%
- Voting Classifier (Soft Voting) Accuracy: 98.25%
These results
demonstrate that both the hard voting and soft voting classifiers achieved
higher accuracy compared to each individual base model on the Breast Cancer
dataset. This highlights the effectiveness of voting ensembles in combining the
strengths of multiple models to improve predictive performance.
Blending in Machine Learning
Blending is an ensemble
technique that combines the predictions of multiple base models. The base
models are trained on a training dataset, and their predictions are used as
features to train a meta-model. The main difference between blending and
stacking is that in blending, the meta-model is trained on a separate holdout
set, not on the entire training set through cross-validation.
Key Points:
1. Base Models:
Multiple base models are trained on the training dataset.
2. Holdout Set: A
portion of the training data is set aside as a holdout set.
3. Meta-Model:
The meta-model is trained on the predictions of the base models on the holdout
set.
1. Base Models and Unrestricted Decision Tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load the Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
# Split the dataset into training, validation, and testing sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3,
random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp,
test_size=0.5, random_state=42)
# Train a Decision Tree model without depth limit
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
# Train a Support Vector Machine model
svm = SVC(probability=True, random_state=42)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
accuracy_svm = accuracy_score(y_test, y_pred_svm)
# Train a K-Neighbors Classifier model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
# Train a Gaussian Naive Bayes model
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
accuracy_nb = accuracy_score(y_test, y_pred_nb)
accuracy_dt, accuracy_svm, accuracy_knn, accuracy_nb
2. Blending Example with Logistic Regression Meta-Model
import numpy as np
from sklearn.linear_model import LogisticRegression
# Generate predictions on the validation set using the base models
val_preds = np.zeros((X_val.shape[0], 4))
base_models = [
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('knn', KNeighborsClassifier()),
    ('nb', GaussianNB())
]
for i, (name, model) in enumerate(base_models):
    model.fit(X_train, y_train)
    val_preds[:, i] = model.predict(X_val)
# Train a Logistic Regression meta-model on the base models' predictions for the validation set
meta_model = LogisticRegression()
meta_model.fit(val_preds, y_val)
# Generate predictions on the testing set using the base models
test_preds = np.zeros((X_test.shape[0], 4))
for i, (name, model) in enumerate(base_models):
    test_preds[:, i] = model.predict(X_test)
# Evaluate the blended model on the testing set
y_pred_blend = meta_model.predict(test_preds)
accuracy_blend_adjusted = accuracy_score(y_test, y_pred_blend)
accuracy_blend_adjusted
Results:
- Decision Tree Accuracy: 94.15%
- Support Vector Machine Accuracy: 93.57%
- K-Neighbors Classifier Accuracy: 95.91%
- Gaussian Naive Bayes Accuracy: 94.15%
- Blended Model Accuracy: 97.67%
These results demonstrate that the blended model, which trains a Logistic
Regression meta-model on a separate holdout (validation) set, achieved a higher
accuracy than the individual base models. This highlights the effectiveness of
blending in combining the strengths of multiple models to improve predictive
performance.
Comparison of Ensemble Techniques
The table below is based on general observations and experience with these
ensemble methods. It provides a qualitative comparison rather than quantitative
results derived from any particular dataset or implementation.
Method            | Accuracy | Robustness | Computational Complexity | Ease of Implementation
Random Forest     | High     | High       | Moderate                 | Easy
AdaBoost          | Moderate | Moderate   | Moderate                 | Easy
Gradient Boosting | High     | High       | High                     | Moderate
XGBoost           | High     | High       | High                     | Moderate
Stacking          | High     | High       | High                     | Moderate
Voting            | Moderate | High       | Low                      | Easy
Blending          | High     | High       | High                     | Moderate
- Accuracy: Gradient Boosting and XGBoost typically achieve the highest accuracy.
- Robustness: Random Forest, Stacking, and Blending are generally robust to overfitting.
- Computational Complexity: XGBoost and Gradient Boosting are computationally intensive; Random Forest and Voting are less so.
- Ease of Implementation: Voting and Bagging are the easiest to implement; Stacking and Blending are more complex due to the need for meta-models.
This comparative
analysis helps identify the most suitable ensemble method based on the specific
requirements of your project.
To provide a more detailed comparison, we present results using the Breast
Cancer dataset. Because this dataset is clean and relatively easy to classify,
all models performed very well on the original data, with Random Forest and
Blending achieving the highest accuracy. However, when we introduced noise into
the dataset to make the task harder, AdaBoost emerged as the top performer.
This highlights an important point: no single classifier is universally
superior in all scenarios. It is therefore prudent to experiment with several
techniques before selecting the final classifier for a specific application. We
also report typical execution times measured on Google Colab; these times can
vary with the software and hardware platform and the specific libraries used.
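The exact noise-injection procedure used for the "Noisy Data" column is not shown here; one common approach, assumed below purely for illustration, is to add Gaussian noise scaled to each feature's standard deviation before repeating the experiments.
import numpy as np
from sklearn.datasets import load_breast_cancer

# Hypothetical noise injection (the procedure behind the table is not specified)
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(42)
X_noisy = X + rng.normal(loc=0.0, scale=0.5 * X.std(axis=0), size=X.shape)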
Method            | Accuracy (Original Data) | Accuracy (Noisy Data) | Execution Time (Seconds)
Random Forest     | 97.67%                   | 95.32%                | 0.172581
AdaBoost          | 95.61%                   | 96.49%                | 0.240852
Gradient Boosting | 96.49%                   | 94.74%                | 0.347579
XGBoost           | 96.78%                   | 95.91%                | 0.137494
Stacking          | 97.08%                   | 95.32%                | 4.486909
Voting            | 96.78%                   | 95.32%                | 8.325362
Blending          | 97.67%                   | 95.32%                | 14.68894