XGBoost Explained

As the XGBoost documentation describes it:

"XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable."

But what is it, actually?

In layman's terms, it's a machine learning algorithm that builds a collection of decision trees and combines their results to make a final prediction.

To avoid confusion: at its core, boosting is not a specific machine learning algorithm but rather a general concept that can be applied to a range of machine learning algorithms, and XGBoost is one highly optimized implementation of it.

XGBoost is popular because it is both highly accurate and fast. It is also versatile: it can be used for both regression and classification tasks and can handle a variety of data types.
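As a quick illustration, both task types are available through XGBoost's scikit-learn-style wrappers, XGBRegressor and XGBClassifier. The data below is synthetic and purely illustrative:

```python
import numpy as np
from xgboost import XGBRegressor, XGBClassifier

X = np.random.rand(200, 4)

# Regression: predict a continuous target
reg = XGBRegressor(n_estimators=50).fit(X, np.random.rand(200))

# Classification: predict a binary label
clf = XGBClassifier(n_estimators=50).fit(X, np.random.randint(0, 2, 200))
```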

How does XGBoost achieve its high speed and efficiency?

Internally, XGBoost uses its own optimized data structure for datasets, called DMatrix, which is a key reason for its speed and efficiency.

DMatrix can be thought of as a matrix of data that has been preprocessed and optimized for use with the XGBoost algorithm. It handles different types of data, including dense and sparse matrices, and can be created from a variety of sources, such as NumPy arrays, pandas DataFrames, or external data files.
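For example, a DMatrix can be built from several common sources; the data here is synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from scipy.sparse import csr_matrix

X = np.random.rand(100, 5)
y = np.random.rand(100)

# From a NumPy array
dtrain = xgb.DMatrix(X, label=y)

# From a pandas DataFrame
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
dtrain_df = xgb.DMatrix(df, label=y)

# From a SciPy sparse matrix
dtrain_sparse = xgb.DMatrix(csr_matrix(X), label=y)
```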

How is boosting accomplished?

Boosting in XGBoost is accomplished through an iterative process of building weak learners (also known as base learners) and combining them into a strong learner.

The steps involved in boosting are:

  • Iteratively learning a set of weak models, each one trained to correct the errors of the ensemble built so far (often on subsets of the data).

  • Weighting each weak learner's prediction according to its performance (in gradient boosting this is typically a fixed learning rate, also called shrinkage).

  • Combining the weighted predictions to obtain a single strong prediction (see the sketch after this list).
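To make that loop concrete, here is a minimal from-scratch sketch of gradient boosting for squared-error regression, using shallow scikit-learn trees as the weak learners. It illustrates the iterate, weight, and combine pattern above; XGBoost's real implementation adds regularization, second-order gradients, and many further optimizations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=50, learning_rate=0.1):
    """Fit a sequence of shallow trees, each on the current residuals."""
    base = y.mean()                         # start from a constant model
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction          # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # weighted combination
        trees.append(tree)
    return base, trees

def predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)

# Tiny usage example on synthetic data
X = np.random.rand(200, 3)
y = 2 * X[:, 0] + np.random.normal(scale=0.1, size=200)
base, trees = boost(X, y)
print(predict(base, trees, X[:5]))
```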

XGBoost incorporates several key features to make this process more effective and efficient (the sketch after this list shows how each maps to a training parameter). These include:

  • Regularization

  • Tree pruning

  • Parallel processing

  • Gradient optimization
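A minimal sketch of how these features surface as XGBoost training parameters; the values here are illustrative defaults, not tuned recommendations, and the data is synthetic:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 8)
y = np.random.rand(500)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "lambda": 1.0,    # L2 regularization on leaf weights
    "alpha": 0.0,     # L1 regularization on leaf weights
    "gamma": 0.1,     # minimum loss reduction to make a split (tree pruning)
    "max_depth": 6,   # caps tree depth, another form of pruning
    "nthread": 4,     # parallel processing across CPU threads
    "eta": 0.3,       # learning rate used during gradient optimization
    "objective": "reg:squarederror",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```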

When to use XGBoost?

The best use cases for XGBoost involve:

  • Large datasets: XGBoost is specifically designed to handle large datasets, making it a good choice for problems with many features or instances. It is also highly memory-efficient and can be used in distributed computing environments for even larger datasets.

  • Imbalanced datasets: XGBoost can effectively handle imbalanced datasets, where the number of instances in each class is not equal. This makes it useful for problems like fraud detection or disease diagnosis (see the sketch after this list).

  • Mixed feature types: XGBoost works well when you have a mixture of categorical and numeric features, or numeric features alone.
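For instance, class imbalance is commonly handled with XGBoost's scale_pos_weight parameter, typically set to the ratio of negative to positive examples. The dataset below is synthetic and purely illustrative:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positive class

neg, pos = np.bincount(y, minlength=2)
clf = XGBClassifier(scale_pos_weight=neg / pos, n_estimators=100)
clf.fit(X, y)
```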

But there are some cases where XGBoost tends to fail:

  • Image recognition

  • Computer vision

  • Natural language processing and understanding problems

The reason is that XGBoost is a tree-based model, which means it is not well suited to tasks that require an understanding of spatial relationships or sequence information.