1 Introduction
Boosting algorithms adaptively learn a series of weak learners that together often yield state-of-the-art predictive accuracy, but they are computationally slow. Huge datasets with many instances and features (e.g., data recorded from sensors, texts, and images) highlight this "slow learning" behavior. For example, AdaBoost [freund1997decision] and gradient boosting [friedman2001greedy] suffer from slow training because they aim for strong performance while neglecting the size of the data. Further, most boosting methods are "black-box", meaning that their interpretation [lipton2018mythos] lacks transparency and simplicity. Identifying the features and examples with the most impact helps to obtain a more interpretable model [biau2016random, benard2020interpretable]. In this paper, our goal is to develop an AdaBoost-based algorithm that is computationally faster to train and also yields interpretable solutions.
We propose to achieve this by adaptively sampling tiny subsets of both instances and features simultaneously, something we refer to as minipatch learning (Fig. 1). Subsampling is widely used in machine learning and has been shown to have both computational and predictive advantages. For instance, bagging [breiman1996bagging] uses the bootstrap technique [Efron1993AnIT] to reduce the dependency of weak learners on the training data. Random forest [breiman2001random], as a specific type of bagging, additionally reduces the number of effective features in each weak learner. The computational advantages of ensembles of uniform random minipatches have been investigated in [louppe2012ensembles]. There is preliminary evidence that uniform minipatch selection yields implicit regularization [srivastava2014dropout, lejeune2019implicit]; however, dropout is not precisely what we refer to as minipatch learning. Rather, the combination of minibatch selection (i.e., stochastic optimization) and dropout in the first layer falls into the category of minipatch learning algorithms. Minipatch sampling not only accelerates ensemble algorithms, but also implicitly regularizes them, i.e., reduces the prediction's sensitivity to individual instances or features.
Subsampling can speed up iterative algorithms; moreover, uniform sampling can be improved with an adaptive procedure, since the importance and difficulty of observations and features vary. For example, in classification problems, observations closer to the decision boundary play a critical role in the final model [cortes1995support]. Adaptive sampling can also be used to interpret the final results. For instance, the SIRUS algorithm [benard2019sirus, benard2020interpretable] leverages the frequency of splits in random forest trees to generate an interpretable model with fewer splits. Our MPBoost algorithm incorporates the advantages of adaptivity in order to learn distributions over the observations and features.
More closely related to our work, several methods have been proposed that employ sampling schemes on top of AdaBoost. Works along this line usually subsample either the features or the observations adaptively. Exploiting bandit algorithms, adaptive feature selection methods [busa2009bandit, busa2010fast] have been suggested to reduce the number of features accessed by each weak learner. Similarly, Tasting [dubout2011tasting] and Laminating [dubout2011boosting, dubout2014adaptive] propose score-based feature selection in each iteration of AdaBoost. These methods subsample features adaptively in each iteration but either use all instances to train weak learners or subsample them uniformly.
On the other hand, algorithms like MadaBoost [domingo2000madaboost] and FilterBoost [bradley2008filterboost] use adaptive oracle subsampling of the observations. However, they do not subsample features, and the size of the selected sample grows with each iteration, so they become slower as training progresses. In contrast, our proposed algorithm exploits minipatch selection and thus takes the importance of both observations and features into account.
A series of algorithms have been proposed to reduce the computational complexity of gradient boosting. For example, stochastic gradient boosting (SGB) [friedman2002stochastic] subsamples the observations randomly. XGBoost [chen2016xgboost] extensively optimizes the computational complexity of gradient boosting by introducing a novel decision-tree structure, a fast training algorithm, and several other modifications, such as feature subsampling through pre-sorting and histogram-based algorithms.
In this regard, LightGBM [ke2017lightgbm] and CatBoost [dorogush2018catboost] subsample observations adaptively in proportion to their gradient values and use an algorithm called exclusive feature bundling (EFB) to reduce the number of effective features by categorizing them. In contrast, MPBoost is designed to learn a probability distribution over the features gradually as it progresses, rather than categorizing them up front. The minimal variance sampling (MVS) algorithm [ibragimov2019minimal] selects observations according to their gradient values with provably minimal variance; however, it does not subsample the features. Note that MPBoost is based on AdaBoost; thus, its adaptive observation selection is entirely different from that of algorithms based on gradient boosting.

The remainder of the paper is organized as follows. In Section 2, after stating the problem formulation, we present MPBoost. In Section 3, we investigate the efficacy of the adaptive subsampling and the interpretability of the proposed algorithm, and compare the generalization accuracy of MPBoost with that of AdaBoost, gradient boosting, and random forest. We end with discussion and concluding remarks in Section 4.
2 MPBoost
Our goal is to develop a boosting algorithm utilizing adaptive sampling of features and observations that enhances both scalability and interpretability.
2.1 MPBoost Algorithm
We begin by focusing on binary classification tasks. Let our data be $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ with $N$ observations or instances and $M$ features; for each instance $\mathbf{x}_i \in \mathbb{R}^{M}$, we observe a label $y_i \in \{-1, +1\}$. We seek to learn a classifier $F: \mathbb{R}^{M} \to \{-1, +1\}$.

To achieve this, we propose an adaptive-sampling-based version of boosting inspired by AdaBoost. Our method relies on learning a weak learner from a tiny subset of observations and features at each iteration. We call this tiny subset a minipatch, a term based on the use of "patches" in image processing and on minibatches, the small subsamples of observations commonly used in machine learning. Our approach is to take an ensemble of minipatches, or minipatch learning as shown in Figure 1, where each minipatch is sampled adaptively. Formally, we define a minipatch by $(\mathcal{I}, \mathcal{F})$, where $\mathcal{I} \subseteq \{1, \dots, N\}$ is a subset of observations of size $n \ll N$, and $\mathcal{F} \subseteq \{1, \dots, M\}$ is a subset of features of size $m \ll M$. By learning weak learners on tiny subsets or minipatches, our boosting algorithm has major computational advantages for large-$N$ and/or large-$M$ datasets.
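To make the minipatch definition concrete, the following sketch extracts a minipatch of n observations and m features from a data matrix. The helper name is hypothetical, and a uniform draw is used here purely for illustration; MPBoost itself samples adaptively.

```python
import numpy as np

def sample_minipatch(X, y, n, m, rng):
    """Extract a minipatch: a random subset of n observations and m features.

    Hypothetical helper; MPBoost samples adaptively, but a uniform draw
    suffices to illustrate the (I, F) definition.
    """
    N, M = X.shape
    I = rng.choice(N, size=n, replace=False)  # observation subset
    F = rng.choice(M, size=m, replace=False)  # feature subset
    return X[np.ix_(I, F)], y[I], I, F

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = rng.integers(0, 2, size=1000) * 2 - 1    # labels in {-1, +1}
X_mp, y_mp, I, F = sample_minipatch(X, y, n=100, m=5, rng=rng)
```

Each weak learner is then trained only on the small `n x m` submatrix, which is the source of the per-iteration computational savings.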
We define $\mathcal{H}$ to be the class of weak learners [mohri2018foundations], where each $h \in \mathcal{H}$ is a function $h: \mathbb{R}^{m} \to \{-1, +1\}$. Our algorithm is generic to the type of weak learner, which can be either simple or expressive. However, we select decision trees [breiman1984classification] as the default weak learner in MPBoost. We consider both fixed-depth trees and saturated trees that are split until each terminal leaf consists of samples from the same class.
The core of our algorithm uses adaptive sampling of observations to achieve the adaptive slow-learning properties [wyner2017explaining] of the AdaBoost algorithm. Similar to [ibragimov2019minimal] for gradient boosting, MPBoost subsamples observations according to an adaptive probability distribution. Let $p$ be the probability distribution over the observations (i.e., $\sum_{i=1}^{N} p_i = 1$), initially set to be uniform ($p_i = 1/N$). We define $\mathcal{I} \sim p$ as sampling a subset of $\{1, \dots, N\}$ of size $n$ according to the probability distribution $p$ without replacement.
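Sampling without replacement according to an adaptive observation distribution can be sketched directly with NumPy; the perturbation of the probabilities below is purely illustrative, standing in for the adaptive reweighting described next.

```python
import numpy as np

# Adaptive observation sampling: draw a subset of n indices without
# replacement according to the current probability distribution p.
rng = np.random.default_rng(1)
N, n = 1000, 100
p = np.full(N, 1.0 / N)   # initially uniform, p_i = 1/N
p[:10] *= 5.0             # illustrative: upweight ten "difficult" samples
p /= p.sum()              # renormalize so the probabilities sum to one
I = rng.choice(N, size=n, replace=False, p=p)
```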
Let $F$ be the ensemble function. Our algorithm selects a minipatch, trains a weak learner on it, and accumulates the sum of the weak learners, $F(\mathbf{x}) = \sum_{t} h_t(\mathbf{x})$. Misclassified samples are more difficult to learn, so we need to increase their probabilities so that they are sampled more frequently. Let $w$ be a function that measures the similarity between the ensemble outputs and the labels, i.e., a positive margin $y_i F(\mathbf{x}_i)$ yields a smaller $w$ and vice versa. MPBoost assigns a probability proportional to $w(y_i F(\mathbf{x}_i))$ to the $i$-th observation. Table I shows choices for the function $w$ inspired by the weighting functions in AdaBoost and LogitBoost [friedman2000additive]. Note that the "Soft" functions are sensitive to the ensemble output value, while the "Hard" functions only depend on its sign.
Full-batch boosting algorithms reweight all of the observations and train a new weak learner on the weighted data in each iteration [freund1997decision, friedman2000additive, friedman2001greedy]. In contrast, stochastic algorithms use each sample's sampling frequency to account for the effect of its weight [domingo2000madaboost, bradley2008filterboost, ibragimov2019minimal]. We update the probabilities of the observations similarly to FilterBoost [bradley2008filterboost]. However, FilterBoost increases its sample size as it progresses and uses negative sampling, and is thus slower than our approach.
TABLE I: Choices for the weighting function $w$ as a function of the margin $y F(\mathbf{x})$.

       Exponential                                      Logistic
Soft   $e^{-y F(\mathbf{x})}$                           $\left(1 + e^{y F(\mathbf{x})}\right)^{-1}$
Hard   $e^{-y\,\operatorname{sign}(F(\mathbf{x}))}$     $\left(1 + e^{y\,\operatorname{sign}(F(\mathbf{x}))}\right)^{-1}$
The logistic weights are bounded, which makes MPBoost less sensitive to outlier samples.
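The weighting choices of Table I can be sketched as follows. The exact formulas are our reconstruction based on the standard AdaBoost (exponential) and LogitBoost (logistic) weights, applied to either the raw ensemble margin ("Soft") or its sign ("Hard"):

```python
import numpy as np

# Sketch of the Table I weighting choices. Each function maps the margin
# y * F(x) of an observation to an unnormalized sampling weight; the "Hard"
# variants depend only on the sign of the ensemble output.
def w_soft_exponential(margin):
    return np.exp(-margin)

def w_soft_logistic(margin):
    return 1.0 / (1.0 + np.exp(margin))

def w_hard_exponential(margin):
    return np.exp(-np.sign(margin))

def w_hard_logistic(margin):
    return 1.0 / (1.0 + np.exp(np.sign(margin)))
```

Note that the logistic weights are bounded above by one, whereas the exponential weight grows without bound for negative margins.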
Another major goal is to increase the interpretability of boosting approaches. To accomplish this, we also propose to adaptively select features that are effective for learning. Similar to the observation distribution, let $q$ be the probability distribution over the features. Our algorithm requires a criterion to compute the importance of the selected features based on the structure of the weak learner. There exist several choices for computing feature importance. Some of these inspection techniques are model-agnostic, and hence allow MPBoost to incorporate weak learners from different classes. For example, the permutation importance method [breiman2001random, altmann2010permutation] infers each feature's importance by shuffling it and measuring the resulting change in the prediction score.
Nevertheless, specific metrics like the impurity reduction score [nembrini2018revival] are defined for decision trees. We utilize this quantity to define our probability distribution over the features. Let $\phi$ denote the normalized feature importance vector for a weak learner $h$, wherein each entry determines the relative importance of the corresponding feature compared to the other features in the minipatch. In each iteration, the feature probability distribution is updated by computing a weighted average of its current value and $\phi$ according to a momentum. The momentum hyperparameter $\gamma$ determines the ratio of exploration vs. exploitation. MPBoost only modifies the probabilities of the features inside the minipatch in each iteration.

Finally, many boosting algorithms are designed to run for a fixed number of iterations [friedman2001greedy] or use a validation criterion [domingo2000madaboost, bradley2008filterboost, Meijer2016RegularizingAW] to determine when to stop. Internal validation approaches often have better performance and are computationally much faster. For instance, the out-of-bag criterion [breiman1996out] in bagging and random forest provides internal validation without incurring any additional computational cost. Similar to bagging, MPBoost has access to out-of-patch instances, which we can use for internal validation. Therefore, for each sample $\mathbf{x}_i$, we accumulate the outputs of the weak learners that do not have it in their minipatch. Thus, we define the out-of-patch output to be the function:
$$F^{\mathrm{oop}}(\mathbf{x}_i) = \sum_{t \,:\, i \notin \mathcal{I}_t} h_t(\mathbf{x}_i) \qquad (1)$$

for an arbitrary training sample $\mathbf{x}_i$, where $\mathcal{I}_t$ denotes the observation subset of the $t$-th minipatch. Accordingly, the out-of-patch accuracy, i.e., the fraction of samples with $\operatorname{sign}(F^{\mathrm{oop}}(\mathbf{x}_i)) = y_i$, can easily be quantified.
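A minimal sketch of the out-of-patch computation, assuming we store each weak learner's predictions together with a boolean mask recording which observations were in its minipatch (array names are hypothetical):

```python
import numpy as np

def out_of_patch_accuracy(weak_preds, in_patch, y):
    """Out-of-patch accuracy from stored weak-learner outputs.

    weak_preds: (T, N) outputs of each of T weak learners on all N samples.
    in_patch:   (T, N) boolean mask; True if sample i was in minipatch t.
    y:          (N,) labels in {-1, +1}.
    """
    # Zero out the in-patch contributions, then sum over weak learners.
    oop_sum = np.where(in_patch, 0.0, weak_preds).sum(axis=0)
    return float(np.mean(np.sign(oop_sum) == y))
```

Because each observation lies outside most minipatches (since n is much smaller than N), nearly every weak learner contributes a held-out vote for it, at no extra training cost.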
Out-of-patch accuracy is a conservative estimate of the test accuracy. Hence it can help MPBoost track the progress of the generalization (test) performance internally and decide at which iteration to stop. In a nutshell, the algorithm observes the out-of-patch accuracy and detects when it has saturated. Algorithm 2 (in Appendix B) is a heuristic that takes the out-of-patch accuracy, compares it with its previous values, and decides when it has become saturated. The stopping rule is simple: if the current out-of-patch accuracy increases by some margin, the algorithm needs more time to improve; otherwise, the generalization performance has saturated.

We put all of this together in a summary of our MPBoost algorithm in Algorithm 1. Notice that selecting minipatches (step 1) reduces the computational complexity per iteration and thus improves scalability. Other variations of AdaBoost usually subsample either features or observations, while ours exploits both. Therefore, in addition to the predictive model, our algorithm learns probability distributions over the observations and features that express their respective importance. Since learning these distributions is part of the iterative procedure, it incurs no extra computational cost. In addition, MPBoost exploits internal validation, which yields an automatic stopping criterion when the algorithm ceases to learn. Hence, steps (4) and (5) of Algorithm 1 highlight the main differences between MPBoost and other sampling-based boosting algorithms.
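Putting the pieces together, a rough sketch of the MPBoost loop might look as follows. This is not the authors' implementation: the soft-logistic weighting, the momentum-style feature update, and all variable names are assumptions, and the stopping criterion (Algorithm 2) is omitted for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def mpboost(X, y, n, m, gamma=0.5, T=100, seed=0):
    """Rough sketch of MPBoost training; y has labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    p = np.full(N, 1.0 / N)            # observation probabilities
    q = np.full(M, 1.0 / M)            # feature probabilities
    F_train = np.zeros(N)              # running ensemble output
    learners = []
    for _ in range(T):
        # Step 1: adaptively sample a minipatch (I, Fs).
        I = rng.choice(N, size=n, replace=False, p=p)
        Fs = rng.choice(M, size=m, replace=False, p=q)
        # Step 2: train a weak learner on the minipatch.
        h = DecisionTreeClassifier(max_depth=3, random_state=0)
        h.fit(X[np.ix_(I, Fs)], y[I])
        learners.append((h, Fs))
        # Step 3: accumulate the ensemble output on all observations.
        F_train += h.predict(X[:, Fs])
        # Step 4: reweight observations (soft-logistic weighting, assumed).
        w = 1.0 / (1.0 + np.exp(y * F_train))
        p = w / w.sum()
        # Step 5: momentum update of feature probabilities inside the patch.
        phi = h.feature_importances_   # impurity-based importances
        q[Fs] = (1.0 - gamma) * q[Fs] + gamma * phi * q[Fs].sum()
        q /= q.sum()
    return learners

def mpboost_predict(learners, X):
    F = np.zeros(X.shape[0])
    for h, Fs in learners:
        F += h.predict(X[:, Fs])
    return np.sign(F)
```

Each iteration touches only an `n x m` submatrix in step 2, which is where the per-iteration speedup over full-batch AdaBoost comes from.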
2.2 Hyperparameter Tuning
The minipatch size is a crucial hyperparameter of our algorithm. A large $n$ or $m$ slows down the weak learners' training but is more likely to yield better performance. Note that $m$ must be large enough to provide a wide range of features for comparison so that the feature probabilities are updated properly. Similarly, $n$ must be large enough that each minipatch represents a meaningful subset of the observations. On the other hand, a small $n$ results in an out-of-patch accuracy akin to the generalization accuracy. Selecting around ten percent of the observations and features seems to be a proper choice for our algorithm, as evidenced in our empirical studies. Additionally, our studies reveal that the results are fairly robust to small changes in $n$ and $m$. The momentum $\gamma$ is the other important hyperparameter, where a moderate value strikes a balance between exploration and exploitation. While our default hyperparameter settings seem to perform well and are robust in a variety of settings (see Section 3), one could always select these in a data-driven manner using our out-of-patch accuracy criterion.
2.3 Extensions
Our proposed algorithm was developed for binary classification. Here, we discuss its extension to regression and multiclass classification problems. First, for multiclass classification, algorithms like AdaBoost.M2, AdaBoost.MH, and AdaBoost.OC [schapire1997using, chengsheng2017adaboost] have been proposed as multiclass extensions of AdaBoost; we can incorporate similar modifications to extend MPBoost. Further, [drucker1997improving, solomatine2004adaboost] have suggested regression extensions of vanilla AdaBoost, such as AdaBoost.R2 and AdaBoost.RT. For regression, we would have to change our observation probability function and the out-of-patch accuracy; to this end, we can employ the techniques in [breiman1996out]. All of these are possible extensions of our approach.
3 Experiments
We begin with an illustrative case study to show how our method works and how it aids interpretability. Next, we compare our algorithm to other popular boosting and tree-based methods, focusing on accuracy and scalability.
(a) Effect of MPBoost adaptive sampling on MNIST (3 vs. 8) test data. (b) Train, test, and out-of-patch accuracy of MNIST (3 vs. 8) using MPBoost. The dashed line shows the stopping time based on Algorithm 2.

3.1 Illustrative Case Study
We use a series of experiments to demonstrate how our algorithm works and show how to interpret its results. Specifically, we ask: How does adaptive sampling boost performance? How does the out-of-patch accuracy relate to the test accuracy and yield a data-driven stopping criterion? How do we interpret the results via the final observation and feature probability distributions?
To answer these questions, we focus our investigations on an interpretable binary classification task: detecting digit 3 versus 8 in MNIST [lecun1998mnist]. This dataset includes handwritten digits as images of size 28 x 28. The training data is large (N = 12,000) and high-dimensional (M = 784). We use cross-validation to tune all hyperparameters and select the Soft-Logistic function from Table I as the weighting function.
To measure the effect of adaptive observation and/or feature selection, we train MPBoost on MNIST (3 vs. 8). Then, we turn off the adaptive updates for the observation distribution and replace it with the uniform distribution, resulting in random observation sampling. We do the same analogously for the features. Finally, we turn off adaptive sampling for both features and observations. We repeat each experiment several times. Figure 1(a) shows the superiority of joint adaptive sampling of both observations and features in terms of performance on the test data. Further, we compare the training, out-of-patch, and test accuracy curves for MPBoost in Figure 1(b). The dashed line indicates the stopping time of our algorithm based on the out-of-patch accuracy curve and our stopping heuristic, Algorithm 2. To observe the behavior of the three curves, we let MPBoost continue after the stopping criterion is satisfied. As shown in Figure 1(b), and unlike the training curve, the trend in the out-of-patch curve is similar to that of the test curve. These results demonstrate the power that adaptive sampling of both observations and features brings to boost performance, as well as the ability to use the out-of-patch accuracy curve to assess the algorithm's progress.

Next, we illustrate how to use the learned probability distributions to interpret the observations and features. First, observations that are difficult to classify are upweighted in the observation distribution. Hence, we can use it to identify the most challenging samples, yielding a type of interpretation similar to that commonly employed in support vector machines [cortes1995support]. To visualize the final observation probabilities, we project the observations onto a two-dimensional space using PCA, with the size of each point proportional to its probability. We show this visualization before and after training in Figure 3. Further, we highlight a few of the observations with large probabilities, illustrating that these samples are indeed difficult to distinguish between the two classes.
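This visualization can be reproduced along the following lines; the probabilities here are random stand-ins for the learned observation distribution, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))          # stand-in for the training data
p = rng.dirichlet(np.ones(500))         # stand-in for learned probabilities
Z = PCA(n_components=2).fit_transform(X)  # project observations to 2-D
sizes = 2000.0 * p / p.max()            # marker area proportional to p
# A scatter plot would then display the result, e.g. with matplotlib:
# plt.scatter(Z[:, 0], Z[:, 1], s=sizes)
```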

Finally, we show how to use the feature probability distribution to find the most important features, and illustrate how MPBoost learns these feature probabilities in Figure 4. Here, the color of each pixel (feature) is proportional to its probability, with darker pixels indicating that the feature has been upweighted. We expect a sparse representation for the feature distribution, which matches our result. Moreover, this example clearly shows how to interpret the most relevant features: in Figure 3(f), two regions are darker than the other pixels, corresponding to the areas that distinguish digit 3 from digit 8.
3.2 Comparative Empirical Results
Here, we compare the speed and performance of our algorithm with AdaBoost, gradient boosting, and random forest on multiple binary classification tasks. To this end, we select large real datasets (with large $N$ and/or $M$) from the UCI machine learning repository [Dua:2019], MNIST [lecun1998mnist], and CIFAR10 [krizhevsky2009learning]. Moreover, we use a sparse synthetic dataset of two high-dimensional cones. The sizes of the datasets are provided in Appendix A.

To be fair, we choose the oracle hyperparameters for every method. To this end, we pick decision trees with different maximum depths or depth-saturated trees as weak learners. Note that we use scikit-learn [scikitlearn] modules to implement our algorithm and all competitors so that all time-based comparisons are fair. For our method, we select the minipatch sizes and the momentum from a grid of candidate values.
For each dataset, we select the best performance of MPBoost versus that of AdaBoost, gradient boosting, and random forest constrained to the runtime of MPBoost. Table II shows the best performance of each algorithm within this fixed runtime (the MPBoost training time). The results indicate that MPBoost achieves better performance faster than the other three algorithms. We also provide a more comprehensive comparison in Table III, where we can see that, without any runtime constraint, MPBoost is much faster with comparable accuracy across a wide variety of datasets.
Dataset  MPBoost  AdaBoost  Random Forest  Gradient Boosting 
Cones  
HillValley  
Christine  
Jasmine  
Philippine  
SensIT Vehicle  
Higgs Boson  
MNIST (3,8)  
MNIST (O,E)  
CIFAR10 (T,C)  
GAS Drift  
DNA  
Volkert 
4 Discussion
In this work, we proposed a new boosting approach using adaptive minipatch learning. We showed that our approach is computationally faster, as well as more interpretable, compared to standard boosting algorithms. Moreover, we showed that our algorithm outperforms AdaBoost and gradient boosting within a fixed runtime and has comparable performance without any time constraint.
Our approach is particularly useful for large-data settings where both the number of observations and the number of features are large. Further, our MPBoost algorithm is particularly fitting for datasets with sparse or noisy features because it learns the important features. Additionally, our adaptive observation selection could be used in active learning problems.
In this paper we suggested extensions of MPBoost for multiclass classification and regression problems. In future work, we plan to compare them empirically to existing algorithms. Another avenue of future work is to apply the minipatch learning scheme not only to AdaBoost-like methods but also to gradient-boosting-based methods. Additionally, one could explore other updating schemes for both the observation and feature probabilities. Theoretical analysis of MPBoost would help explain the performance of our algorithm and its properties. More efficient implementations can further improve the speed of MPBoost, and efficient memory allocation is another essential task ahead, since reducing the required hardware would enable industry-scale problems to exploit the properties of MPBoost.
Dataset  Criteria  MPBoost  AdaBoost  Random Forest  Gradient Boosting 
Cones  Test  
Time  
HillValley  Test  
Time  
Christine  Test  
Time  
Jasmine  Test  
Time  
Philippine  Test  
Time  
SensIT Vehicle  Test  
Time  
Higgs Boson  Test  
Time  
MNIST (3,8)  Test  
Time  
MNIST (O,E)  Test  
Time  
CIFAR10 (T,C)  Test  
Time  
GAS Drift  Test  
Time  
DNA  Test  
Time  
Volkert  Test  
Time 
References
Appendix A Dataset
For our specific problem, we select the datasets listed in Table IV from the UCI machine learning repository [Dua:2019], MNIST [lecun1998mnist], and CIFAR [krizhevsky2009learning]. Note that for multiclass datasets like MNIST and CIFAR10, we select two classes that are harder to distinguish, like digits 3 and 8 from MNIST or the "Truck & Car" classes from CIFAR10. Additionally, for some multiclass datasets we partition all classes into two groups, like "Odd & Even" digits in MNIST. About 20% of each dataset is selected as the test data. "Cones" is a sparse synthetic dataset of two originally low-dimensional cones; adding noise features, we transform the data points to 500 dimensions.

Dataset  Train Size  Features  Test Size
Cones  20000  500  5000 
HillValley  1000  100  200 
Christine  4300  1636  1100 
Jasmine  2400  144  600 
Philippine  4700  308  1100 
SensIT Vehicle  78800  100  19700 
Higgs Boson  200000  30  50000 
MNIST (3 vs. 8)  12000  784  2000 
MNIST (Odd vs. Even)  60000  784  10000 
CIFAR10 (Truck vs. Car)  10000  3072  2000 
Gas Drift  11100  128  2800 
DNA  2600  180  6000 
Volkert  46600  180  11700 
Fabert  6600  800  1600 
Appendix B Stopping Criterion Algorithm
The trend in the out-of-patch accuracy approximates the generalization accuracy; however, it exhibits numerical oscillations. To make the stopping algorithm (Algorithm 2) robust to such oscillations, instead of tracking only the largest previous value of the out-of-patch accuracy, it keeps the highest previous values in a list and compares the current value with their minimum. If the value does not improve after a number of sequential iterations, then MPBoost halts.
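A sketch of this heuristic in Python; the list size, margin, and patience values below are illustrative assumptions, not the paper's settings.

```python
# Heuristic stopping sketch (in the spirit of Algorithm 2): keep the k best
# previous out-of-patch accuracies and stop once the current value fails to
# beat their minimum, by some margin, for `patience` consecutive iterations.
def make_stopper(k=5, margin=1e-3, patience=10):
    best, stalls = [], 0

    def should_stop(oop_acc):
        nonlocal stalls
        if len(best) < k or oop_acc > min(best) + margin:
            stalls = 0            # still improving
        else:
            stalls += 1           # no improvement over the k best values
        best.append(oop_acc)
        best.sort(reverse=True)
        del best[k:]              # keep only the k highest values
        return stalls >= patience

    return should_stop
```

Comparing against the minimum of the k best values, rather than the single maximum, is what makes the rule tolerant of the oscillations in the out-of-patch curve.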