Hyperparameter Tuning for Gradient Boosting Frameworks Using Statistical Experimental Design
Introduction/Background
Experimental design is a crucial part of any statistical experiment, including computer simulations. To explore the effects of efficient design on a computer experiment, we examine three published scientific papers and repeat their experiments using alternative methods of analysis built on advanced experimental designs such as Latin Hypercube Sampling (LHS) and Maximum Projection designs.
In the computer simulations discussed here, the experiments at hand are the machine learning models themselves, with the predictive performance of the model serving as the output and the various hyperparameters serving as inputs. A common theme throughout statistical and machine learning research papers – including all the computational papers we examine here – is that the authors implement an exhaustive grid search to optimize the model hyperparameters. This approach is computationally expensive, both in terms of model run times and computing resources. Moreover, choosing hyperparameters in this way makes no attempt to model the relationship between model performance and the hyperparameters. We would like to address these shortcomings in our analysis.
For this project, we keep a few key goals in mind. First, we would like to achieve better predictive performance than what was reported in each of the individual papers, which are geared toward computational physics and chemistry rather than toward statistically optimal model tuning. Second, we look to reduce the estimated run time to save on computational resources. And lastly, we seek to produce a systematic framework with which one can continually improve model performance. As for our methodology, we incorporate principles of computer experiment design in conjunction with Gaussian Process (GP) modeling, with more details presented in the Appendix section below.
In each paper, the authors use a common method: XGBoost (Chen et al. 2016). For a quick description of the model parameters, see Chen et al. (2018a). In the remainder of this report we discuss each of the investigated research papers in turn, comparing their experimental design methodology and final model results to those of our analysis.
Critical Temperatures of Superconductors
The first paper we analyze is “A Data-Driven Statistical Model for Predicting the Critical Temperature of a Superconductor”, published in Computational Materials Science. The author’s objective in this first paper is to predict the temperature at which a material becomes a superconductor. A good use case for modeling this relationship is to pre-screen various compounds and find their critical temperatures. There is currently no widely accepted physical theory for the critical temperature of a superconductor, so a statistical model is a good alternative for modeling this behavior.1
In the dataset that the author used and provided, the outcome variable is the critical temperature and the predictor variables are various physical properties of the material, such as atomic mass, atomic radius, etc. The data contain 21,263 observations with 82 columns in total. For their analysis, the author conducted an exhaustive 198-point grid search over five hyperparameters: learning rate, column subsampling, row subsampling, minimum nodes, and maximum depth (Table 1). This amounts to taking all possible combinations of their chosen level settings. For performance evaluation, they used 25-fold Monte Carlo cross-validation (CV) with root mean square error (RMSE) as the performance metric: in each iteration they split the data into two-thirds for model fitting and one-third for model evaluation. (A sketch of this evaluation procedure appears after Table 1.)
Table 1: Hyperparameter levels in the author’s grid search.

| Parameters | Levels |
|---|---|
| Learning rate | 0.010, 0.015, 0.020 |
| Column subsampling | 0.25, 0.50, 0.75 |
| Subsample ratio | 0.5 |
| Minimum nodes | 1, 10 |
| Maximum depth | 15, 16, …, 25 |
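To make the evaluation concrete, the following R sketch implements a 25-fold Monte Carlo CV loop of this kind, assuming the xgboost package. The feature matrix `X`, response `y`, number of boosting rounds, and the hyperparameter values shown are illustrative placeholders, not the paper’s exact configuration.

```r
# Sketch of the paper's 25-fold Monte Carlo cross-validation, assuming the
# xgboost R package. X is a numeric feature matrix and y the vector of
# critical temperatures; both are placeholders for the actual dataset.
library(xgboost)

mc_cv_rmse <- function(X, y, params, n_splits = 25, nrounds = 500) {
  rmses <- numeric(n_splits)
  for (i in seq_len(n_splits)) {
    idx    <- sample(nrow(X), size = floor(2 / 3 * nrow(X)))  # 2/3 train, 1/3 test
    dtrain <- xgb.DMatrix(X[idx, ], label = y[idx])
    fit    <- xgb.train(params = params, data = dtrain,
                        nrounds = nrounds, verbose = 0)
    pred   <- predict(fit, X[-idx, ])
    rmses[i] <- sqrt(mean((y[-idx] - pred)^2))  # test RMSE for this split
  }
  mean(rmses)  # average RMSE over the random splits
}

# One illustrative grid point (not the author's reported optimum)
params <- list(eta = 0.015, colsample_bytree = 0.50, subsample = 0.5,
               min_child_weight = 10, max_depth = 16,
               objective = "reg:squarederror")
```

Each call returns the averaged test RMSE, which is the quantity that both the author’s grid search and our optimization below seek to minimize.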
For our analysis, we construct a 10-point Maximum Projection Latin Hypercube design (using the MaxPro package in R) and use the same evaluation procedure as the author of the original paper. For discrete hyperparameters, such as the maximum depth of the trees, we map the design samples through the inverse CDF of a uniform distribution over the parameter’s integer range and take the ceiling of the resulting values. For continuous hyperparameters, a simple linear transformation rescales the design into the appropriate range. (We repeat this procedure for all three papers; a sketch follows.)
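A minimal sketch of the design construction and rescaling, using MaxProLHD from the MaxPro package. The four parameter ranges below are illustrative stand-ins; they only need to cover a plausible search region.

```r
# Sketch of our 10-point maximum projection Latin hypercube design using the
# MaxPro R package; the design is generated on the unit hypercube [0, 1]^4.
library(MaxPro)

set.seed(1)
des <- MaxProLHD(n = 10, p = 4)$Design

# Continuous parameters: rescale [0, 1] linearly into the target range
learning_rate <- 0.01 + des[, 1] * (0.10 - 0.01)
subsample     <- 0.20 + des[, 2] * (0.80 - 0.20)

# Discrete parameters: inverse CDF of a uniform over the integer range,
# then take the ceiling to land on whole numbers
min_nodes <- ceiling(qunif(des[, 3], min = 0, max = 15))  # integers in 1..15
max_depth <- ceiling(qunif(des[, 4], min = 5, max = 25))  # integers in 6..25

design <- data.frame(learning_rate, subsample, min_nodes, max_depth)
```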
We can visualize all two-dimensional projections of our design alongside those of the author’s design (Figure 1). From the plots we can see that the author’s design is not space filling in all of its projections, whereas our design enjoys good space-filling and projection properties. Some projections of the author’s design are degenerate: maximum depth against minimum nodes, both discrete parameters, projects as a single line. And although some of the author’s parameters are spread uniformly, like maximum depth, this is not the case for all of them, especially continuous parameters like the learning rate.

For the Gaussian Process (GP) model, we use the Matérn 5/2 covariance function with a small noise term added and a linear mean function. We introduce the noise term because parameters like row and column subsampling are inherently random, so repeated runs at the same settings yield different performance evaluations. The estimated GP parameters can be seen in Table 2 and Table 3 below. Then, we run 10 iterations of the Efficient Global Optimization (EGO) algorithm. We can see that after only a couple of iterations, the gradient boosting model already achieves a lower RMSE than the author’s final reported value (Figure 2). In the end we arrive at an entirely different set of parameters than the author’s (Table 4). Further, the computational cost of running the model from start to finish is drastically reduced: in this instance, we achieve better performance in approximately 1/10th of the run time of the author’s experiment. We estimate the author’s run time by taking the average run time of a single configuration on our 12-core desktop computer and multiplying by the number of configurations the author evaluated to find hyperparameter values.
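The following sketch shows one way to implement the surrogate fit and the EGO loop, assuming the DiceKriging and DiceOptim packages (the report does not tie itself to a specific library). `design` and `mc_cv_rmse` carry over from the earlier sketches, and the mapping of design columns to XGBoost arguments (eta, subsample, min_child_weight, max_depth) is our assumption.

```r
# Sketch of the GP surrogate and EGO loop, assuming the DiceKriging and
# DiceOptim R packages. design, mc_cv_rmse, X, and y come from the
# sketches above.
library(DiceKriging)
library(DiceOptim)

# Wrap the CV evaluation so EGO can call it on a numeric vector; discrete
# coordinates are rounded so the search space stays continuous while the
# evaluated configurations remain valid XGBoost settings
objective <- function(x) {
  x <- as.numeric(x)
  mc_cv_rmse(X, y, params = list(eta = x[1], subsample = x[2],
                                 min_child_weight = round(x[3]),
                                 max_depth = round(x[4]),
                                 objective = "reg:squarederror"))
}

# Evaluate the CV RMSE at each of the 10 design points
rmse <- apply(design, 1, objective)

# Matern 5/2 kernel with a linear trend (~.) and an estimated nugget to
# absorb the noise from row/column subsampling
gp <- km(formula = ~., design = design, response = rmse,
         covtype = "matern5_2", nugget.estim = TRUE)

# Ten EGO iterations: maximize expected improvement, evaluate, refit
ego <- EGO.nsteps(model = gp, fun = objective, nsteps = 10,
                  lower = c(0.01, 0.20, 1, 6), upper = c(0.10, 0.80, 15, 25))

ego$par[which.min(ego$value), ]  # best hyperparameter setting found
```

Rounding the discrete coordinates inside the objective is a pragmatic choice: the expected-improvement search runs over a continuous box, while every configuration actually evaluated is a legal XGBoost setting.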
Table 2: Estimated coefficients of the GP linear mean function.

| Parameters | Estimates |
|---|---|
| (Intercept) | 88.299 |
| Learning rate | 1.532 |
| Subsample ratio | 0.400 |
| Minimum nodes | -2.560 |
| Maximum depth | 0.490 |
Table 3: Estimated GP correlation parameters.

| Parameters | Estimates |
|---|---|
| Theta (Learning rate) | 0.455 |
| Theta (Subsample ratio) | 1.800 |
| Theta (Minimum nodes) | 0.164 |
| Theta (Maximum depth) | 1.800 |

Table 4: Final hyperparameter settings and performance for the author’s grid search versus our approach.

| | Author’s Results | Our Results |
|---|---|---|
| Learning rate | 0.016 | 0.029 |
| Subsample ratio | 0.400 | 0.200 |
| Minimum nodes | 10 | 11 |
| Maximum depth | 16 | 10 |
| RMSE | 44.090 | 41.911 |
| Estimated run time | 2 hours | 8 minutes |
To be continued…
Footnotes
A superconductor is classified as any material that can transport electric charge with no resistance, and hence no energy loss.