Resampling methods involve repeatedly refitting a selected model on subsets drawn from the training dataset, with the objective of gaining further insight into the model's performance. This process provides additional information about the fitted model, such as an estimate of the prediction error on a test dataset and the standard deviation and bias of parameter estimates. The five points below summarize the key ideas.
1) Distinguishing Between Test and Training Errors:
Test error is the average error incurred when a statistical learning method is used to predict the response for new observations that were not part of the model's training data. Training error, by contrast, is readily computed by applying the method to the same data used to fit the model. The training error rate can differ substantially from the test error rate, and in particular it frequently underestimates the test error.
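As a concrete illustration, here is a minimal sketch (assuming NumPy, scikit-learn, and simulated data, none of which come from the original text) that fits a flexible polynomial model and compares the error on the training data with the error on held-out data; the training MSE is typically the smaller of the two.

```python
# Sketch: training error tends to underestimate test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)

# Hold out half the data as "new observations" the model never sees.
x_train, x_test = x[:100], x[100:]
y_train, y_test = y[:100], y[100:]

# A flexible model (degree-10 polynomial) fit to the training data only.
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
model.fit(x_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(x_train))
test_mse = mean_squared_error(y_test, model.predict(x_test))
print(f"training MSE: {train_mse:.3f}  test MSE: {test_mse:.3f}")
# The training MSE is usually noticeably smaller than the test MSE.
```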
2) Balancing Bias and Variance:
Bias and variance together determine the expected test error. Reducing one typically increases the other, so the goal is not to minimize either in isolation but to strike the balance between them that minimizes the overall test error.
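The standard squared-error decomposition makes this explicit. In the notation below (introduced here for illustration), $\hat{f}(x_0)$ is the fitted model's prediction at a test point $x_0$, $y_0$ is the corresponding response, and $\varepsilon$ is the irreducible noise:

```latex
% Expected test MSE at a point x_0, decomposed into variance, squared bias,
% and irreducible error (the variance of the noise term \varepsilon):
\mathbb{E}\left[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\right]
  = \operatorname{Var}\bigl(\hat{f}(x_0)\bigr)
  + \bigl[\operatorname{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2
  + \operatorname{Var}(\varepsilon)
```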
3) Validation Set Approach:
In this approach, the available dataset is randomly partitioned into two subsets: a training set and a validation (or hold-out) set. The model is trained on the training set, and the resultant model is utilized to predict responses for the observations in the validation set. The validation set error furnishes an estimate of the test error. This estimation typically employs mean squared error (MSE) for quantitative responses and the misclassification rate for categorical responses.
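A minimal sketch of this approach, assuming scikit-learn's train_test_split, a linear model, and simulated data (all illustrative choices, not from the original text):

```python
# Sketch: validation set approach with a single random split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=300)

# Randomly partition the data: 70% training, 30% validation (hold-out).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# The validation MSE serves as an estimate of the test error.
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"validation MSE (test-error estimate): {val_mse:.3f}")
```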
4) Limitations of the Validation Approach:
The validation estimate of the test error may exhibit significant variability, contingent on the specific observations included in the training and validation sets. In this approach, only a subset of the observations—those in the training set rather than the validation set—is employed to fit the model. Consequently, the validation set error may tend to overestimate the test error for the model fitted to the entire dataset. This discrepancy arises because a larger dataset generally leads to lower error rates.
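The sketch below (again assuming scikit-learn and simulated data) repeats the random split with different seeds to show how much the validation estimate can move from one split to the next:

```python
# Sketch: variability of the validation estimate across random splits.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=2.0, size=100)

estimates = []
for seed in range(10):
    # A different random 50/50 split on each repetition.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    estimates.append(mean_squared_error(y_val, model.predict(X_val)))

print(f"validation MSE estimates range from {min(estimates):.2f} to {max(estimates):.2f}")
```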
5) K-Fold Cross-Validation:
K-fold cross-validation is a widely adopted technique for estimating test error. The idea is to randomly divide the data into K roughly equal-sized parts, or folds. One fold (fold k) is held out, the model is trained on the remaining K-1 folds, and predictions are generated for the held-out kth fold. This process is repeated for each fold (k = 1, 2, …, K), and the results are combined. Since each training set contains only (K-1)/K of the original observations, the prediction error estimates tend to be biased upward. This bias is smallest when K equals the number of observations (K = n, the leave-one-out case), but that estimate tends to have high variance.
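A minimal sketch of K-fold cross-validation, assuming scikit-learn's KFold and simulated data (illustrative choices only):

```python
# Sketch: K-fold cross-validation, averaging the held-out-fold errors.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=150)

K = 5
fold_errors = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on the other K-1 folds
    preds = model.predict(X[test_idx])                            # predict the held-out fold
    fold_errors.append(mean_squared_error(y[test_idx], preds))

# The cross-validation estimate of the test error is the average over folds.
print(f"{K}-fold CV estimate of test MSE: {np.mean(fold_errors):.3f}")
```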