LLMs as general reasoners
In ML it’s common to measure generalization as the gap in performance between the training, validation, and test datasets. Training data is what your model actively optimizes against using some optimization procedure (e.g. SGD, conjugate gradient, or quadratic programming). The validation data is used to estimate the model's performance on unseen data while you are still making choices about the model. Since that process of selection (hyperparameter selection, stopping criteria) invariably overfits the model to the validation data, at the very end we get an estimate of model performance from the test dataset, which should not be used anywhere in the process of training a model....
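To make the three roles concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration, not something from a particular project: the synthetic task (y = x^2 plus noise), the k-nearest-neighbor regressor, the 200/50/50 split, and the candidate values of k. The point is only the workflow: fit on training data, select on validation data, and look at the test data once at the end.

```python
import random


def make_data(n, rng):
    # Hypothetical synthetic 1-D regression task: y = x^2 + Gaussian noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, x * x + rng.gauss(0.0, 0.1)) for x in xs]


def knn_predict(train, x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, data, k):
    # Mean squared error of the k-NN model (built from `train`) on `data`.
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
data = make_data(300, rng)
train, val, test = data[:200], data[200:250], data[250:]

# Model selection: choose the hyperparameter k using validation data only.
candidates = [1, 3, 5, 9, 15, 25]
val_errors = {k: mse(train, val, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

train_error = mse(train, train, best_k)
val_error = val_errors[best_k]
# The test set is evaluated once, after all selection is finished.
test_error = mse(train, test, best_k)
generalization_gap = test_error - train_error

print(f"best k: {best_k}")
print(f"train MSE: {train_error:.4f}")
print(f"validation MSE: {val_error:.4f}")
print(f"test MSE: {test_error:.4f}")
print(f"gap (test - train): {generalization_gap:.4f}")
```

Because `best_k` was picked to minimize the validation error, that number is an optimistic estimate by construction. The test error was not involved in any choice, so it is the one to report.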