Topics
Here we list a selection of possible topics, along with some basic literature that you can use to get familiar with each topic.
Basic literature
The following resources discuss model comparison (or certain issues) from a broad perspective
Bürkner, P.-C., Scholz, M., & Radev, S. T. (2023). Some models are useful, but how do we know which ones? Towards a unified Bayesian model taxonomy. Statistics Surveys, 17, 216–310.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231.
Dubova, M., Chandramouli, S., Gigerenzer, G., Grünwald, P., Holmes, W., Lombrozo, T., Marelli, M., Musslick, S., Nicenboim, B., Ross, L., Shiffrin, R., White, M., Wagenmakers, E.-J., Bürkner, P.-C., & Sloman, S. J. (2024). Is Occam’s razor losing its edge? New perspectives on the principle of model parsimony. https://doi.org/10.31222/osf.io/bs5xe
McElreath, R. (2023). Statistical Rethinking 2023 - 07 - Fitting Over & Under. Richard McElreath channel on YouTube. https://youtu.be/1VgYIsANQck?si=dsRgGkRlyCAcB0xG
Navarro, D. J. (2019). Between the devil and the deep blue sea: Tensions between scientific judgement and statistical model selection. Computational Brain & Behavior, 2(1), 28–34. https://doi.org/10.1007/s42113-018-0019-z
Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
You can find helpful reviews of different techniques in
Ding, J., Tarokh, V., & Yang, Y. (2018). Model selection techniques: An overview. IEEE Signal Processing Magazine, 35(6), 16–34. https://arxiv.org/abs/1810.09583
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Model assessment and selection. In The elements of statistical learning: Data mining, inference, and prediction (pp. 219–259). Springer Series in Statistics. Springer, New York, NY. https://link.springer.com/chapter/10.1007/978-0-387-21606-5_7#preview
Individual methods
Below is a list of model comparison techniques. This list is by no means exhaustive; you are welcome to suggest other techniques that you would like to cover.
Cross validation
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Model assessment and selection. In The elements of statistical learning: Data mining, inference, and prediction (pp. 219–259). Springer Series in Statistics. Springer, New York, NY. https://link.springer.com/chapter/10.1007/978-0-387-21606-5_7#preview
Bootstrapping
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Model assessment and selection. In The elements of statistical learning: Data mining, inference, and prediction (pp. 219–259). Springer Series in Statistics. Springer, New York, NY. https://link.springer.com/chapter/10.1007/978-0-387-21606-5_7#preview
AIC
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Model assessment and selection. In The elements of statistical learning: Data mining, inference, and prediction (pp. 219–259). Springer Series in Statistics. Springer, New York, NY. https://link.springer.com/chapter/10.1007/978-0-387-21606-5_7#preview
Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
BIC
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 461–464.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Model assessment and selection. In The elements of statistical learning: Data mining, inference, and prediction (pp. 219–259). Springer Series in Statistics. Springer, New York, NY. https://link.springer.com/chapter/10.1007/978-0-387-21606-5_7#preview
Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
Methods for time-series
Bergmeir, C., Hyndman, R. J., & Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120, 70–83.
Bürkner, P.-C., Gabry, J., & Vehtari, A. (2020). Approximate leave-future-out cross-validation for Bayesian time series models. Journal of Statistical Computation and Simulation, 90(14), 2499–2523.
Hyndman, R., & Athanasopoulos, G. (2021). Forecasting: Principles and practice (3rd ed.). OTexts, Melbourne, Australia. https://otexts.com/fpp3/
The following topics are somewhat advanced and will require more work (or prior familiarity with Bayesian statistics) to cover well. On the other hand, they are arguably also more fun.
DIC
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B: Statistical Methodology, 64(4), 583–639.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(3), 485–493.
WAIC (+WBIC)
Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(12).
Watanabe, S. (2013). A widely applicable Bayesian information criterion. The Journal of Machine Learning Research, 14(1), 867–897.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
PSIS-LOO
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
Bayes factors
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.
Berger, J., & Pericchi, L. (2014). Bayes factors. Wiley StatsRef: Statistics Reference Online, 1–14.
Amortized methods
Radev, S. T., D’Alessandro, M., Mertens, U. K., Voss, A., Köthe, U., & Bürkner, P.-C. (2021). Amortized Bayesian model comparison with evidential deep learning. IEEE Transactions on Neural Networks and Learning Systems, 34(8), 4903–4917.
Elsemüller, L., Schnuerch, M., Bürkner, P.-C., & Radev, S. T. (2023). A deep learning method for comparing Bayesian hierarchical models. arXiv:2301.11873. https://arxiv.org/abs/2301.11873
Minimum description length
Grünwald, P. (2000). Model selection based on minimum description length. Journal of Mathematical Psychology, 44(1), 133–152.
Grünwald, P. (2005). Minimum description length tutorial. Advances in Minimum Description Length: Theory and Applications, 5, 1–80.
Grünwald, P., & Roos, T. (2019). Minimum description length revisited. International Journal of Mathematics for Industry, 11(01), 1930001.
Shiffrin, R. M., Chandramouli, S. H., & Grünwald, P. D. (2016). Bayes factors, relations to minimum description length, and overlapping model classes. Journal of Mathematical Psychology, 72, 56–77.
Model averaging
Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E.-J. (2020). A conceptual introduction to Bayesian model averaging. Advances in Methods and Practices in Psychological Science, 3(2), 200–215.
Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4), 382–417.
Model stacking
Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 13(3), 917–1007. https://projecteuclid.org/euclid.ba/1516093227
Comparison of different approaches
Instead of covering a single method for model comparison, you can also pick two (or three) methods and discuss how they differ, in which contexts one should be chosen over another, and so on.
This may seem like more work (“why cover two topics if I can cover one?”), but just as when trying to understand the behavior of statistical models, comparing several alternatives often makes it easier to see the benefits and drawbacks of each one individually.
Covering a wider topic also makes it easier to keep the big picture in view and leaves less room to dive deep into a single method, allowing you to focus more on the practical aspects rather than the theoretical justification of each method.
For example:
- AIC vs BIC: What are their respective goals? How do they differ in terms of penalizing model complexity? When should one use AIC and when BIC?
- BIC vs Bayes factors: Is BIC really Bayesian? Why or why not? How does BIC relate to Bayes factors? When would they give different answers?
- MDL vs Bayes factors: How does minimum description length relate to Bayes factors?
- Model averaging vs model stacking vs model selection: What are the differences between these approaches? When would you use one approach over another?
- AIC vs Cross-validation: How does AIC relate to LOO-CV? Can you use them interchangeably? Why or why not? When would you choose one or the other? (A small numerical sketch follows this list.)
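The following sketch is a purely illustrative aid for the AIC vs BIC and AIC vs cross-validation questions above: it fits two polynomial regression models to simulated data and computes AIC, BIC, and exact leave-one-out CV by hand. The simulated data, the Gaussian linear models, the polynomial degrees, and all variable names are assumptions made for this illustration only; they are not taken from any of the referenced papers.

```python
# A minimal sketch (all data simulated, Gaussian linear models assumed) comparing
# how AIC, BIC, and exact leave-one-out cross-validation score two candidate models.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(-2, 2, n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)    # the true relationship is linear


def design(x, degree):
    """Polynomial design matrix with an intercept column."""
    return np.vander(x, degree + 1, increasing=True)


def gaussian_loglik(y, yhat, sigma2):
    """Log-likelihood of y under a Gaussian with mean yhat and variance sigma2."""
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (y - yhat) ** 2 / sigma2)


def fit_and_score(x, y, degree):
    X = design(x, degree)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    sigma2 = np.mean((y - yhat) ** 2)          # ML estimate of the noise variance
    k = X.shape[1] + 1                         # regression coefficients + variance
    loglik = gaussian_loglik(y, yhat, sigma2)
    aic = 2 * k - 2 * loglik                   # penalty grows linearly in k
    bic = k * np.log(len(y)) - 2 * loglik      # penalty also grows with log(n)

    # Exact leave-one-out CV: refit with one observation held out at a time and
    # sum the log predictive density of each held-out point.
    loo = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        s2 = np.mean((y[mask] - X[mask] @ b) ** 2)
        loo += gaussian_loglik(y[i:i + 1], X[i:i + 1] @ b, s2)
    return aic, bic, -2 * loo                  # report LOO on the deviance scale


for degree in (1, 4):
    aic, bic, loo_dev = fit_and_score(x, y, degree)
    print(f"degree {degree}: AIC = {aic:.1f}, BIC = {bic:.1f}, LOO deviance = {loo_dev:.1f}")
```

Because BIC's penalty grows with log(n) while AIC's does not, rerunning the sketch with a larger n tends to make BIC favor the smaller model more strongly than AIC; the LOO deviance gives a directly predictive point of comparison for the AIC vs cross-validation question.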
For literature, you can start by looking into the resources listed under the relevant subsections in Section 2.