Model Comparison Seminar

TU Dortmund, Summer semester 2024

Introduction to Model Comparison

Today

  1. Introduction to Model Comparison
  2. Discussion: Presentation topics
  3. Plan for the next week

Those who missed the first seminar…

  • Please briefly introduce yourself
  • How much experience do you already have with model comparison?
  • What motivated you to take the course?
  • What do you expect to learn?

Basic principles

Generalizability

  • Find a model that generalizes well
    • Minimise loss for future data
    • Minimise loss for observed data, penalised by complexity
  • Example loss functions:
    • Squared loss: \((y-\hat{y})^2\)
    • log predictive density: \(\log p(y \mid M_k)\)
    • AUC
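
A minimal Python sketch (assuming NumPy/SciPy; the data and predictions are made up) of the first two loss functions above, evaluated for predictions from some candidate model:

  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(1)
  y = rng.normal(loc=2.0, scale=1.0, size=100)   # observed outcomes
  y_hat = np.full_like(y, 2.1)                   # point predictions from a candidate model

  # Squared loss, averaged over observations
  squared_loss = np.mean((y - y_hat) ** 2)

  # Log predictive density, here under a normal predictive distribution
  # with mean y_hat and an assumed sd of 1
  lpd = np.sum(norm.logpdf(y, loc=y_hat, scale=1.0))

  print(squared_loss, lpd)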

Bias-Variance decomposition

\[\begin{equation} E\big[(y - \hat{f}(x))^2\big] = \text{Bias}^2\big(\hat{f}(x)\big) + \text{Var}\big(\hat{f}(x)\big) + \sigma^2 \end{equation}\]

  • \(\hat{f}(x)\): model predictions
  • Bias: \(E\big(\hat{f}(x)\big) - f(x)\)
  • Variance: \(E\Big(\big[\hat{f}(x) - E\big(\hat{f}(x)\big)\big]^2\Big)\)
  • \(\sigma^2\): irreducible error
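
The decomposition above can be checked by simulation. A sketch (Python/NumPy; the true function, noise level, and polynomial model are arbitrary choices for illustration):

  import numpy as np

  rng = np.random.default_rng(2)
  f = lambda x: np.sin(2 * np.pi * x)    # true regression function f(x)
  sigma = 0.3                            # sd of the irreducible noise
  x0 = 0.5                               # point at which we evaluate the decomposition
  n, degree, n_sims = 30, 3, 5000

  preds = np.empty(n_sims)
  for s in range(n_sims):
      x = rng.uniform(0, 1, n)
      y = f(x) + rng.normal(0, sigma, n)
      coefs = np.polyfit(x, y, degree)   # fit \hat{f} on a fresh training set
      preds[s] = np.polyval(coefs, x0)   # \hat{f}(x_0)

  bias_sq = (preds.mean() - f(x0)) ** 2
  variance = preds.var()
  y_new = f(x0) + rng.normal(0, sigma, n_sims)        # fresh test observations at x_0
  expected_sq_error = np.mean((y_new - preds) ** 2)   # E[(y - \hat{f}(x_0))^2]

  # The two sides of the decomposition agree up to Monte Carlo error
  print(expected_sq_error, bias_sq + variance + sigma ** 2)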

Bias-Variance and Complexity

[Figure: bias, variance, and total error as a function of model complexity. Image: Bigbossfarin, CC0, via Wikimedia Commons]

Parsimony

  • “Whenever we have equivalent explanations of the observed data, the simplest one is preferable”
    • Two models that fit the same data well \(\rightarrow\) the simpler should be preferred
  • Simple by components vs. simple by constraints
    • Number of parameters, Functional form, Number of effective parameters
  • Prior predictive distribution: more flexible models spread their prior predictive mass over more possible data sets (Bayesian Occam’s razor)
  • Recent developments challenge this view (Dubova et al., 2024)
    • “Complexity and parsimony as complementary principles for scientific discovery”

Why model comparison

  • Uncertainty in what is a good (useful) model
  • Different approaches:
    • What are the goals of the analysis
      • Different views on parsimony, interpretability, importance of out of sample performance, loss functions, …
    • Assumptions
    • Frequentist vs Bayesian, Fast vs slow, Exact vs approximate,…

Why model?

Goals

Modeling goals:

  1. Description
  2. Explanation
  3. Prediction
  4. Control

Description

  • Want to infer about some state of the world
  • Emphasis on observable variables rather than latent constructs
  • Estimating public opinions/attributes from a survey, (micro)census, etc.
  • Capture empirical phenomena, establish relationships between variables
  • Summarise complex data structure in a compact manner

Prediction

  • We want to accurately predict
  • Emphasis on out of sample predictions
    • Finding a model that minimises out of sample prediction error
  • Predicting election results from public polls and surveys
  • Filtering spam email
  • Predicting future sales based on past sales

Explanation

  • We want to explain how things “work”
  • Emphasis on parsimony and fit to current data
  • Finding a model that “explains” the data well but is not overly complex (ideally: interpretable)
  • Explain decreasing decision speed for aging participants

Explanation: Decision speed

  • Explanations:
    1. Older adults have lower rate of information processing than younger adults
    2. Older adults are more cautious than younger adults
    3. Older adults need more time for encoding the stimulus and triggering a response than younger adults

Explanation: Decision speed

Figure from Vinding et al. (2018)

Explanation: Decision speed

  • Compare models (Theisen et al., 2021):
    • Drift rate as a function of age (rate of information processing)
    • Threshold as a function of age (response caution)
    • Non-decision time as a function of age (perception, motor response)
    • Combination

Control

  • We want to have influence over future outcomes
  • Strong emphasis on causality (Pearl, 1995, 2009)
    • Finding a model that allows us to make good predictions under hypothetical scenarios
  • Propose treatments given symptoms and diagnosis
  • Predict effects of interventions (e.g., IPCC)

Combinations

  • Sometimes complementary to the point of unclear separation, e.g.
    • Find a model that is interpretable but also gives accurate predictions
    • Find a model that explains observed phenomena and proposes interventions
  • Sometimes at odds, e.g.
    • Increasing predictive performance can come at the cost of causal validity

Example: Increasing predictions, decreasing causal validity

\[ \begin{aligned} X &\rightarrow Z \leftarrow Y \\ p(Y \mid do(X), do(Z)) &= p(Y) \end{aligned} \]

\[ \begin{aligned} p(Y \mid X = x) = p(Y) \\ p(Y \mid X = x, Z = z) \neq p(Y) \end{aligned} \]
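
A small simulation sketch of the collider above (Python/NumPy; the linear structural equations and effect sizes are made up): adding the collider Z to the predictors makes Y more predictable, but the coefficient on X no longer reflects the (absent) causal effect of X on Y:

  import numpy as np

  rng = np.random.default_rng(3)
  n = 100_000
  x = rng.normal(size=n)
  y = rng.normal(size=n)             # no causal effect of X on Y
  z = x + y + rng.normal(size=n)     # collider: X -> Z <- Y

  # Marginally, X carries no information about Y
  print(np.corrcoef(x, y)[0, 1])     # approximately 0

  # Conditioning on the collider: regress Y on X and Z
  X_design = np.column_stack([np.ones(n), x, z])
  beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
  resid = y - X_design @ beta

  print(beta[1])                     # approx -0.5: spurious "effect" of X
  print(resid.var() / y.var())       # < 1: prediction of Y improved by including Z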

Modeling Goals: Take-away

  • Different goals might dictate which model comparison method we choose
    • Not always clear cut
  • Need to take into account factors that are not always quantified by the model comparison technique itself

Assumptions

What world do we live in?

M-closed world

  • One of the candidate models \(\{M_k\}_{k=1}^K\) is the true model
  • It is possible to place prior probabilities on models \(p(M_k)\)
  • Eventually, we should be able to identify/recover the true model (consistency)
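
In an M-closed setting, the prior model probabilities can be updated to posterior model probabilities. A sketch (Python/NumPy) with made-up marginal likelihoods for three candidate models:

  import numpy as np

  # Hypothetical M-closed setting with three candidate models:
  # prior model probabilities p(M_k) and marginal likelihoods p(y | M_k)
  prior = np.array([1 / 3, 1 / 3, 1 / 3])
  log_marginal_lik = np.array([-105.2, -101.7, -110.4])   # made-up values

  # Posterior model probabilities: p(M_k | y) proportional to p(y | M_k) p(M_k)
  log_post = np.log(prior) + log_marginal_lik
  post = np.exp(log_post - np.max(log_post))
  post /= post.sum()
  print(post.round(3))   # under M-closed, mass concentrates on the true model as data accumulate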

M-complete world

  • None of the candidate models is the true model
  • It is not reasonable to place probabilities on models \(p(M_k)\)
  • It is possible to formulate a belief model in some way
    • Candidate models are used instead (for reasons of tractability, speed, convenience, or limited knowledge)

M-open world

  • None of the candidate models is the true model
  • It is not possible to formulate a belief model
  • Assume exchangeability of current data with future data
    • Train-test, cross-validation, bootstrapping, …
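
A minimal cross-validation sketch relying only on the exchangeability assumption above (Python with NumPy and scikit-learn; the simulated data and linear model are just for illustration):

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import KFold

  rng = np.random.default_rng(4)
  X = rng.normal(size=(200, 3))
  y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=200)

  # 5-fold cross-validation: no candidate model needs to be "true"
  kf = KFold(n_splits=5, shuffle=True, random_state=0)
  losses = []
  for train_idx, test_idx in kf.split(X):
      model = LinearRegression().fit(X[train_idx], y[train_idx])
      pred = model.predict(X[test_idx])
      losses.append(np.mean((y[test_idx] - pred) ** 2))
  print(np.mean(losses))   # estimated out-of-sample squared loss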

Consequences

  • M-closed:
    • One of the models will be selected even if none of them fit the data
    • A strong need to evaluate “absolute” model fit
  • M-complete and M-open:
    • There is a chance none of the models come out as “good”

Important considerations

Model checking

  • Convergence
  • Validity of computations, unit testing
  • Retrodictive checks/Posterior predictive checks:
    • Generate predictions from the fitted model:
      • \(p(y^* \mid y) = \int p(y^* \mid \theta) \cdot p(\theta \mid y) d\theta\)
    • Compare to observed data \(y\)
  • Absolute fit measures:
    • e.g., \(R^2\), AUC, RMSEA, \(\chi^2\), CFI, …
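
A minimal posterior predictive check (Python/NumPy) for a hypothetical conjugate normal model with known sd, approximating the integral above by sampling:

  import numpy as np

  rng = np.random.default_rng(5)
  y = rng.normal(loc=1.0, scale=1.0, size=50)      # observed data

  # Normal model with known sd = 1 and a N(0, 10^2) prior on the mean:
  # the posterior for the mean is normal with the parameters below
  post_var = 1 / (1 / 10**2 + len(y) / 1**2)
  post_mean = post_var * y.sum()

  # Draw theta ~ p(theta | y), then y* ~ p(y* | theta); this approximates the integral
  theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=1000)
  y_rep = rng.normal(loc=theta_draws[:, None], scale=1.0, size=(1000, len(y)))

  # Compare a test statistic of the replicated data to the observed data
  t_obs = y.max()
  t_rep = y_rep.max(axis=1)
  print(np.mean(t_rep >= t_obs))   # posterior predictive p-value for max(y)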

Fairness

  • Assess whether model-based decisions are equitable (Barocas et al., 2023; Bürkner et al., 2023)
    • Especially considering attributes such as sex, gender, ethnic background
  • When it goes wrong \(\rightarrow\) potentially catastrophic consequences (Stanojevic, 2023; e.g., “the Dutch childcare benefits scandal,” Wikipedia, 2024)
  • Sometimes possible to test empirically (e.g., differential item functioning, Holland & Wainer, 2012)
  • Reducing bias in data sets
  • Removing protected attributes from model-based decisions

Robustness

Examples:

  • Prior specification
  • Likelihood specification (e.g., a different distribution or link function in a GLM)
  • Data perturbations, outliers, …

Speed

  • “Fast enough” depends on context
  • Simplify the models
  • Derive closed form solutions
  • Approximations (e.g., AIC instead of cross-validation, BIC instead of Bayes factors)
  • Amortized methods (Radev et al., 2020, 2023)
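
As one example of such an approximation, AIC can be computed in closed form from the maximised log-likelihood instead of running cross-validation. A sketch (Python with NumPy/SciPy) for a simple normal model with made-up data:

  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(6)
  y = rng.normal(loc=0.3, scale=1.0, size=100)

  # AIC = -2 * log-likelihood at the MLE + 2 * (number of parameters);
  # here a normal model with unknown mean and sd (k = 2)
  mu_hat, sd_hat = y.mean(), y.std()
  loglik = norm.logpdf(y, loc=mu_hat, scale=sd_hat).sum()
  aic = -2 * loglik + 2 * 2
  print(aic)   # a fast stand-in for cross-validated predictive loss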

Summary

  • Model comparison is not the be-all and end-all
    • Considerations external to model comparison itself
  • Models need to be considered in their own right
  • Important:
    • Context
    • Goals

Workflows

Modeling Workflows

Exploratory

  • We have already obtained the data and have no hypotheses, or only vague ideas, about a good model
  • Use the data to build a model that fits the data well
  • Often informal use of different model comparison techniques, weak guarantees
  • Outcome: A set of possible models for confirmatory analysis

Confirmatory

  • We already have hypotheses or specific models that we want to “test”
  • We collect data designed to distinguish between competing models
  • Inflexible, strong guarantees (if done right)
  • Often comes with “pre-registration” protocols
  • Outcome: Did any of the models fit well? Which model was the “best”?

Combined

  1. Hypotheses \(\rightarrow\) Models
  2. Collect data
  3. Evaluate hypotheses (confirmatory)
  4. Explore data, fit more models, generate new hypotheses (exploratory)
  5. Go to step 1

Iterative (Gelman et al., 2020)

  • Some idea of the modeling approach
  • Stricter workflow than exploratory, but weaker guarantees than confirmatory
  • Iterative refinement of the model on the same data
    • Evaluate the model before fitting it to the data
    • Validate computations
    • Assess the model after fitting it to the data
    • Compare (qualitatively and quantitatively) to previous iterations
    • Address computational issues
    • Justify potential modifications to the model
    • Consider additional sources of information

Splitting

  1. Training set:
    • Fit models
  2. Validation set:
    • Evaluate model fit
    • Formulate new models, hyperparameter tuning
  3. Test set:
    • Unbiased final evaluation of models
  • More flexibility, strong guarantees
  • Data intensive
  • (Potentially) computationally intensive
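
A minimal sketch of a three-way split (Python with NumPy and scikit-learn; the simulated data and the 60/20/20 proportions are just an example):

  import numpy as np
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(7)
  X = rng.normal(size=(1000, 5))
  y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=1000)

  # 60% training, 20% validation, 20% test
  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

  # Fit candidate models on the training set, compare and tune them on the
  # validation set, and touch the test set only once for the final evaluation
  print(len(X_train), len(X_val), len(X_test))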

Take-away

  • Be aware of what you are doing
    • How strict am I when fitting and evaluating different models?
  • Be wary of fooling yourself
  • Be transparent:
    • What your workflow was
    • When you looked at the data
    • When and how you split the data
    • What models you considered
    • Your assumptions and hypotheses
    • What your methods are and why you chose them

Coming up next…

Topic suggestions

  • Discussion of your own topics
  • Submit topic suggestions by Friday, April 19th

Next week

  • Discussion of the paper by Hastie et al. (2009)
    • Accessible in Moodle
    • Bird’s-eye view of basic model comparison techniques
    • Do not need to memorize formulas
    • Try to understand and remember the conceptual differences between different methods
  • Think about which topics you would want to take

References

Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and machine learning: Limitations and opportunities. MIT Press.
Bernardo, J. M., & Smith, A. F. (2009). Bayesian theory (Vol. 405). John Wiley & Sons.
Blanchard, T., Lombrozo, T., & Nichols, S. (2018). Bayesian Occam’s razor is a razor of the people. Cognitive Science, 42(4), 1345–1359.
Bürkner, P.-C., Scholz, M., & Radev, S. T. (2023). Some models are useful, but how do we know which ones? Towards a unified Bayesian model taxonomy. Statistics Surveys, 17, 216–310.
Cranmer, K., Brehmer, J., & Louppe, G. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48), 30055–30062.
Dubova, M., Chandramouli, S., Gigerenzer, G., Grünwald, P., Holmes, W., Lombrozo, T., Marelli, M., Musslick, S., Nicenboim, B., Ross, L., Shiffrin, R., White, M., Wagenmakers, E.-J., Bürkner, P.-C., & Sloman, S. J. (2024). Is Occam’s razor losing its edge? New perspectives on the principle of model parsimony. https://doi.org/10.31222/osf.io/bs5xe
Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., Kennedy, L., Gabry, J., Bürkner, P.-C., & Modrák, M. (2020). Bayesian workflow. arXiv. https://arxiv.org/abs/2011.01808
Hastie, T., Tibshirani, R., Friedman, J., Hastie, T., Tibshirani, R., & Friedman, J. (2009). Model assessment and selection. In The elements of statistical learning: data mining, inference, and prediction (pp. 219–259). Springer Series in Statistics. Springer, New York, NY. https://link.springer.com/chapter/10.1007/978-0-387-21606-5_7#preview
Holland, P. W., & Wainer, H. (2012). Differential item functioning. Routledge.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688.
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
Radev, S. T., Mertens, U. K., Voss, A., Ardizzone, L., & Köthe, U. (2020). BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems, 33(4), 1452–1466.
Radev, S. T., Schmitt, M., Pratz, V., Picchini, U., Koethe, U., & Buerkner, P.-C. (2023). JANA: Jointly amortized neural approximation of complex Bayesian models. The 39th Conference on Uncertainty in Artificial Intelligence. https://openreview.net/forum?id=dS3wVICQrU0
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85(2), 59.
Schneider, E. B. (2020). Collider bias in economic history research. Explorations in Economic History, 78, 101356.
Schwab, S., & Held, L. (2020). Different worlds: Confirmatory versus exploratory research. Significance, 17(2), 8–9.
Stanojevic, A. (2023). Algorithmic governance and social vulnerability: A value analysis of equality, freedom and trust. Available at SSRN.
Theisen, M., Lerche, V., Krause, M. von, & Voss, A. (2021). Age differences in diffusion model parameters: A meta-analysis. Psychological Research, 85, 2012–2021.
Vinding, M. C., Lindeløv, J. K., Xiao, Y., Chan, R. C., & Sørensen, T. A. (2018). Volition in prospective memory: Evidence against differences in recalling free and fixed delayed intentions. PsyArXiv. https://doi.org/10.31234/osf.io/hsrbt
Wikipedia. (2024). Dutch childcare benefits scandal. In Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Dutch%20childcare%20benefits%20scandal&oldid=1208442475