Model Comparison Seminar

TU Dortmund, Summer semester 2024

Introduction to Model Comparison

Today

  1. Introduction to Model Comparison
  2. Discussion: Presentation topics
  3. Plan for the next week

Those who missed the first seminar…

  • Please briefly introduce yourself
  • How much experience do you already have with model comparison?
  • What motivated you to take the course?
  • What do you expect to learn?

Basic principles

Generalizability

  • Find a model that generalizes well
    • Minimise loss for future data
    • Minimise loss for observed data, penalised by complexity
  • Example loss functions:
    • Squared loss: \((y-\hat{y})^2\)
    • log predictive density: \(\log p(y \mid M_k)\)
    • AUC
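
A minimal Python sketch (assuming NumPy/SciPy; the data and predictions are made up) of the first two loss functions above, evaluated for predictions from some candidate model:

  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(1)
  y = rng.normal(loc=2.0, scale=1.0, size=100)   # observed outcomes
  y_hat = np.full_like(y, 2.1)                   # point predictions from a candidate model

  # Squared loss, averaged over observations
  squared_loss = np.mean((y - y_hat) ** 2)

  # Log predictive density, here under a normal predictive distribution
  # with mean y_hat and an assumed sd of 1
  lpd = np.sum(norm.logpdf(y, loc=y_hat, scale=1.0))

  print(squared_loss, lpd)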

Bias-Variance decomposition

\[\begin{equation} E\big[(y - \hat{f}(x))^2\big] = \text{Bias}^2\big(\hat{f}(x)\big) + \text{Var}\big(\hat{f}(x)\big) + \sigma^2 \end{equation}\]

  • \(\hat{f}(x)\): model predictions
  • Bias: \(E\big(\hat{f}(x)\big) - f(x)\)
  • Variance: \(E\Big(\big[\hat{f}(x) - E\big(\hat{f}(x)\big)\big]^2\Big)\)
  • \(\sigma^2\): irreducible error
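
The decomposition above can be checked by simulation. A sketch (Python/NumPy; the true function, noise level, and polynomial model are arbitrary choices for illustration):

  import numpy as np

  rng = np.random.default_rng(2)
  f = lambda x: np.sin(2 * np.pi * x)    # true regression function f(x)
  sigma = 0.3                            # sd of the irreducible noise
  x0 = 0.5                               # point at which we evaluate the decomposition
  n, degree, n_sims = 30, 3, 5000

  preds = np.empty(n_sims)
  for s in range(n_sims):
      x = rng.uniform(0, 1, n)
      y = f(x) + rng.normal(0, sigma, n)
      coefs = np.polyfit(x, y, degree)   # fit \hat{f} on a fresh training set
      preds[s] = np.polyval(coefs, x0)   # \hat{f}(x_0)

  bias_sq = (preds.mean() - f(x0)) ** 2
  variance = preds.var()
  y_new = f(x0) + rng.normal(0, sigma, n_sims)        # fresh test observations at x_0
  expected_sq_error = np.mean((y_new - preds) ** 2)   # E[(y - \hat{f}(x_0))^2]

  # The two sides of the decomposition agree up to Monte Carlo error
  print(expected_sq_error, bias_sq + variance + sigma ** 2)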

Bias-Variance and Complexity

[Figure: bias, variance, and total error as a function of model complexity. Image: Bigbossfarin, CC0, via Wikimedia Commons]

Parsimony

  • “Whenever we have equivalent explanations of the observed data, the simplest one is preferable”
    • Two models that fit the same data well \(\rightarrow\) the simpler should be preferred
  • Simple by components vs. simple by constraints
    • Number of parameters, Functional form, Number of effective parameters
  • Prior predictive distribution: more flexible models spread their prior predictive mass over more possible data sets (Bayesian Occam’s razor)
  • Recent developments challenge this view (Dubova et al., 2024)
    • “Complexity and parsimony as complementary principles for scientific discovery”

Why model comparison

  • Uncertainty in what is a good (useful) model
  • Different approaches:
    • What are the goals of the analysis
      • Different views on parsimony, interpretability, importance of out of sample performance, loss functions, …
    • Assumptions
    • Frequentist vs Bayesian, Fast vs slow, Exact vs approximate,…

Why model?

Goals

Modeling goals:

  1. Description
  2. Explanation
  3. Prediction
  4. Control

Description

  • Want to infer about some state of the world
  • Emphasis on observable variables rather than latent constructs
  • Estimating public opinions/attributes from a survey, (micro)census, etc.
  • Capture empirical phenomena, establish relationships between variables
  • Summarise complex data structure in a compact manner

Prediction

  • We want to accurately predict
  • Emphasis on out of sample predictions
    • Finding a model that minimises out of sample prediction error
  • Predicting election results from public polls and surveys
  • Filtering spam email
  • Predicting future sales based on past sales

Explanation

  • We want to explain how things “work”
  • Emphasis on parsimony and fit to current data
  • Finding a model that “explains” the data well but is not overly complex (ideally: interpretable)
  • Explain decreasing decision speed for aging participants

Explanation: Decision speed

  • Explanations:
    1. Older adults have lower rate of information processing than younger adults
    2. Older adults are more cautious than younger adults
    3. Older adults need more time for encoding the stimulus and triggering a response than younger adults

Explanation: Decision speed

Figure from Vinding et al. (2018)

Explanation: Decision speed

  • Compare models (Theisen et al., 2021):
    • Drift rate as a function of age (rate of information processing)
    • Threshold as a function of age (response caution)
    • Non-decision time as a function of age (perception, motor response)
    • Combination

Control

  • We want to have influence over future outcomes
  • Strong emphasis on causality (Pearl, 1995, 2009)
    • Finding a model that allows us to make good predictions under hypothetical scenarios
  • Propose treatments given symptoms and diagnosis
  • Predict effects of interventions (e.g., IPCC)

Combinations

  • Sometimes complementary to the point of unclear separation, e.g.
    • Find a model that is interpretable but also gives accurate predictions
    • Find a model that explains observed phenomena and proposes interventions
  • Sometimes at odds, e.g.
    • Increasing predictive performance can come at the cost of causal validity

Example: Increasing predictions, decreasing causal validity

\[ \begin{aligned} X &\rightarrow Z \leftarrow Y \\ p(Y \mid do(X), do(Z)) &= p(Y) \end{aligned} \]

\[ \begin{aligned} p(Y \mid X = x) = p(Y) \\ p(Y \mid X = x, Z = z) \neq p(Y) \end{aligned} \]
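
A small simulation sketch of the collider above (Python/NumPy; the linear structural equations and effect sizes are made up): adding the collider Z to the predictors makes Y more predictable, but the coefficient on X no longer reflects the (absent) causal effect of X on Y:

  import numpy as np

  rng = np.random.default_rng(3)
  n = 100_000
  x = rng.normal(size=n)
  y = rng.normal(size=n)             # no causal effect of X on Y
  z = x + y + rng.normal(size=n)     # collider: X -> Z <- Y

  # Marginally, X carries no information about Y
  print(np.corrcoef(x, y)[0, 1])     # approximately 0

  # Conditioning on the collider: regress Y on X and Z
  X_design = np.column_stack([np.ones(n), x, z])
  beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
  resid = y - X_design @ beta

  print(beta[1])                     # approx -0.5: spurious "effect" of X
  print(resid.var() / y.var())       # < 1: prediction of Y improved by including Z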

Modeling Goals: Take-away

  • Different goals might dictate which model comparison method we choose
    • Not always clear cut
  • Need to take into account factors that are not always quantified by the model comparison technique itself

Assumptions

What world do we live in?

M-closed world

  • One of the candidate models \(\{M_k\}_{k=1}^K\) is the true model
  • It is possible to place prior probabilities on models \(p(M_k)\)
  • Eventually, we should be able to identify/recover the true model (consistency)
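
In an M-closed setting, the prior model probabilities can be updated to posterior model probabilities. A sketch (Python/NumPy) with made-up marginal likelihoods for three candidate models:

  import numpy as np

  # Hypothetical M-closed setting with three candidate models:
  # prior model probabilities p(M_k) and marginal likelihoods p(y | M_k)
  prior = np.array([1 / 3, 1 / 3, 1 / 3])
  log_marginal_lik = np.array([-105.2, -101.7, -110.4])   # made-up values

  # Posterior model probabilities: p(M_k | y) proportional to p(y | M_k) p(M_k)
  log_post = np.log(prior) + log_marginal_lik
  post = np.exp(log_post - np.max(log_post))
  post /= post.sum()
  print(post.round(3))   # under M-closed, mass concentrates on the true model as data accumulate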

M-complete world

  • None of the candidate models is the true model
  • It is not reasonable to place probabilities on models \(p(M_k)\)
  • It is possible to formulate a belief model in some way
    • Candidate models are used instead (for reasons of tractability, speed, convenience, or limited knowledge)

M-open world

  • None of the candidate models is the true model
  • It is not possible to formulate a belief model
  • Assume exchangeability of current data with future data
    • Train-test, cross-validation, bootstrapping, …
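
A minimal cross-validation sketch relying only on the exchangeability assumption above (Python with NumPy and scikit-learn; the simulated data and linear model are just for illustration):

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import KFold

  rng = np.random.default_rng(4)
  X = rng.normal(size=(200, 3))
  y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=200)

  # 5-fold cross-validation: no candidate model needs to be "true"
  kf = KFold(n_splits=5, shuffle=True, random_state=0)
  losses = []
  for train_idx, test_idx in kf.split(X):
      model = LinearRegression().fit(X[train_idx], y[train_idx])
      pred = model.predict(X[test_idx])
      losses.append(np.mean((y[test_idx] - pred) ** 2))
  print(np.mean(losses))   # estimated out-of-sample squared loss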

Consequences

  • M-closed:
    • One of the models will be selected even if none of them fit the data
    • A strong need to evaluate “absolute” model fit
  • M-complete and M-open:
    • There is a chance none of the models come out as “good”

Important considerations

Model checking

  • Convergence
  • Validity of computations, unit testing
  • Retrodictive checks/Posterior predictive checks:
    • Generate predictions from the fitted model:
      • \(p(y^* \mid y) = \int p(y^* \mid \theta) \cdot p(\theta \mid y) d\theta\)
    • Compare to observed data \(y\)
  • Absolute fit measures:
    • e.g., \(R^2\), AUC, RMSEA, \(\chi^2\), CFI, …
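
A minimal posterior predictive check (Python/NumPy) for a hypothetical conjugate normal model with known sd, approximating the integral above by sampling:

  import numpy as np

  rng = np.random.default_rng(5)
  y = rng.normal(loc=1.0, scale=1.0, size=50)      # observed data

  # Normal model with known sd = 1 and a N(0, 10^2) prior on the mean:
  # the posterior for the mean is normal with the parameters below
  post_var = 1 / (1 / 10**2 + len(y) / 1**2)
  post_mean = post_var * y.sum()

  # Draw theta ~ p(theta | y), then y* ~ p(y* | theta); this approximates the integral
  theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=1000)
  y_rep = rng.normal(loc=theta_draws[:, None], scale=1.0, size=(1000, len(y)))

  # Compare a test statistic of the replicated data to the observed data
  t_obs = y.max()
  t_rep = y_rep.max(axis=1)
  print(np.mean(t_rep >= t_obs))   # posterior predictive p-value for max(y)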

Fairness

  • Assess whether model-based decisions are equitable (Barocas et al., 2023; Bürkner et al., 2023)
    • Especially considering attributes such as sex, gender, ethnic background
  • When it goes wrong \(\rightarrow\) potentially catastrophic consequences (Stanojevic, 2023; e.g., “the Dutch childcare benefits scandal,” Wikipedia, 2024)
  • Sometimes possible to test empirically (e.g., differential item functioning, Holland & Wainer, 2012)
  • Reducing bias in data sets
  • Removing protected attributes from model-based decisions

Robustness

Examples:

  • Prior specification
  • Likelihood specification (e.g., a different distribution or link function in a GLM)
  • Data perturbations, outliers, …

Speed

  • “Fast enough” depends on context
  • Simplify the models
  • Derive closed form solutions
  • Approximations (e.g., AIC instead of cross-validation, BIC instead of Bayes factors)
  • Amortized methods (Radev et al., 2020, 2023)
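
As one example of such an approximation, AIC can be computed in closed form from the maximised log-likelihood instead of running cross-validation. A sketch (Python with NumPy/SciPy) for a simple normal model with made-up data:

  import numpy as np
  from scipy.stats import norm

  rng = np.random.default_rng(6)
  y = rng.normal(loc=0.3, scale=1.0, size=100)

  # AIC = -2 * log-likelihood at the MLE + 2 * (number of parameters);
  # here a normal model with unknown mean and sd (k = 2)
  mu_hat, sd_hat = y.mean(), y.std()
  loglik = norm.logpdf(y, loc=mu_hat, scale=sd_hat).sum()
  aic = -2 * loglik + 2 * 2
  print(aic)   # a fast stand-in for cross-validated predictive loss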

Summary

  • Model comparison is not the be-all and end-all
    • Considerations external to model comparison itself
  • Models need to be considered in their own right
  • Important:
    • Context
    • Goals

Workflows

Modeling Workflows

Exploratory

  • We have already obtained the data and have no hypotheses, or only vague ideas, about a good model
  • Use the data to build a model that fits the data well
  • Often informal use of different model comparison techniques, weak guarantees
  • Outcome: A set of possible models for confirmatory analysis

Confirmatory

  • We already have hypotheses or specific models that we want to “test”
  • We collect data designed to distinguish between competing models
  • Inflexible, strong guarantees (if done right)
  • Often comes with “pre-registration” protocols
  • Outcome: Did any of the models fit well? Which model was the “best”?

Combined

  1. Hypotheses \(\rightarrow\) Models
  2. Collect data
  3. Evaluate hypotheses (confirmatory)
  4. Explore data, fit more models, generate new hypotheses (exploratory)
  5. Go to step 1

Iterative (Gelman et al., 2020)

  • Some idea of the modeling approach
  • Stricter workflow than exploratory, but weaker guarantees than confirmatory
  • Iterative refinement of the model on the same data
    • Evaluate the model before fitting it to the data
    • Validate computations
    • Assess the model after fitting it to the data
    • Compare (qualitatively and quantitatively) to previous iterations
    • Address computational issues
    • Justify potential modifications to the model
    • Consider additional sources of information

Splitting

  1. Training set:
    • Fit models
  2. Validation set:
    • Evaluate model fit
    • Formulate new models, hyperparameter tuning
  3. Test set:
    • Unbiased final evaluation of models
  • More flexibility, strong guarantees
  • Data intensive
  • (Potentially) computationally intensive
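
A minimal sketch of a three-way split (Python with NumPy and scikit-learn; the simulated data and the 60/20/20 proportions are just an example):

  import numpy as np
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(7)
  X = rng.normal(size=(1000, 5))
  y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=1000)

  # 60% training, 20% validation, 20% test
  X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

  # Fit candidate models on the training set, compare and tune them on the
  # validation set, and touch the test set only once for the final evaluation
  print(len(X_train), len(X_val), len(X_test))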

Take-away

  • Be aware of what you are doing
    • How strict am I when fitting and evaluating different models?
  • Be wary of fooling yourself
  • Be transparent:
    • What your workflow was
    • When you looked at the data
    • When and how you split the data
    • What models you considered
    • Your assumptions and hypotheses
    • What your methods are and why you chose them

Coming up next…

Topic suggestions

  • Discussion of your own topics
  • Submit topic suggestions by Friday, April 19th

Next week

  • Discussion of the paper by Hastie et al. (2009)
    • Accessible in Moodle
    • Bird’s-eye view of basic model comparison techniques
    • Do not need to memorize formulas
    • Try to understand and remember the conceptual differences between different methods
  • Think about which topics you would want to take

References

Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and machine learning: Limitations and opportunities. MIT Press.
Bernardo, J. M., & Smith, A. F. (2009). Bayesian theory (Vol. 405). John Wiley & Sons.
Blanchard, T., Lombrozo, T., & Nichols, S. (2018). Bayesian Occam’s razor is a razor of the people. Cognitive Science, 42(4), 1345–1359.
Bürkner, P.-C., Scholz, M., & Radev, S. T. (2023). Some models are useful, but how do we know which ones? Towards a unified Bayesian model taxonomy. Statistics Surveys, 17, 216–310.
Cranmer, K., Brehmer, J., & Louppe, G. (2020). The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48), 30055–30062.
Dubova, M., Chandramouli, S., Gigerenzer, G., Grünwald, P., Holmes, W., Lombrozo, T., Marelli, M., Musslick, S., Nicenboim, B., Ross, L., Shiffrin, R., White, M., Wagenmakers, E.-J., Bürkner, P.-C., & Sloman, S. J. (2024). Is Occam’s razor losing its edge? New perspectives on the principle of model parsimony. https://doi.org/10.31222/osf.io/bs5xe
Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., Kennedy, L., Gabry, J., Bürkner, P.-C., & Modrák, M. (2020). Bayesian workflow. arXiv. https://arxiv.org/abs/2011.01808
Hastie, T., Tibshirani, R., Friedman, J., Hastie, T., Tibshirani, R., & Friedman, J. (2009). Model assessment and selection. In The elements of statistical learning: data mining, inference, and prediction (pp. 219–259). Springer Series in Statistics. Springer, New York, NY. https://link.springer.com/chapter/10.1007/978-0-387-21606-5_7#preview
Holland, P. W., & Wainer, H. (2012). Differential item functioning. Routledge.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4), 669–688.
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
Radev, S. T., Mertens, U. K., Voss, A., Ardizzone, L., & Köthe, U. (2020). BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems, 33(4), 1452–1466.
Radev, S. T., Schmitt, M., Pratz, V., Picchini, U., Koethe, U., & Buerkner, P.-C. (2023). JANA: Jointly amortized neural approximation of complex Bayesian models. The 39th Conference on Uncertainty in Artificial Intelligence. https://openreview.net/forum?id=dS3wVICQrU0
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85(2), 59.
Schneider, E. B. (2020). Collider bias in economic history research. Explorations in Economic History, 78, 101356.
Schwab, S., & Held, L. (2020). Different worlds: Confirmatory versus exploratory research. Significance, 17(2), 8–9.
Stanojevic, A. (2023). Algorithmic governance and social vulnerability: A value analysis of equality, freedom and trust. Available at SSRN.
Theisen, M., Lerche, V., Krause, M. von, & Voss, A. (2021). Age differences in diffusion model parameters: A meta-analysis. Psychological Research, 85, 2012–2021.
Vinding, M. C., Lindeløv, J. K., Xiao, Y., Chan, R. C., & Sørensen, T. A. (2018). Volition in prospective memory: Evidence against differences in recalling free and fixed delayed intentions. PsyArXiv. https://doi.org/10.31234/osf.io/hsrbt
Wikipedia. (2024). Dutch childcare benefits scandal. In Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Dutch%20childcare%20benefits%20scandal&oldid=1208442475