Bayesian Hypothesis Testing and Model Averaged Marginal Effects for the Social Sciences
R
Bayesian Statistics
Statistics
Marginal Effects
Stan
Logit
Model Averaging
This post provides an introduction to Bayesian Model Averaging and Model Averaged Marginal Effects with Stan, brms, and marginaleffects
Author
A. Jordan Nafa
Published
December 31, 2022
Introduction
George E. P. Box famously described the problem of model development in statistics and related fields by noting that "all models are wrong, but some are useful" (1976). That is, all models are necessarily simplifications of a much more complex reality, and statistics is by no means an algorithmic truth-generating process because no model can ever be true in the pure sense of the word. Yet this at times underappreciated reality does not render all models useless; rather, it implies that the goal of those who practice statistics in its various forms is to develop and identify models that are useful in a never-ending quest to be less wrong. In this post I introduce an approach to this task that has been the subject of my own dissertation work: the use of Bayesian Model Averaging (BMA) to account for uncertainty in model specification when estimating average marginal effects in non-linear regression models.
I outline here what I term a Bayesian Model Averaged Marginal Effect (BMAME) in the context of logistic regression models in political science and illustrate how to obtain BMAMEs in a straightforward manner using the marginaleffects R package (Arel-Bundock 2022), which, thanks to Vincent Arel-Bundock's assistance, supports model averaged or stacked average marginal effects for any model fit with the Stan interface brms (Bürkner 2017, 2018). I also provide a more general demonstration in Stan that makes the method easily applicable in Python and in high-dimensional settings. Although I focus here on models that include only population-level effects, recent feature additions to marginaleffects now make it possible to obtain BMAMEs from more complex hierarchical models and I will cover such applications in a subsequent post.
In minor contrast to the past advice of Montgomery and Nyhan (2010), who suggest "BMA is best used as a subsequent robustness check to show that our inferences are not overly sensitive to plausible variations in model specification" (266), I argue here that model averaging can and should be used as far more than a robustness check, if for no other reason than that a model-wise mixture distribution for the parameter of interest is almost certainly more informative than a point estimate from a single model specification. Indeed, BMA provides one of the few natural and intuitive ways of resolving some of the issues outlined in the preceding paragraph, particularly as it pertains to the evaluation of non-nested theoretical models, while also exhibiting a lower false positive rate than alternative approaches (Plümper and Traunmüller 2018). This section provides a brief explanation of BMA and its connection to model averaged marginal effects before proceeding to a discussion of how the procedure is implemented.
Bayesian Model Comparison and Model Averaging
Consider a simple case in which we have a set of \(\mathcal{M}\) models, each of which characterizes a possible representation of some unknown data generation process. Although it remains common practice in applied political science to select a single model based on a largely arbitrary fit statistic (i.e., AIC, \(\chi^{2}\), or whatever people are getting a p-value from this week), such an approach is problematic since it ignores the uncertainty inherent in the process of model specification. Rather than selecting a single model and consequently placing an implicit zero prior on every other possible model specification you could have reasonably considered but did not, Bayesian Model Averaging and its analogues provide a means of averaging across several possible, potentially non-nested, model specifications and accounting for the uncertainty associated with doing so.
Imagine we wish to adjudicate between two competing theoretical models in the set of candidate models \(\mathcal{M}\). The posterior odds of model \(i\) relative to an alternative \(k\) can be expressed as

$$\underbrace{\frac{\Pr(\mathcal{M}_{i} \,|\, y)}{\Pr(\mathcal{M}_{k} \,|\, y)}}_{\text{Posterior Odds}} = \underbrace{\frac{p(y \,|\, \mathcal{M}_{i})}{p(y \,|\, \mathcal{M}_{k})}}_{\text{Bayes Factor}} \times \underbrace{\frac{\pi(\mathcal{M}_{i})}{\pi(\mathcal{M}_{k})}}_{\text{Prior Odds}} \tag{1}$$
where \(\pi(\mathcal{M})\) is the prior probability of each model and \(p(y \,|\, \mathcal{M})\) is the marginal likelihood of the observed data under each of the candidate models such that

$$p(y \,|\, \mathcal{M}) = \int p(y \,|\, \theta, \mathcal{M}) \, \pi(\theta \,|\, \mathcal{M}) \, d\theta \tag{2}$$
Given our prior assumptions about how likely the observed data are to have arisen from each model in the set \(\mathcal{M}\), the result of equation 1 is the posterior odds of model \(i\) compared to model \(k\) and thus captures the relative probability that model \(\mathcal{M_{i}}\) represents the true data generation process compared to the plausible alternative \(\mathcal{M_{k}}\). We can easily extend this example to the setting in which \(\mathcal{M_{k}}\) is a set of plausible competing models of size \(k > 2\), in which case the posterior model probability of the \(i^{th}\) model relative to each of the alternatives \(k\) is

$$\Pr(\mathcal{M}_{i} \,|\, y) = \frac{p(y \,|\, \mathcal{M}_{i}) \, \pi(\mathcal{M}_{i})}{\sum_{k=1}^{m} p(y \,|\, \mathcal{M}_{k}) \, \pi(\mathcal{M}_{k})} \tag{3}$$
Applying equation 3 to each model in the set under consideration yields a vector of relative posterior model probabilities of length \(m\) which may be used as either posterior probability weights as in the case of Bayesian Model Averaging or for the estimation of posterior inclusion probabilities.
In the context of model averaging, we can take draws from the posterior predictive distribution of each model containing a given parameter of interest proportional to its posterior probability weight obtained from equation 3. This yields a model-wise mixture representing a weighted average of the posterior predictive distribution for some parameter such as a regression coefficient for a linear model or as I outline below, an average marginal effect or probability contrast for models that employ a non-linear link function.
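To make the weighting step concrete, the following minimal sketch converts log marginal likelihoods into posterior model probabilities under uniform model priors and then samples from each model's posterior proportional to its weight. The log marginal likelihood values and draws are made up purely for illustration, and all object names are hypothetical.

```r
# Hypothetical log marginal likelihoods for three candidate models
log_marglik <- c(-45.2, -46.8, -44.9)

# Simulated posterior draws for the parameter of interest under each model
draws <- list(
  rnorm(4000, 0.20, 0.05),
  rnorm(4000, 0.15, 0.06),
  rnorm(4000, 0.22, 0.05)
)

# Equation 3 with uniform model priors, computed on the log scale for stability
post_probs <- exp(log_marglik - max(log_marglik))
post_probs <- post_probs / sum(post_probs)

# Sample draws from each model's posterior proportional to its weight
n_per_model <- round(post_probs * 4000)
bma_mixture <- unlist(Map(sample, draws, n_per_model))

# The model-wise mixture distribution for the parameter
quantile(bma_mixture, c(0.055, 0.5, 0.945))
```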
Bayesian Model Averaged Marginal Effects
Average marginal effects are ubiquitous in the social sciences though, as Andrew Heiss nicely illustrates, the term average marginal effect is often used ambiguously, and in the context of more complex models the terms marginal and conditional tend to be a source of additional confusion. In the case of a simple population-level linear model with a Gaussian likelihood and identity link function, the model averaged marginal effect is equivalent to the posterior distribution of the model averaged coefficient \(\beta\), which can be expressed as

$$\mathbb{E}(\beta \,|\, y) = \sum_{i=1}^{m} \Pr(\mathcal{M}_{i} \,|\, y) \, \mathbb{E}(\beta \,|\, \mathcal{M}_{i}, \, y) \tag{4}$$
where \(\Pr(\mathcal{M}_{i} ~ | ~ y)\) represents the posterior model probability and \(\mathbb{E}(\beta ~|~ \mathcal{M}_{i}, ~ y)\) is the expected value of some parameter \(\beta\) conditional on the observed data and some model specification for each model \(i\) in the set of models under consideration \(m\) (Montgomery and Nyhan 2010).
For probability models such as logit and other commonly used formulations for modeling discrete responses, this simple closed-form solution does not exist, nor is there any meaningful way to interpret the coefficients of these models as effect size estimates–a reality that unfortunately remains quite widely misunderstood (Daniel, Zhang, and Farewell 2020; Mood 2009). As a general principle, any model more complex than a simple linear regression can only be meaningfully interpreted in terms of the predictions it generates (Long and Mustillo 2018), which in the case of logistic regression simply means applying the inverse logistic function to obtain predicted probabilities which we can then use to estimate contrasts or average marginal effects (Norton, Dowd, and Maciejewski 2019). Letting \(\mathrm{Y}\) represent a binary outcome, \(\mathrm{X}\) a continuous exposure or treatment of interest, and \(\mathrm{Z}\) a matrix of measured confounders, we can express the average marginal effect as

$$\text{AME}_{\mathrm{X}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\Pr(\mathrm{Y}_{i} = 1 \,|\, \mathrm{X}_{i} + h, \, \mathrm{Z}_{i}) - \Pr(\mathrm{Y}_{i} = 1 \,|\, \mathrm{X}_{i}, \, \mathrm{Z}_{i})}{h} \tag{5}$$
which, provided a sufficiently small value of \(h\), yields a reasonable approximation of the partial derivative and provides a posterior distribution for the average effect of an instantaneous change in \(\mathrm{X}\) on the probability scale similar to what one obtains in linear regression.
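As a brief sketch of what equation 5 looks like in practice with brms, one can average the finite difference in posterior expected probabilities within each draw. The fitted object `fit`, the data frame `dat`, and the predictor `X` below are hypothetical names, not objects from this post.

```r
# Step size for the finite difference approximation
h <- 1e-4

# Posterior draws of Pr(Y = 1 | X, Z) at the observed data and at X + h
p_obs <- brms::posterior_epred(fit, newdata = dat)
p_nudged <- brms::posterior_epred(fit, newdata = transform(dat, X = X + h))

# Average the finite difference over observations within each posterior draw
ame_draws <- rowMeans((p_nudged - p_obs) / h)

# Posterior median and 89% credible interval for the AME
quantile(ame_draws, c(0.055, 0.5, 0.945))
```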
Similarly, for a binary treatment of interest we apply the Bayesian g-formula to obtain the average marginal effect of \(\mathrm{X}\) on the probability of \(\mathrm{Y}\) as shown in equation 6.

$$\text{AME}_{\mathrm{X}} = \frac{1}{n}\sum_{i=1}^{n} \left[\Pr(\mathrm{Y}_{i} = 1 \,|\, \mathrm{X} = 1, \, \mathrm{Z}_{i}) - \Pr(\mathrm{Y}_{i} = 1 \,|\, \mathrm{X} = 0, \, \mathrm{Z}_{i})\right] \tag{6}$$
Equations 5 and 6 make clear that we are averaging over the confounder distribution \(\mathrm{Z}\) rather than holding it constant at some fixed value (Keil et al. 2017; Oganisian and Roy 2020). This point is an important one because in binary outcome models such as logit and probit, holding \(\mathrm{Z}\) constant at some fixed value such as the mean corresponds to an entirely different estimand (Hanmer and Kalkan 2012). The AME is a population-averaged quantity corresponding to the population average treatment effect, and these two estimands can produce very different results under fairly general conditions because they do not answer the same question.
From here it is straightforward to obtain a model averaged marginal effect estimate for a discrete outcome model such as logit. Given marginal predictions for each model \(k\) in the set \(m\), we can apply equation 7 to obtain the posterior distribution of the model averaged marginal effect.

$$\text{AME}_{\text{BMA}} = \sum_{k=1}^{m} \Pr(\mathcal{M}_{k} \,|\, y) \cdot \text{AME}_{k} \tag{7}$$
This yields a model-wise mixture distribution representing the weighted average of the marginal effects estimates and has broad applicability to scenarios common in the social sciences, industry, and any other setting where a need arises to account for uncertainty in model specification.
Limitations and Caveats
While traditional marginal likelihood-based Bayesian model averaging is a powerful and principled technique for dealing with uncertainty in the process of model specification and selection, it is not without limitations and caveats of its own, some of which bear emphasizing before proceeding further.
Marginal Likelihood and Computational Uncertainty
First, for all but the simplest of models attempting to derive the integral in equation 2 analytically proves to be computationally intractable and we must instead rely on algorithmic approximations such as bridge sampling (Gelman and Meng 1998; Gronau et al. 2017; Wang, Jones, and Meng 2020). While the bridge sampling approximation tends to perform well compared to common alternatives, estimates of the marginal likelihood \(p(y \,|\, \mathcal{M})\) may be highly variable upon repeated runs of the algorithm (Schad et al. 2022). This problem tends to arise in more complex hierarchical models with multiple varying effects and, perhaps unsurprisingly, suggests that by relying on approximations we introduce an additional source of computational uncertainty that may need to be accounted for.
For the purposes of BMAMEs, I have elsewhere proposed addressing this additional source of uncertainty and incorporating it into the estimates by executing the bridge sampling algorithm \(S\) times for each model to obtain a distribution of possible values for the approximate log marginal likelihood. Given a vector of estimates for \(p(y ~|~ \mathcal{M})\) of length \(s\) for each model \(k \in \{1,2,\dots, m\}\), we can apply equation 3 to obtain an \(s \times m\) matrix of posterior model probabilities. It is then straightforward to extend equation 7 to average over the distribution of posterior probability weights rather than relying on a single point estimate, as shown in equation 8 below.2

$$\text{AME}_{\text{BMA}} = \frac{1}{s}\sum_{j=1}^{s}\sum_{k=1}^{m} \Pr(\mathcal{M}_{k} \,|\, y)_{j} \cdot \text{AME}_{k} \tag{8}$$
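A minimal sketch of this weighting scheme is shown below, assuming `log_ml` is an \(s \times m\) matrix of bridge sampling log marginal likelihood estimates (the name is hypothetical; \(s\) repetitions for each of \(m\) models).

```r
# Convert each repetition's log marginal likelihoods into posterior model
# probabilities (equation 3 with uniform model priors, log scale for stability)
post_prob_by_rep <- t(apply(log_ml, 1, function(lml) {
  w <- exp(lml - max(lml))
  w / sum(w)
}))

# One row of weights per bridge sampling repetition; averaging over the rows
# collapses equation 8 back to the single-weight case in equation 7
avg_weights <- colMeans(post_prob_by_rep)
```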
Whether this approach is necessary depends largely on how variable the algorithm is on repeated runs and it may simply be adequate to instead take the median or mean of the marginal likelihood estimates for each model. Although it is beyond the scope of this post, one could also use exhaustive leave-future-out cross-validation which is equivalent to the marginal likelihood given a logarithmic scoring rule (Bürkner, Gabry, and Vehtari 2020; Fong and Holmes 2020), though this approach tends to be extremely expensive in terms of computation.
Bayes Factors and Prior Sensitivity
The second caveat with respect to obtaining BMAMEs stems from a common criticism of Bayes Factors. Although as the amount of available data \(n \longrightarrow \infty\), the likelihood will tend to dominate the prior in Bayesian estimation, this is not necessarily the case in terms of inference. Indeed, from equations 2 and 3 one can quite clearly see that the posterior probability \(\Pr(\mathcal{M} ~| ~y)\) depends on the prior probability of the parameters, \(\pi(\theta \,|\, \mathcal{M})\) in equation 2, and the respective prior probability \(\pi(\mathcal{M})\) of each model in the set under consideration. Since our substantive conclusions may be sensitive to either of these prior specifications, these assumptions should be checked to verify the results do not vary wildly under reasonable alternatives.
Perhaps more importantly, due to their dependence on Bayes Factors, posterior probability weights require at least moderately informative priors on each and every parameter in a model, and it is by this point common knowledge that specifying a flat or "uninformative" prior such as \(\beta \sim \mathrm{Uniform(-\infty, \infty)}\), or even an excessively vague one, tends to bias Bayes Factors–and by extension values that depend on them–violently in favor of the null model.3 This is a feature of applied Bayesian inference rather than a bug: if you make stupid assumptions, you will arrive at stupid conclusions. The solution to this problem is to think carefully about the universe of possible effect sizes you might observe for a given parameter, assign reasonable priors that constrain the parameter space, and verify the results are robust to alternative distributional assumptions.
Likelihoods and Priors
The degree to which a prior can be considered informative is heavily dependent on the likelihood and can often only be understood in that context (Gelman, Simpson, and Betancourt 2017).
Stacking and the Open-\(\mathcal{M}\) Problem
Finally, traditional BMA is flawed in the sense that its validity rests upon the closed \(\mathcal{M}\) assumption implicit in equation 3. In the open \(\mathcal{M}\) setting where the true model is not among those in the set of candidate models under consideration, as will generally be the case in any real world application, estimated effects based on posterior probability weights are likely to be biased (Hollenbach and Montgomery 2020; Yao et al. 2018). This problem arises from the fact that marginal-likelihood based posterior probability weights for each model \(m \in \mathcal{M}\) correspond to the probability a given model captures the true data generation process and in most settings this is unlikely to ever be the case since, by definition, all models are wrong.
In the open \(\mathcal{M}\) setting, traditional BMA will still assign each model some probability but since by definition \(\sum_{i=1}^{m}\Pr(\mathcal{M}_{i} ~|~ y) = 1\), these weights no longer have a valid interpretation in terms of the posterior probability a given model is true. Although this does not render marginal likelihood based model averaging useless and it may often make more sense to think about model specification and selection as a probabilistic process aimed at identifying the most likely model conditional on the set of models under consideration (Hinne et al. 2020), it is advisable to assess whether our substantive conclusions are sensitive to potential violations of the closed \(\mathcal{M}\) assumption. This once again underscores the fact that statistics is not a mechanistic truth generating process and there is no magic. It also underscores the need to think in more local terms–rather than asking is this model true we generally want to know the probability \(\mathrm{X}\) belongs to the true data generation process.4
An alternative but still properly Bayesian approach aimed at addressing these potential flaws in traditional BMA is cross-validation based stacking (Yao et al. 2018). In contrast to traditional BMA, stacking and pseudo-BMA weights are constructed based on the relative expectation of the posterior predictive density (Hollenbach and Montgomery 2020; Yao et al. 2018). These approaches to estimating model weights do not rely upon the closed \(\mathcal{M}\) assumption, and thus provide a way of either avoiding it altogether or relaxing it as a robustness check on the posterior probability weights, albeit while answering a fundamentally different question more appropriate for an open \(\mathcal{M}\) world. Since the overall approach to calculating model averaged AMEs does not change and we are merely substituting the posterior probability weights in equation 7 with stacking or pseudo-BMA weights, I direct readers to Yao et al. (2018) for a formal discussion of these approaches.
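For reference, a sketch of how one might obtain stacking or pseudo-BMA weights with the loo package is shown below, assuming the candidate models are brmsfit objects with PSIS-LOO criteria already added via add_criterion() as in the application later in this post.

```r
# Extract the PSIS-LOO objects stored by add_criterion()
loo_list <- lapply(bayes_model_fits, function(m) m$criteria$loo)

# Stacking weights based on the expected log pointwise predictive density
stacking_weights <- loo::loo_model_weights(loo_list, method = "stacking")

# Pseudo-BMA+ weights with the Bayesian bootstrap
pbma_weights <- loo::loo_model_weights(loo_list, method = "pseudobma", BB = TRUE)
```

Either vector can then be substituted for the posterior probability weights in equation 7 without changing anything else about the procedure.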
Female Combatants and Post-Conflict Rebel Parties
To illustrate the utility of BMAMEs in political science, I draw on an example adapted from the second chapter of my dissertation in which I examine how the active participation of women in rebel groups during wartime shapes the political calculus of former combatant groups at war’s end. In short, I expect rebel groups where women participate in combat roles during wartime to be more likely to form political parties and participate in post-conflict elections.
Data
To identify the universe of post-conflict successor parties to former combatant groups I rely on the Civil War Successor Party (CWSP) dataset, which includes all civil wars in the UCDP Armed Conflicts Dataset that terminated between 1970 and 2015 (Daly 2020). Since the primary distinction between political parties and other civil society organizations such as interest groups lies in the former's participation in elections and fielding of candidates for political office, and a party that does not compete in elections is no party at all, I operationalize the dependent variable, Rebel to Party Transition, as a binary event which takes a value of 1 if a rebel group both forms a political party and directly participates in a country's first post-conflict elections and 0 otherwise. The resulting sample consists of 112 observations for 108 unique rebel groups across 56 countries, approximately 66% of which formed parties and participated in their country's first post-conflict election.
Following Wood and Thomas (2017), I define female combatants as members of a rebel group who are “armed and participate in organized combat activities on behalf of a rebel organization” (38). I code instances of female participation in rebel groups by mapping version 1.4 of the Women in Armed Rebellion Data (WARD) to the CWSP and construct two versions of the main predictor of interest based on the best estimate for the presence of female combatants in a given rebel group, one of which includes cases in which women served only as suicide bombers while the other codes these groups as not having any women in combat positions. Each of these variables takes a value of 1 if women participated in combat in a given rebel group and 0 otherwise. Among the rebel groups in our analysis, female combatants have been active in some capacity in 52.67% of those who were involved in civil wars that have terminated since 1970 and 50.89% featured women in non-suicide bombing roles.
I also adjust for several additional factors that may otherwise confound the estimate of the effect of female participation in combat on rebel successor party transitions. First, insurgent groups that attempt to perform functions typically characteristic of the state during wartime may be more likely to transition to political parties following the end of a conflict since, in doing so, they both develop organizational capacity and, if successful, demonstrate to their future base of support that they are capable of governing. Moreover, as rebel groups build organizational capacity and accumulate institutional knowledge, their ability to sustain a successful insurgency and recruit combatants grows as well. To adjust for rebel organizational capacity I specify a Bayesian multidimensional item response model to construct a latent scale capturing the degree to which rebel groups engaged in functions of the state during wartime based on twenty-six items from the Quasi-State Institutions dataset (Albert 2022).
Second, I adjust for baseline characteristics of each rebel group that previous research suggests influence both where women rebel and post-conflict party transitions. Drawing on data from Braithwaite and Cunningham (2019), I construct a series of dichotomous indicator variables for whether a rebel group is affiliated with an existing political party (Manning and Smith 2016), nationalist and left-wing or communist ideological origins (Wood and Thomas 2017), and whether a group has ties to a specific ethnic community, since this may provide advantages in terms of mobilizing combatants and voters along specific cleavages. I also construct two additional indicators to capture whether the United Nations intervened in the conflict and whether a given rebel group was located in Africa.
While it is, of course, impossible in the context of an observational setting such as this one to isolate every possible source of confounding, nor is it generally advisable to attempt to do so, I focus on these baseline covariates since existing research provides a strong theoretical basis to suspect they impact the relationship of interest.5 For the purposes of this illustration, I assume this set of covariates is sufficient to satisfy strong ignorability but such an assumption is likely unreasonable and the particular analysis here may ultimately best be taken as descriptive in nature.
```r
# Load the required libraries
pacman::p_load(
  "tidyverse", "arrow", "data.table", "brms", "furrr", "cmdstanr",
  "patchwork", "ggdist", "marginaleffects",
  install = FALSE
)

# Read in the pre-processed data
model_df <- read_parquet("data/model_data.gz.parquet")
```
Model and Priors
To estimate the effect of women's participation in combat on post-conflict rebel party transitions for this example, I specify a simple logit model as shown in equation 9. The Bernoulli probability \(\mu\) of a rebel group transitioning to a political party at war's end is modeled as a function of the population-level intercept \(\alpha\) and a vector of coefficients \(\beta_{k}\) corresponding to the matrix of inputs \(X_{n}\).

$$\begin{aligned}
y_{n} &\sim \mathrm{Bernoulli}(\mu_{n}) \\
\mu_{n} &= \mathrm{logit}^{-1}(\alpha + \mathrm{X}_{n}\beta_{k}) \\
\alpha &\sim \mathrm{Normal}(0, \, 1) \\
\beta_{k} &\sim \mathrm{Normal}(0, \, 0.75)
\end{aligned} \tag{9}$$
Since we model \(\mu\) via a logit link function, our priors on the intercept \(\alpha\) and coefficients \(\beta_{k}\) should correspond to the expected change in the log odds of the response for each predictor. Some may find the process of prior elicitation in logistic regression models difficult since reasoning on the logit scale may seem unintuitive. As figure 1 illustrates, the log odds of the response for models that employ a logit link function is effectively bounded between approximately \(-6\) and \(6\). Values in excess of \(|5|\) generally should not occur unless your data exhibits near complete or quasi-complete separation.
```r
## Simulate logit probabilities
logit_scale <- tibble(
  x = seq(-6.5, 6.5, 0.0001),
  y = brms::inv_logit_scaled(x)
)

## Plot the logit scale
ggplot(logit_scale, aes(x = x, y = y)) +
  ## Add a line geom
  geom_line(size = 3, color = "firebrick") +
  ## Add labels
  labs(x = "Log Odds", y = "Probability") +
  ## Adjust the x axis scale breaks
  scale_x_continuous(breaks = seq(-6, 6, 1)) +
  ## Adjust the y axis scale breaks
  scale_y_continuous(breaks = seq(0, 1, 0.1))
```
Framed in terms of effect sizes, a change of \(\pm 3\) in the log odds of the response relative to the baseline spans half the logit scale in either direction. In the social sciences, where data is largely random noise and effect sizes are typically small to moderate at best, and while the road to hell is paved with rules of thumb, we can generally assume a priori that (1) any true effect of a given parameter \(\beta\) in the population is likely modest in magnitude and (2) values closer to zero are more likely than those further away. From these relatively straightforward assumptions, we can arrive at a prior for the coefficients that follows a normal distribution with mean 0 and scale \(0.75\) as shown in equation 9. In practical terms, this assumes the universe of possible effect sizes for each predictor is normally distributed in the range of approximately \(-1.5\) and \(1.5\) on the logit scale.
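As a quick numeric check of what this prior implies, we can look at the central 95% interval of the \(\text{Normal}(0,~0.75)\) distribution directly:

```r
# Roughly 95% of the Normal(0, 0.75) prior mass lies within about +/- 1.47
# on the logit scale
round(qnorm(c(0.025, 0.975), mean = 0, sd = 0.75), 2)
#> [1] -1.47  1.47
```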
For the population-level intercept \(\alpha\) I assign a slightly more diffuse normal prior with mean 0 and scale 1, or \(\alpha \sim \text{Normal}(0,~1)\), which reflects a slightly greater degree of uncertainty about the baseline probability but is still moderately informative and constrains the bulk of the prior density in the range of realistic values. It is important to note here that, as shown in the Stan model below, we are specifying a prior on the intercept for the centered inputs rather than on the original scale. To more explicitly communicate our prior assumptions, we can use the tools for visualizing uncertainty provided in the ggdist package as shown in figure 2 below (Kay 2022).
```r
# Simulate the prior probabilities
logit_priors <- tibble(
  alpha_prior = rnorm(10e5, 0, 1),
  beta_prior = rnorm(10e5, 0, 0.75)
) %>%
  ## Pivot the simulated data to long form
  pivot_longer(everything())

# Set parameter names for the facets ----
logit_prior_labels <- as_labeller(
  x = c(
    "beta_prior" = "bolditalic(beta[k] * phantom(.) * phantom() %~% phantom() * phantom(.) * Normal(0*list(,) * phantom(.) * 0.75))",
    "alpha_prior" = "bolditalic(alpha[k] * phantom(.) * phantom() %~% phantom() * phantom(.) * Normal(0*list(,) * phantom(.) * 1))"
  ),
  default = label_parsed
)

# Plot the prior distributions
logit_prior_plots <- logit_priors %>%
  # Initiate the ggplot object
  ggplot(aes(x = value, fill = name, shape = name)) +
  # Facet by the prior distributions
  facet_wrap(~name, scales = "free", labeller = logit_prior_labels) +
  # Add a half eye geom for the prior distributions
  stat_halfeye(
    aes(slab_alpha = stat(pdf)),
    .width = c(.68, .95),
    fill_type = "gradient",
    show.legend = FALSE
  ) +
  # Manually set the fill colors
  scale_fill_manual(values = c("#208820", "#0060B0")) +
  # Manually set the shapes
  scale_shape_manual(values = c(22, 24)) +
  # Adjust the breaks on the x axis
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8)) +
  # Add labels to the plot
  labs(
    y = "Prior Probability Density",
    x = latex2exp::TeX(r'(Log $\Pr(\theta_{prior})$)', bold = T)
  )

# Plot back transformed probabilities
prob_trans_plots <- logit_priors %>%
  # Initiate the ggplot object
  ggplot(aes(fill = name, shape = name)) +
  # Facet by the prior distributions
  facet_wrap(~name, scales = "free", labeller = logit_prior_labels) +
  # Add a half eye geom for the prior distributions
  stat_halfeye(
    aes(x = plogis(value), slab_alpha = stat(pdf)),
    .width = c(.68, .95),
    fill_type = "gradient",
    show.legend = FALSE
  ) +
  # Manually set the fill colors
  scale_fill_manual(values = c("#208820", "#0060B0")) +
  # Manually set the shapes
  scale_shape_manual(values = c(22, 24)) +
  # Add labels to the plot
  labs(
    y = "Prior Probability Density",
    title = "",
    x = latex2exp::TeX(r'($\Pr(\theta_{prior})$)', bold = TRUE)
  ) +
  # Adjust the breaks on the x axis
  scale_x_continuous(breaks = scales::pretty_breaks(n = 8))

## Print the combined plot
logit_prior_plots / prob_trans_plots
```
Since the model here is computationally quite efficient–on average, each model takes approximately 2.5 seconds to converge with 30,000 post-warmup draws–and in the interest of completeness, I search the full parameter space, which consists of \(2^{k}\) possible model specifications. Although an exhaustive search is feasible in this context, researchers will often instead need to constrain the number of possible model configurations, generally by specifying an implicit prior model probability of 1 for certain parameters (Hinne et al. 2020; Montgomery and Nyhan 2010). Although brms provides an easy to use R interface to Stan that makes fitting a wide variety of Bayesian regression models easier than ever before and requires no direct knowledge of Stan, it is limited for my purposes here by the fact that it requires recompiling the underlying Stan model for each fit. This process ends up being extremely inefficient for a large number of model configurations since each model takes three times as long to compile as it does to fit.
To circumvent this issue, I implement the model directly in Stan in a way designed to be directly compatible with brms. For models containing the parameter of interest, the model implements the Bayesian g-formula to estimate the average marginal effect of women's participation in combat on the probability of a rebel-to-party transition during the estimation process in the generated quantities block. As I previously demonstrated in this post, it is then possible to compile the Stan model once and use the same program for each dataset using {cmdstanr} directly. Further computational gains can be obtained by using the furrr package to fit multiple models in parallel, and I specify a series of functions to streamline this process as shown below.
```stan
// Bayesian Logit Model with Optional Bootstrapped AMEs via the Bayesian g-formula
data {
  int<lower=1> N; // N Observations
  array[N] int Y; // Response Variable
  int<lower=1> K; // Number of Population-Level Effects
  matrix[N, K] X; // Population-level Design Matrix
  int prior_only; // Should the likelihood be ignored?
  int treat_pos; // Data for the bootstrapped AMEs
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc; // centered version of X without an intercept
  vector[Kc] means_X; // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[ : , i]);
    Xc[ : , i - 1] = X[ : , i] - means_X[i - 1];
  }
  // Bootstrap Probabilities
  vector[N] boot_probs = rep_vector(1.0 / N, N);
  // Matrices for the bootstrapped predictions
  matrix[N, Kc] Xa;
  matrix[N, Kc] Xb;
  // Potential Outcome Y(X = 1, Z)
  Xa = X[ , 2:K];
  Xa[ , treat_pos - 1] = ones_vector(N);
  // Potential Outcome Y(X = 0, Z)
  Xb = X[ , 2:K];
  Xb[ , treat_pos - 1] = zeros_vector(N);
}
parameters {
  vector[Kc] b; // Regression Coefficients
  real Intercept; // Centered Intercept
}
transformed parameters {
  real lprior = 0; // Prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 0.75);
  lprior += normal_lpdf(Intercept | 0, 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += bernoulli_logit_glm_lpmf(Y | Xc, Intercept, b);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // Actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // Additionally sample draws from priors
  array[Kc] real prior_b = normal_rng(0, 0.75);
  real prior_Intercept = normal_rng(0, 1);
  // Row index to be sampled for bootstrap
  int row_i;
  // Calculate Average Marginal Effect in the Bootstrapped sample
  real AME = 0;
  array[N] real Y_X1; // Potential Outcome Y(X = 1, Z)
  array[N] real Y_X0; // Potential Outcome Y(X = 0, Z)
  for (n in 1:N) {
    // Sample the Baseline Covariates
    row_i = categorical_rng(boot_probs);
    // Sample Y(x) where x = 1 and x = 0
    Y_X1[n] = bernoulli_logit_rng(b_Intercept + Xa[row_i] * b);
    Y_X0[n] = bernoulli_logit_rng(b_Intercept + Xb[row_i] * b);
    // Add Contribution of the ith Observation to the Bootstrapped AME
    AME = AME + (Y_X1[n] - Y_X0[n]) / N;
  }
  // Take the mean of the posterior expectations
  real EYX1 = mean(Y_X1); // E[Y | X = 1, Z]
  real EYX0 = mean(Y_X0); // E[Y | X = 0, Z]
}
```
```stan
// Bayesian Logit Model for Models without the Treatment Variable
data {
  int<lower=1> N; // total number of observations
  array[N] int Y; // response variable
  int<lower=1> K; // number of population-level effects
  matrix[N, K] X; // population-level design matrix
  int prior_only; // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc; // centered version of X without an intercept
  vector[Kc] means_X; // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[ : , i]);
    Xc[ : , i - 1] = X[ : , i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b; // Regression Coefficients
  real Intercept; // Centered Intercept
}
transformed parameters {
  real lprior = 0; // Prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 0.75);
  lprior += normal_lpdf(Intercept | 0, 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += bernoulli_logit_glm_lpmf(Y | Xc, Intercept, b);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // Actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // Additionally sample draws from priors
  array[Kc] real prior_b = normal_rng(0, 0.75);
  real prior_Intercept = normal_rng(0, 1);
}
```
```r
#' A Function for building combinations of formulas for model averaging
#' 
#' @importFrom glue glue_collapse
#' @importFrom purrr map_chr
#' @importFrom stringr str_replace
#'
#' @param terms A named list of terms to create all unique combinations of.
#' See usage example for direction on how to specify each term
#' 
#' @param resp A string specifying the name of the response vector
#' 
#' @param base_terms An optional argument indicating additional terms to
#' include in all formulas. This can be used to place implicit constraints on 
#' the parameter space to reduce computational burden
#' 
#' @param ... Reserved for future development but currently unused
#'
#' @return A character vector of model formulas containing all unique 
#' combinations of the elements in terms
#' 
#' @export bma_formulas
#' 
#' @examples
#' # Specify the input for the resp argument
#' response <- "fun"
#' 
#' # Specify the inputs for the terms argument
#' covariates <- list(
#'   age = c("", "age"),
#'   sex = c("", "sex"),
#'   rock = c("", "rock")
#' )
#' 
#' # Build the formulas
#' model_forms <- bma_formulas(terms = covariates, resp = response)
#' 
bma_formulas <- function(terms, resp, base_terms = NULL, ...) {
  ## Validate the formula arguments
  stopifnot(exprs = {
    is.list(terms)
    is.character(resp)
  })
  
  ## Make a list of all unique combinations of lhs ~ rhs
  formulas <- expand.grid(terms, stringsAsFactors = FALSE)
  
  ## Initialize a list to store the formulas in
  out <- list()
  
  ## Build a list of formulas for each pair
  for (i in 1:nrow(formulas)) {
    out[[(i)]] <- glue_collapse(formulas[i, !is.na(formulas[i, ])], sep = " + ")
  }
  
  # Paste the response term
  out <- paste(resp, "~", out)
  
  # If base_terms is non-null, add additional terms to all formulas
  if (!is.null(base_terms)) {
    out <- paste(out, base_terms, sep = " + ")
  }
  
  out <- map_chr(.x = out, ~ str_replace(.x, "character\\(0\\)", "1"))
  
  return(out)
}

#' Fitting Models in Parallel with brms via cmdstanr
#'
#' @param formula A `brmsformula` object for the model.
#' 
#' @param treat A string argument containing the name of the treatment
#' or a regular expression to match against the columns in `data`.
#' 
#' @param prior A `brmsprior` object containing the priors for the model.
#' 
#' @param data A data frame containing the response and inputs named in
#' `formula`.
#' 
#' @param cores The maximum number of MCMC chains to run in parallel.
#' 
#' @param chains The number of Markov chains to run.
#' 
#' @param warmup The number of warmup iterations to run per chain.
#' 
#' @param sampling The number of post-warmup iterations to run per chain.
#' 
#' @param stan_models The compiled Stan program(s) to be passed to 
#' `cmdstanr::sample`.
#' 
#' @param save_file The file path where the brmsfit object should be saved.
#' 
#' @param ... Additional arguments passed to `brms::brm`
#'
#' @return An object of class `brmsfit`
#' 
#' @export cmdstan_logit
#'
cmdstan_logit <- function(formula, treat, prior, data, cores, chains,
                          warmup, sampling, stan_models, save_file, ...) {
  # Check if the file already exists
  file_check <- file.exists(save_file)
  
  if (isTRUE(file_check)) {
    # Read in the file if it already exists
    out <- readr::read_rds(file = save_file)
  } else {
    # Total iterations for brms
    niter <- sampling + warmup
    
    # Initialize an empty brmsfit object
    brms_obj <- brm(
      formula = formula,
      family = brmsfamily("bernoulli", link = "logit"),
      prior = prior,
      data = data,
      cores = cores,
      chains = chains,
      iter = niter,
      warmup = warmup,
      empty = TRUE,
      seed = 123456,
      sample_prior = "yes",
      ...
    )
    
    # Get the data for the stan model
    stan_data <- standata(brms_obj)
    
    # Check whether the model contains the treatment
    stan_data$treat_pos <- .match_treat_args(stan_data, treat)
    
    # Select the Stan model to use based on the data
    if (!is.null(stan_data$treat_pos)) {
      cmdstan_model <- stan_models[["gformula"]]
    } else {
      cmdstan_model <- stan_models[["reference"]]
    }
    
    # Fit the brms model
    out <- .fit_cmdstan_to_brms(cmdstan_model, brms_obj, stan_data, 
                                cores, chains, warmup, sampling)
    
    # Write the brmsfit object to an rds file
    readr::write_rds(out, file = save_file, "xz", compression = 9L)
  }
  
  # Return the model object
  return(out)
}

# Function for detecting the treatment vector
.match_treat_args <- function(stan_data, treat) {
  # Check if any of the columns in X match the treatment
  check_treat <- grepl(treat, colnames(stan_data$X))
  
  # Get the position of the treatment if present
  if (any(check_treat)) {
    treat_pos <- which(check_treat)
  } else {
    treat_pos <- NULL
  }
  
  return(treat_pos)
}

# Function to fit the models with cmdstanr and store them in the empty brmsfit
.fit_cmdstan_to_brms <- function(cmdstan_model, brms_obj, stan_data, 
                                 cores, chains, warmup, sampling) {
  # Set the initial number of draws to 0
  min_draws <- 0
  
  # Repeat the run if any of the chains stop unexpectedly
  while (min_draws < (sampling * chains)) {
    # Fit the Stan Model
    fit <- cmdstan_model$sample(
      data = stan_data,
      seed = 123456,
      refresh = 1000,
      sig_figs = 5,
      parallel_chains = cores,
      chains = chains,
      iter_warmup = warmup,
      iter_sampling = sampling,
      max_treedepth = 13
    )
    
    # Update the check
    min_draws <- posterior::ndraws(fit$draws())
  }
  
  # Convert the environment to a stanfit object and add it to the brmsfit
  brms_obj$fit <- rstan::read_stan_csv(fit$output_files())
  
  # Rename the parameters
  brms_obj <- brms::rename_pars(brms_obj)
  
  # Store the expectations inside the brmsfit object
  if (!is.null(stan_data$treat_pos)) {
    # Extract the expectations and contrasts
    brms_obj$predictions <- fit$draws(
      variables = c("EYX0", "EYX1", "AME"),
      format = "draws_df"
    )
  }
  
  # Store run times
  brms_obj$times <- fit$time()
  
  # Return the brmsfit object
  return(brms_obj)
}
```
We can then generate a formula for each of the \(2^{k}\) model combinations, specify priors, and compile the Stan programs to be passed to cmdstan_logit as shown below. I estimate each of the 384 possible candidate models, running six parallel Markov chains for 8,000 iterations each and discarding the first 3,000 after the initial warmup adaptation stage.6 This leaves 30,000 post-warmup draws per model for post-estimation inference. While this is well in excess of what is required for reliable estimation, the large number of samples is necessary to ensure stability of the bridge sampling approximation for the marginal likelihood (Schad et al. 2022). Fitting three models in parallel via furrr, estimation and approximate leave-one-out cross-validation take less than 10 minutes in total to complete. Owing in part to the fact that I run the algorithm 250 times per model, however, estimating the marginal likelihood via bridge sampling takes roughly 9 hours in total.
```r
# Define the list of predictors, parameter space is 2^k
model_terms <- list(
  femrebels = c("femrebels", "femrebels_exs", NA),
  left_ideol = c("left_ideol", NA),
  nationalist = c("nationalist", NA),
  un_intervention = c("un_intervention", NA),
  party_affil = c("party_affil", NA),
  ethnic_group = c("ethnic_group", NA),
  africa = c("africa", NA),
  qsi = c("qsi_pol + qsi_econ + qsi_social", NA)
)

# Build the model formulas, results in 383 combinations
model_forms <- bma_formulas(terms = model_terms, resp = "elect_party") %>%
  map(.x = ., ~ bf(.x))

# Specify priors for the model parameters
logit_priors <- prior(normal(0, 0.75), class = b) +
  prior(normal(0, 1), class = Intercept)

# Names of the files to write the models to
fits_files <- str_c(
  main_priors_dir,
  "Party_Transition_Logit_Main_",
  seq_along(model_forms),
  ".rds"
)

# Compile the Stan models
bayes_logit_mods <- list(
  "reference" = cmdstan_model(
    stan_file = str_c(main_priors_dir, "Party_Transition_Logit_Main.stan"),
    force_recompile = FALSE
  ),
  "gformula" = cmdstan_model(
    stan_file = str_c(main_priors_dir, "Party_Transition_Logit_Main_gformula.stan"),
    force_recompile = FALSE
  )
)
```
```r
# Planning for the future
plan(tweak(multisession, workers = 3))

# Fit the models in parallel via {future}
bayes_model_fits <- future_map2(
  .x = model_forms,
  .y = fits_files,
  .f = ~ cmdstan_logit(
    formula = .x,
    prior = logit_priors,
    treat = "femrebels|femrebels_exs",
    data = model_df,
    cores = 6,
    chains = 6,
    warmup = 3e3,
    sampling = 5e3,
    stan_models = bayes_logit_mods,
    save_file = .y,
    control = list(max_treedepth = 13),
    save_pars = save_pars(all = TRUE)
  ),
  .options = furrr_options(
    scheduling = 1,
    seed = TRUE,
    prefix = "prefix"
  ),
  .progress = TRUE
) # Takes about ~5 minutes in total
```
```r
# Add PSIS LOO CV Criteria
bayes_model_fits <- future_map2(
  .x = bayes_model_fits,
  .y = fits_files,
  .f = ~ add_criterion(
    .x,
    criterion = "loo",
    cores = 4,
    file = stringr::str_remove(.y, ".rds")
  ),
  .options = furrr_options(
    scheduling = 1,
    seed = TRUE,
    prefix = "prefix"
  ),
  .progress = TRUE
) # Takes about ~4 minutes in total

# Add log marginal likelihood ----
bayes_model_fits <- future_map(
  .x = bayes_model_fits,
  .f = ~ add_criterion(
    .x,
    criterion = "marglik",
    max_iter = 5e3,
    repetitions = 250,
    cores = 8,
    silent = TRUE
  ),
  .options = furrr_options(
    scheduling = 1,
    seed = TRUE,
    prefix = "prefix"
  ),
  .progress = TRUE
) # Takes about ~9.5 hours in total

# Back from the future
plan(sequential)
```
Hypothesis Testing
In a Bayesian framework, model comparison is formalized by means of posterior model probabilities as shown in equation 3. Given \(m\) possible candidate models and letting \(i\) denote those containing the parameter of interest and \(k\) those that do not, we can obtain posterior inclusion and exclusion probabilities for a given parameter. The ratio of these two probabilities shown in equation 10 is an inclusion Bayes Factor and represents the relative posterior odds of models containing the parameter of interest compared to all of the candidate models that do not.

$$\mathrm{BF}_{\text{Inclusion}} = \frac{\sum_{i} \Pr(\mathcal{M}_{i} \,|\, y)}{\sum_{k} \Pr(\mathcal{M}_{k} \,|\, y)} \tag{10}$$
In terms of hypothesis testing, posterior inclusion probabilities provide the epistemologically correct way to answer the questions most social science studies are concerned with–given our modeling assumptions, how likely are those models containing the treatment \(\mathrm{X}\) compared to those that do not. In other words, a posterior inclusion probability provides a continuous measure of evidence we can use to evaluate how probable it is our treatment or intervention of interest belongs to the true data generation process (Hinne et al. 2020).7
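As a toy illustration of equation 10, with made-up posterior model probabilities for four candidate models, the inclusion Bayes Factor is just the ratio of the summed posterior probabilities on either side when each model is equally likely a priori:

```r
# Made-up posterior model probabilities for four candidate models
post_probs <- c(M1 = 0.42, M2 = 0.31, M3 = 0.18, M4 = 0.09)

# Which models contain the treatment of interest
contains_x <- c(TRUE, TRUE, FALSE, FALSE)

# Posterior inclusion and exclusion probabilities
pr_inclusion <- sum(post_probs[contains_x])
pr_exclusion <- sum(post_probs[!contains_x])

# Under equal prior model probabilities the prior inclusion odds are 1, so
# the posterior inclusion odds equal the inclusion Bayes Factor
pr_inclusion / pr_exclusion
#> [1] 2.703704
```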
Although my interest here lies primarily in using the posterior model probabilities to obtain a model-wise mixture distribution of the substantive effect estimates, I briefly illustrate how one can formally test a hypothesis in a Bayesian framework and obtain a vague answer to a precise question. This may be useful in a sequential testing approach where we first wish to evaluate whether an effect exists before proceeding to interpretation, though one may also simply assume effects exist and focus on estimating their magnitude instead.
For the purposes of this illustration, I compare those models containing each of the two measures of women’s participation in combat during civil wars to each of the possible model specifications that does not include either of the two measures. Although by convention, it is common to take the prior model probabilities to be equally likely–in effect an implicit uniform prior–this is far from the only possible approach and in many cases, particularly when adjudicating between competing theories, it may make more sense to weight the hypotheses according to some external source such as existing research on the subject, meta-analyses, etc. For simplicity here, I assume each of the models is equally likely a priori and obtain posterior probabilities and inclusion Bayes Factors for the parameters of interest as shown below.8
```r
# Extract the parameters of interest in each model
model_params <- map(
  bayes_model_fits,
  ~ variables(.x) %>% str_subset(., "^b_")
)

# Build a reference data frame with the positions of the predictors
params_pos_df <- tibble(
  # Model Names
  model = str_c("M", 1:length(model_params)),
  # Female Rebels (Including Suicide Bombers)
  femrebels_pos = map_lgl(
    model_params,
    ~ str_which(.x, "b_femrebels$") %>% length() == 1
  ),
  # Female Rebels (Excluding Suicide Bombers)
  femrebelsexs_pos = map_lgl(
    model_params,
    ~ str_which(.x, "b_femrebels_exs") %>% length() == 1
  ),
  # No Female Rebels Predictor
  reference_pos = map_lgl(
    model_params,
    ~ str_which(.x, "b_femrebel*") %>% length() %>% !.
  )
)

# Apply names to the model fits list
names(bayes_model_fits) <- str_c("M", 1:length(bayes_model_fits))

# Extract the marginal likelihood object from each model
main_marglik <- map(bayes_model_fits, ~ .x$criteria$marglik)
```
```r
# Posterior Probabilities for female rebel models
femrebels_marglik <- with(
  params_pos_df,
  main_marglik[c(which(femrebels_pos == TRUE), which(reference_pos == TRUE))]
)

# Matrix of Relative Posterior Probabilities
femrebels_post_probs <- do_call(
  "post_prob",
  args = unname(femrebels_marglik),
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_post_probs) <- names(femrebels_marglik)

# Build a tibble for the comparisons
femrebels_vs_null_df <- tibble(
  prob_femrebels = rowSums(femrebels_post_probs[, 1:128]),
  prob_null = rowSums(femrebels_post_probs[, 129:255]),
  Inclusion_BF = prob_femrebels / prob_null
)
```
```r
# Posterior Probabilities for female rebel models (excluding SBs)
femrebels_exs_marglik <- with(
  params_pos_df,
  main_marglik[c(which(femrebelsexs_pos == TRUE), which(reference_pos == TRUE))]
)

# Matrix of Relative Posterior Probabilities
femrebels_exs_post_probs <- do_call(
  "post_prob",
  args = unname(femrebels_exs_marglik),
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_exs_post_probs) <- names(femrebels_exs_marglik)

# Build a tibble for the comparisons
femrebels_exs_vs_null_df <- tibble(
  prob_femrebels_exs = rowSums(femrebels_exs_post_probs[, 1:128]),
  prob_null = rowSums(femrebels_exs_post_probs[, 129:255]),
  Inclusion_BF = prob_femrebels_exs / prob_null
)
```
```r
# Fancy ggplots because aesthetics
bf_femrebels_vs_null <- ggplot(
  data = femrebels_vs_null_df,
  aes(x = Inclusion_BF)
) +
  # Slab stat for the raincloud plot
  stat_slab(
    aes(slab_alpha = stat(pdf)),
    slab_color = "#1BD0D5FF",
    fill_type = "segments",
    fill = "#2EF19CFF",
    show.legend = FALSE,
    scale = 1
  ) +
  # Dot interval stat for the raincloud plot
  stat_dotsinterval(
    point_interval = "median_qi",
    fill = "#99FE42FF",
    shape = 23,
    .width = c(0.68, 0.90),
    point_size = 6,
    stroke = 2,
    slab_color = "#1BD0D5FF",
    show.legend = FALSE,
    side = "bottom",
    layout = "weave",
    dotsize = 0.75
  ) +
  # Add axis labels
  labs(
    title = "Female Combatants vs. Null",
    x = parse(text = "bold(Inclusion~BF[ij])"),
    y = ""
  ) +
  # Adjust the breaks on the y axis
  scale_y_continuous(breaks = NULL) +
  # Adjust the breaks on the x axis
  scale_x_continuous(breaks = scales::pretty_breaks(n = 6))

# Fancy ggplots because aesthetics
bf_femrebels_exs_vs_null <- ggplot(
  data = femrebels_exs_vs_null_df,
  aes(x = Inclusion_BF)
) +
  # Slab stat for the raincloud plot
  stat_slab(
    aes(slab_alpha = stat(pdf)),
    slab_color = "#AB82FF",
    fill_type = "segments",
    fill = "#9400D3",
    show.legend = FALSE,
    scale = 1
  ) +
  # Dot interval stat for the raincloud plot
  stat_dotsinterval(
    point_interval = "median_qi",
    fill = "#B23AEE",
    shape = 21,
    .width = c(0.68, 0.90),
    point_size = 6,
    stroke = 2,
    slab_color = "#AB82FF",
    show.legend = FALSE,
    side = "bottom",
    layout = "weave",
    dotsize = 0.75
  ) +
  # Add axis labels
  labs(
    title = "Female Combatants (Excluding SBs) vs. Null",
    x = parse(text = "bold(Inclusion~BF[ij])"),
    y = ""
  ) +
  # Adjust the breaks on the y axis
  scale_y_continuous(breaks = NULL) +
  # Adjust the breaks on the x axis
  scale_x_continuous(breaks = scales::pretty_breaks(n = 6))

bf_femrebels_vs_null | bf_femrebels_exs_vs_null
```
Given the prior assumptions about the universe of possible effect sizes and prior probabilities for the hypothesis that rebel groups in which women participate in combat during wartime are more likely to transition to political parties at war's end, the posterior inclusion probability for female combatants is approximately 0.841 when we include cases in which women only participated as suicide bombers and 0.936 if we code participation only in cases where women served in a combat capacity that was not explicitly limited to suicide bombing. Stated differently, as the inclusion Bayes Factors in figure 3 illustrate, the observed data is between 5.29 and 14.64 times more likely to have arisen from those models that include a predictor for female combatants, depending on whether we consider cases in which women only participated in combat as suicide bombers or not. This suggests a reasonable degree of evidence in support of the expectation that women's participation in combat has a small to moderate effect on the probability of rebel to party transitions.
Model Averaged Marginal Effects
Turning to the substantive effect estimate, I illustrate here three different approaches to estimating model averaged marginal effects. When fitting each of the models in the Stan program above, we applied the Bayesian g-formula to estimate the population average treatment effect in each of the models containing the female combatants predictor, and we can retrieve those estimates directly from the brmsfit objects. The model_averaged_ame function shown below implements the approach to averaging over a matrix of posterior probability weights outlined in equation 8, along with the simpler approach from equation 7.
```r
#' Estimating Model Averaged Marginal Effects and Contrasts for Bayesian Models
#' 
#' This function facilitates the estimation of model averaged and stacked 
#' marginal effects for Bayesian models and is designed to account for uncertainty 
#' in the process of model specification and selection when estimating average 
#' marginal effects and probability contrasts.
#'
#' @aliases model_averaged_ame
#' 
#' @import data.table
#'
#' @param x A datatable of posterior draws for the average marginal effects
#' or predictions such as that returned by the `posterior_contrasts` function.
#'
#' @param weights An \eqn{n \times m} matrix of posterior probability weights 
#' such as that returned by `bridgesampling::post_prob` or an m length vector
#' of stacking or pseudo-BMA weights.
#' 
#' @param ndraws The total number of posterior draws to sample from `x`
#' proportional to the model weights.
#' 
#' @param summary A logical argument indicating whether to return the full 
#' draws for each row in `weights` or return the average for each model-draw 
#' pair. Defaults to `FALSE`
#'
#' @param ... Additional arguments for future development, currently unused.
#'
#' @return A datatable containing a weighted average of the posterior draws 
#' for the model averaged marginal effects.
#' 
#' @export model_averaged_ame
#' 
model_averaged_ame <- function(x, weights, ndraws, summary = TRUE, ...) {
  # Nest the draws data frame
  draws <- x[, list(x.yz = list(.SD)), by = model]
  
  # If weights are a matrix, calculate the draws by row
  if (is.matrix(weights)) {
    # Construct a matrix of draw weights
    weighted_draws <- apply(weights, MARGIN = 1, FUN = function(x) {
      brms:::round_largest_remainder(x * ndraws)
    })
    
    # Initialize a list to store the draws in
    out <- list()
    
    # Loop over each column in the matrix
    for (i in 1:ncol(weighted_draws)) {
      # Randomly sample n rows for each model based on the weights
      draw_ids <- lapply(weighted_draws[, i], function(x) {
        sample(sum(weighted_draws[, i]), size = x)
      })
      
      # Randomly sample n draws proportional to the weights
      out[[i]] <- lapply(seq_along(draws[, x.yz]), function(id) {
        draws[, x.yz[[id]][draw_ids[[id]]]]
      })
      
      # Bind the draws into a single data table per repetition
      out[[i]] <- rbindlist(out[[i]], idcol = "model")
    }
    
    # Combine everything into a single data table
    out <- rbindlist(out, idcol = "bridge_rep")
    
    # Average over the bridge sampling repetitions by draw
    if (isTRUE(summary)) {
      out <- out[
        , lapply(.SD, mean),
        keyby = .(.draw, model),
        .SDcols = patterns("EY|AME")
      ]
    }
  } else if (is.vector(weights)) {
    # Construct a vector of draw weights
    weighted_draws <- brms:::round_largest_remainder(weights * ndraws)
    
    # Randomly sample n rows for each model based on the weights
    draw_ids <- lapply(weighted_draws, function(x) {
      sample(sum(weighted_draws), size = x)
    })
    
    # Randomly sample n draws proportional to the weights
    out <- lapply(seq_along(draws[, x.yz]), function(id) {
      draws[, x.yz[[id]][draw_ids[[id]]]]
    })
    
    # Combine everything into a single data table
    out <- rbindlist(out, idcol = "model")
  }
  
  # Return the model averaged draws
  return(out)
}
```
```r
# Set the seed here for reproducibility
set.seed(2023)

# Extract the AMEs from the models containing femrebels
femrebels_ame_boot <- map(
  .x = with(params_pos_df, bayes_model_fits[which(femrebels_pos == TRUE)]),
  ~ .x$predictions
) %>%
  # Append the list of data tables into one
  rbindlist(., idcol = "model")

# Posterior Probabilities for female rebel models
femrebels_bma_marglik <- with(
  params_pos_df,
  main_marglik[which(femrebels_pos == TRUE)]
)

# Matrix of Relative Posterior Probabilities for BMA
femrebels_bma_post_probs <- do_call(
  "post_prob",
  args = unname(femrebels_bma_marglik),
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_bma_post_probs) <- names(femrebels_bma_marglik)

# Calculate the weighted draws for each model
femrebels_bmame <- model_averaged_ame(
  femrebels_ame_boot,
  weights = femrebels_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_bmame)
```
```r
# Extract the AMEs from the models containing femrebels (excluding SBs)
femrebels_exs_ame_boot <- map(
  .x = with(params_pos_df, bayes_model_fits[which(femrebelsexs_pos == TRUE)]),
  ~ .x$predictions
) %>%
  # Append the list of data tables into one
  rbindlist(., idcol = "model")

# Posterior Probabilities for female rebel (excluding SBs) models
femrebels_exs_bma_marglik <- with(
  params_pos_df,
  main_marglik[which(femrebelsexs_pos == TRUE)]
)

# Matrix of Relative Posterior Probabilities for BMA
femrebels_exs_bma_post_probs <- do_call(
  "post_prob",
  args = unname(femrebels_exs_bma_marglik),
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_exs_bma_post_probs) <- names(femrebels_exs_bma_marglik)

# Calculate the weighted draws for each model
femrebels_exs_bmame <- model_averaged_ame(
  femrebels_exs_ame_boot,
  weights = femrebels_exs_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_exs_bmame)
```
We can then plot the posterior distribution of the BMAMEs using ggdist as shown below, which allows us to nicely depict the uncertainty in the estimates. As figure 4 illustrates, the model averaged marginal effect estimate for the expected difference in the probability of a former combatant group transitioning to a post-conflict political party is positive for both measures. When considering only cases where women served in a combat capacity that was not explicitly limited to suicide bombing, we see a 98.38% probability of the effect being positive and in the expected direction, with a posterior median of 0.2232 and an 89% credible interval spanning the range 0.0625 to 0.3750.
The results are largely identical for the other measure that counts those cases in which female combatants only served as suicide bombers as instances of participation, with a 96.32% probability of the effect being positive, a posterior median of 0.1875, and an 89% probability the parameter value falls between 0.0268 and 0.3482. Overall, the general conclusion here is that rebel groups where women participate in combat roles during wartime are more likely to form parties and compete in post-conflict elections.
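The summaries reported above can be computed directly from the model averaged draws; a brief sketch, assuming the `AME` column in the data table returned by `model_averaged_ame`:

```r
# Posterior summaries of the model averaged AME draws
femrebels_exs_bmame[, .(
  prob_positive = mean(AME > 0),      # probability the effect is positive
  median_ame = median(AME),           # posterior median of the AME
  lower_89 = quantile(AME, 0.055),    # lower bound of the 89% interval
  upper_89 = quantile(AME, 0.945)     # upper bound of the 89% interval
)]
```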
```r
# Append these into a single dataframe for plotting
bmame_boot_ests <- rbindlist(
  list(
    "Female Combatants (Excluding SBs)" = femrebels_exs_bmame,
    "Female Combatants (Including SBs)" = femrebels_bmame
  ),
  idcol = "plot_row"
)

# Initialize the ggplot2 object for the model averaged AMEs
ggplot(bmame_boot_ests, aes(y = AME, x = plot_row)) +
  # Violin plot looking things
  stat_eye(
    aes(
      slab_alpha = after_stat(pdf),
      slab_fill = stat(y > 0),
      point_fill = plot_row,
      shape = plot_row
    ),
    slab_size = 2,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    show.legend = FALSE,
    scale = 0.7
  ) +
  # Adjust the y axis scale
  scale_y_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.4, 0.65)
  ) +
  # Adjust fill scale
  scale_fill_manual(
    values = c("#23C3E4FF", "#C82803FF"),
    aesthetics = "slab_fill"
  ) +
  # Adjust fill scale
  scale_fill_manual(
    values = c("#3AA2FCFF", "#FD8A26FF"),
    aesthetics = "point_fill"
  ) +
  # Adjust shape scale
  scale_shape_manual(values = c(22, 21)) +
  # Add labels to the plot
  labs(
    y = parse(text = "bold('Model Averaged AME '*Delta)"),
    x = ""
  )
```
Although this is the ideal approach since it accounts for the additional source of computational uncertainty introduced by approximating the marginal likelihood, it can be expensive in terms of both computational time and memory requirements in addition to requiring a minimal knowledge of Stan. We can sidestep Stan and focus exclusively on brms by defining a simple R function to calculate contrasts and model averaged marginal effects as shown below. As figure 5 illustrates, the estimates here are virtually identical and do not differ substantively from those in figure 4. This estimation process is reasonably straightforward and can be used for any model supported by brms.
```r
#' Posterior Contrasts for Bayesian Models fit with {brms}
#'
#' @param models A named list of model objects of class `brmsfit` for which to 
#' generate predictions.
#' 
#' @param X A string indicating the name of the treatment in the models.
#' 
#' @param data Data used to fit the model objects.
#' 
#' @param contrasts Values of `X` at which to generate predictions. Defaults to
#' `c(0, 1)` but could also be set to factor levels.
#' 
#' @param cores Number of cores to use for parallel computation. If cores > 1,
#' future::multisession is used to speed computation via `{furrr}`. Requires
#' the `{furrr}` package.
#' 
#' @param ... Additional arguments passed down to `brms::posterior_predict`
#'
#' @return A list of posterior contrasts for each model in `models`
#' 
#' @export posterior_contrasts
#'
posterior_contrasts <- function(models, X, data, contrasts = c(0, 1),
                                cores = 1, ...) {
  # Check that all elements of models are of class brmsfit
  for (i in seq_along(models)) {
    brms_check <- brms::is.brmsfit(models[[i]])
    stopifnot("All elements in models must be of class brmsfit" = brms_check)
  }
  
  # Y_i(X = 0, Z)
  lo <- data |> mutate("{X}" := contrasts[1])
  
  # Y_i(X = 1, Z)
  hi <- data |> mutate("{X}" := contrasts[2])
  
  # Check if cores > 1
  if (cores > 1) {
    # Requires {furrr}
    require(furrr)
    
    # Fit models in parallel via future
    plan(tweak(multisession, workers = cores))
    
    # Posterior predictions for each data set
    predictions <- furrr::future_map(
      .x = models,
      ~ .get_posterior_predictions(hi, lo, model = .x, ...),
      .options = furrr::furrr_options(
        scheduling = 1,
        seed = TRUE,
        prefix = "prefix"
      ),
      .progress = TRUE
    )
    
    # Close the future session
    plan(sequential)
  }
  
  if (cores == 1) {
    # Posterior predictions for each data set
    predictions <- purrr::map(
      .x = models,
      ~ .get_posterior_predictions(hi, lo, model = .x, ...)
    )
  }
  
  # Return the list of predictions
  return(predictions)
}

# Helper function to get posterior prediction contrasts
.get_posterior_predictions <- function(hi, lo, model, ...) {
  # Predictions for Y_i(X = 1, Z)
  EYX1 <- brms::posterior_predict(model, newdata = hi, ...)
  
  # Predictions for Y_i(X = 0, Z)
  EYX0 <- brms::posterior_predict(model, newdata = lo, ...)
  
  # Average over the observations for each draw in the prediction matrix
  out <- data.table(
    EYX1 = rowMeans(EYX1),
    EYX0 = rowMeans(EYX0),
    AME = rowMeans(EYX1 - EYX0),
    .draw = 1:ndraws(model)
  )
  
  # Return just the average contrast
  return(out)
}
```
```r
# Models containing femrebels
femrebels_mods <- with(
  params_pos_df,
  bayes_model_fits[which(femrebels_pos == TRUE)]
)

# Generate posterior expectations for the AMEs
femrebels_ame_draws <- posterior_contrasts(
  models = femrebels_mods,
  X = "femrebels",
  data = femrebels_mods[[1]]$data,
  contrasts = c(0, 1),
  cores = 4
)

# Calculate the weighted draws for each model
femrebels_bmame_draws <- model_averaged_ame(
  femrebels_ame_draws,
  weights = femrebels_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_bmame_draws)
```
```r
# Models containing femrebels_exs
femrebels_exs_mods <- with(
  params_pos_df,
  bayes_model_fits[which(femrebelsexs_pos == TRUE)]
)

# Generate posterior expectations for the AMEs
femrebels_exs_ame_draws <- posterior_contrasts(
  models = femrebels_exs_mods,
  X = "femrebels_exs",
  data = femrebels_exs_mods$M112$data,
  contrasts = c(0, 1),
  cores = 4
)

# Calculate the weighted draws for each model
femrebels_exs_bmame_draws <- model_averaged_ame(
  femrebels_exs_ame_draws,
  weights = femrebels_exs_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_exs_bmame_draws)
```
```r
femrebels_bmame_plot_alt <- femrebels_bmame_draws |>
  # Initialize the ggplot2 object for the model averaged AME
  ggplot(aes(x = AME)) +
  # Add a slab interval to plot the weighted posterior
  stat_halfeye(
    aes(
      slab_alpha = after_stat(pdf),
      slab_fill = after_stat(x > 0)
    ),
    slab_size = 2,
    slab_colour = "#1BD0D5FF",
    shape = 22,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    point_fill = "#99FE42FF",
    show.legend = FALSE
  ) +
  # Add labels to the plot
  labs(
    x = parse(text = "bold('Model Averaged AME '*Delta)"),
    y = "Posterior Density",
    title = "Female Combatants (Including Suicide Bomber Roles)"
  ) +
  # Adjust the slab fill scale
  scale_fill_manual(
    values = c("#208820", "#0060B0"),
    aesthetics = "slab_fill"
  ) +
  # Adjust the x axis scale
  scale_x_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.4, 0.68)
  ) +
  # Adjust the y axis scale
  scale_y_continuous(breaks = scales::pretty_breaks(n = 6))

femrebels_exs_bmame_plot_alt <- femrebels_exs_bmame_draws |>
  # Initialize the ggplot2 object for the model averaged AME
  ggplot(aes(x = AME)) +
  # Add a slab interval to plot the weighted posterior
  stat_halfeye(
    aes(
      slab_alpha = after_stat(pdf),
      slab_fill = after_stat(x > 0)
    ),
    slab_size = 2,
    slab_colour = "#1BD0D5FF",
    shape = 22,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    point_fill = "#99FE42FF",
    show.legend = FALSE
  ) +
  # Add labels to the plot
  labs(
    x = parse(text = "bold('Model Averaged AME '*Delta)"),
    y = "Posterior Density",
    title = "Female Combatants (Excluding Suicide Bomber Roles)"
  ) +
  # Adjust the slab fill scale
  scale_fill_manual(
    values = c("#208820", "#0060B0"),
    aesthetics = "slab_fill"
  ) +
  # Adjust the x axis scale
  scale_x_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.4, 0.68)
  ) +
  # Adjust the y axis scale
  scale_y_continuous(breaks = scales::pretty_breaks(n = 6))

# Combine the two plots with {patchwork}
(femrebels_bmame_plot_alt / femrebels_exs_bmame_plot_alt)
```
Illustration in {marginaleffects}
Finally, in the interest of encouraging applied researchers to take uncertainty seriously and to integrate sensitivity analysis directly into their work rather than treating it as something to undertake only when forced, I turn now to a straightforward illustration of BMAMEs using the marginaleffects package. Thanks to the feature-rich support for various approaches to averaging across posterior distributions provided by brms, marginaleffects handles most of the above steps automatically.
To obtain the BMAME for a given parameter while accounting for the uncertainty in the model specifications, version 0.5.0 and higher of marginaleffects allows users to specify the argument `type = "average"` in their call to either `marginaleffects::marginaleffects` or `marginaleffects::comparisons` for objects of class `brmsfit`, along with any additional arguments to be passed down to `pp_average`, such as the type of weights to estimate or, alternatively, a numeric vector of pre-estimated weights. The latter is usually the more computationally efficient option and is the approach I take in the example below.
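Before working through the full example, it may help to see the call pattern in skeletal form. The sketch below uses hypothetical placeholders, two fitted `brmsfit` objects `fit_a` and `fit_b`, a treatment named `"x"`, and a numeric vector of pre-estimated weights `w`, and mirrors the full call shown later in this section.

```r
# A minimal sketch of the call pattern; fit_a, fit_b, "x", and w are placeholders
library(marginaleffects)

cmp <- comparisons(
  fit_a,            # First model, passed as the model argument
  m1 = fit_b,       # Additional models are passed as named arguments
  variables = "x",  # Name of the treatment to compute contrasts for
  type = "average", # Average across the models via brms::pp_average
  weights = w       # Pre-estimated model weights passed down to pp_average
)
```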
Although up to this point I have been averaging over all possible combinations of models containing the parameter of interest, we will take a step back here and focus on the full model specifications, averaging over two models that differ only in which of the female combatant measures they contain. I begin by specifying and estimating each of the models using brms.
```r
# Prepare the data; we need the two variables to be named the same
model_data <- list(
  "femrebels" = model_df %>% mutate(femrebel = femrebels),
  "femrebels_exs" = model_df %>% mutate(femrebel = femrebels_exs)
)

# Model formula for the female rebels models
femrebel_bf <- bf(
  elect_party ~ femrebel + left_ideol + nationalist + un_intervention +
    party_affil + ethnic_group + africa + qsi_pol + qsi_econ + qsi_social
)

# Specify priors for the model parameters
femrebel_logit_priors <- prior(normal(0, 0.75), class = b) +
  prior(normal(0, 1), class = Intercept)

# Define the custom stanvars for the bootstrapped g-formula for comparison purposes
gformula_vars <- stanvar(
  x = 1,
  name = "treat_pos",
  block = "data",
  scode = "int<lower = 1, upper = K> treat_pos; // Treatment Position"
) +
  # Code for the transformed data block
  stanvar(
    scode = "
  // Bootstrap Probabilities
  vector[N] boot_probs = rep_vector(1.0/N, N);

  // Matrices for the bootstrapped predictions
  matrix[N, Kc] Xa;
  matrix[N, Kc] Xb;

  // Potential Outcome Y(X = 1, Z)
  Xa = X[, 2:K];
  Xa[, treat_pos] = ones_vector(N);

  // Potential Outcome Y(X = 0, Z)
  Xb = X[, 2:K];
  Xb[, treat_pos] = zeros_vector(N);
  ",
    position = "end",
    block = "tdata"
  ) +
  # Code for the bootstrapped g-formula in generated quantities
  stanvar(
    scode = "
  // Row index to be sampled for bootstrap
  int row_i;

  // Calculate Average Marginal Effect in the Bootstrapped sample
  real AME = 0;
  array[N] real Y_X1; // Potential Outcome Y(X = 1, Z)
  array[N] real Y_X0; // Potential Outcome Y(X = 0, Z)

  for (n in 1:N) {
    // Sample the Baseline Covariates
    row_i = categorical_rng(boot_probs);

    // Sample Y(x) where x = 1 and x = 0
    Y_X1[n] = bernoulli_logit_rng(b_Intercept + Xa[row_i] * b);
    Y_X0[n] = bernoulli_logit_rng(b_Intercept + Xb[row_i] * b);

    // Add Contribution of the ith Observation to the Bootstrapped AME
    AME = AME + (Y_X1[n] - Y_X0[n])/N;
  }

  // Take the mean of the posterior expectations
  real EYX1 = mean(Y_X1); // E[Y | X = 1, Z]
  real EYX0 = mean(Y_X0); // E[Y | X = 0, Z]
  ",
    position = "end",
    block = "genquant"
  )
```
```r
# Fit each of the models (6 chains, 8k iterations), takes about 3 seconds
femrebels_logit_fits <- map2(
  .x = model_data,
  .y = str_c("Party_Transition_Logit_", names(model_data), ".rds"),
  .f = ~ brm(
    formula = femrebel_bf,
    family = brmsfamily("bernoulli", link = "logit"),
    prior = femrebel_logit_priors,
    data = .x,
    cores = 12,    # Max number of cores to use for parallel chains
    chains = 6,    # Number of chains, should be at least 4
    iter = 8000,   # Total iterations = Warm-Up + Sampling
    warmup = 3000, # Warm-Up Iterations
    stanvars = gformula_vars, # Custom Stan code to include in the model
    save_pars = save_pars(all = TRUE),
    seed = 2023,
    backend = "cmdstanr",
    sample_prior = "yes",
    file = .y
  )
)

# Add PSIS LOO-CV and Marginal Likelihood
femrebels_logit_fits <- map(
  .x = femrebels_logit_fits,
  .f = ~ add_criterion(
    .x,
    criterion = c("loo", "marglik"),
    max_iter = 5e3,
    repetitions = 25,
    cores = 8,
    silent = TRUE
  )
)
```
We can then use `bridgesampling::post_prob` to obtain the relative probability of each model based on the marginal likelihood estimates, `brms::loo_model_weights` to obtain cross-validation based stacking and pseudo-BMA weights, and finally `comparisons` to estimate the model averaged probability contrasts.
```r
# Get the posterior probability weights
postprob_weights <- bridgesampling::post_prob(
  femrebels_logit_fits[[1]]$criteria$marglik, # Full Model, Including Suicide Bombers
  femrebels_logit_fits[[2]]$criteria$marglik, # Full Model, Excluding Suicide Bombers
  prior_prob = c(0.5, 0.5), # This is the default, I'm just being explicit
  model_names = c("femrebels", "femrebels_exs")
)

# Calculate Pseudo-BMA+ Weights
pbma_weights <- loo_model_weights(
  femrebels_logit_fits[[1]],
  femrebels_logit_fits[[2]],
  method = "pseudobma",
  model_names = names(femrebels_logit_fits),
  BB = TRUE,    # Pseudo-BMA+ with Bayesian Bootstrap
  BB_n = 10000, # Use 10k replications for Bayesian Bootstrap
  cores = 8L
)

# Calculate Stacking Weights
stacking_weights <- loo_model_weights(
  femrebels_logit_fits[[1]],
  femrebels_logit_fits[[2]],
  method = "stacking",
  model_names = names(femrebels_logit_fits),
  optim_control = list(reltol = 1e-10),
  cores = 8L
)

# Store all the weights in a named list
weights_ls <- list(
  "BMA" = apply(postprob_weights, 2, mean),
  "Stacking" = stacking_weights,
  "Pseudo-BMA+" = pbma_weights
)

# Print a comparison of the weights
print(weights_ls)
```
```r
# Calculate model averaged contrasts
comparisons_femrebels <- map(
  .x = weights_ls,
  ~ comparisons(
    femrebels_logit_fits[[1]],
    m1 = femrebels_logit_fits[[2]], # Subsequent models must be named arguments
    variables = "femrebel",
    type = "average",
    transform_pre = "differenceavg",
    weights = .x,
    method = "posterior_predict" # Accounts for observation-level uncertainty
  )
)

# Extract the posterior draws
comparisons_femrebels_draws <- map(
  .x = comparisons_femrebels,
  ~ posteriordraws(.x, shape = "long")
)

# Bind the draws frames into a data table, labeling each weighting scheme
comparisons_femrebels_draws <- rbindlist(
  comparisons_femrebels_draws,
  idcol = "weights"
)
```
Figure 6 shows a side by side comparison of the weighted average of the contrasts under the different weighting approaches. The results are largely the same as those above, but we have now combined the two measures into a single model-wise mixture distribution and can incorporate what is effectively measurement uncertainty directly into the quantity of interest. The marginaleffects implementation is somewhat limited by its dependence on brms::pp_average, and working out a more efficient approach with the flexibility of those illustrated in the preceding section is on my list of things to do for the coming year.
```r
# Initialize the ggplot2 object for the model averaged AMEs
comparisons_femrebels_draws %>%
  ggplot(aes(y = draw, x = weights)) +
  # Violin plot looking things
  stat_eye(
    aes(
      slab_alpha = after_stat(pdf),
      slab_fill = weights,
      point_fill = weights,
      shape = weights
    ),
    slab_size = 2,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    show.legend = FALSE,
    scale = 0.8
  ) +
  # Adjust the y axis scale
  scale_y_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.19, 0.54)
  ) +
  # Adjust the slab fill scale
  scale_fill_manual(
    values = c("#23C3E4FF", "#C82803FF", "#2EF19CFF"),
    aesthetics = "slab_fill"
  ) +
  # Adjust the point fill scale
  scale_fill_manual(
    values = c("#3AA2FCFF", "#FD8A26FF", "#99FE42FF"),
    aesthetics = "point_fill"
  ) +
  # Adjust the shape scale
  scale_shape_manual(values = c(22, 21, 24)) +
  # Add labels to the plot
  labs(
    y = parse(text = "bold('Model Averaged AME '*Delta)"),
    x = ""
  )
```
Conclusion
This concludes what I hope was a detailed illustration of model averaged marginal effects with brms, marginaleffects, and Stan. The general takeaway is that selecting a single model is usually the wrong approach, and there is seldom any need to do so. While no method of estimation is perfect, BMA and its variants provide a principled, relatively robust way to propagate uncertainty into the quantities typically of interest to researchers in the social sciences and can help us avoid unnecessary dichotomies that are, in most cases, false. By applying probability theory to average across posterior distributions when estimating average marginal effects, we can obtain interpretable model averaged estimates for both linear and non-linear regression models.
As far as subsequent developments are concerned, this is effectively the cornerstone of my dissertation work, so I am hoping to have at least one subsequent post illustrating and implementing BMAMEs for multilevel models that employ non-linear link functions in the coming weeks. In the meantime, however, remember that nothing good ever comes from mixing odds and ratios or exposing your relatives to unnecessary risks, and that holding the random effects in a hierarchical model at their mean still gives you a conditional estimate that results in a weak connection between theory and data. Finally, if you catch any mistakes, and I can almost guarantee there are many in a post of this length, either leave a comment or let me know some other way (i.e., Twitter).
For a recent application of BMA in the context of instrument selection see Rozenas and Zhukov (2019).↩︎
It is entirely possible this is wrong and does not actually make sense, so if someone has a better idea or suggestions for improvements I would love to hear them.↩︎
There is no such thing as an “uninformative prior”; the concept itself is not mathematically defined.↩︎
As Plümper and Traunmüller (2018) demonstrate, BMA performs quite well in terms of this latter objective.↩︎
Those familiar with the post-conflict rebel parties literature in political science are likely rolling their eyes at this sentence because in its present state it is not clear we know much of anything about causal relationships.↩︎
Estimation is performed under Stan version 2.30.1 on a desktop computer with a Ryzen 9 5900X CPU and 128GB of DDR4 memory.↩︎
One might contrast this with frequentist p-values in the NHST framework common in the social sciences, which answer the question: under the simplifying assumption that the null hypothesis is true, what is the probability of observing a result more extreme than the one we did? Although still widely taught, this provides a very precise answer to a question few researchers are interested in, and rejecting the strawman null that the effect of \(\mathrm{X}\) is exactly equal to zero, something no respectable researcher actually believes, tells you nothing about your vaguely defined alternative (Gill 1999; Gross 2014).↩︎
This is only in part because I can’t figure out the correct brms::do_call syntax to pass both a list of marglik objects and model priors to bridgesampling::post_prob.↩︎
Reuse
All content in this post is made available for public use under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
Citation
BibTeX citation:
@online{jordannafa2022,
author = {A. Jordan Nafa},
title = {An {Introduction} to {Being} {Less} {Wrong}},
date = {2022-12-31},
url = {https://www.ajordannafa.com/blog/2022/being-less-wrong},
langid = {en}
}
---title: An Introduction to Being Less Wrongsubtitle: Bayesian Hypothesis Testing and Model Averaged Marginal Effects for the Social Sciencespagetitle: An Introduction to Being Less Wrongdate: 2022-12-31description: "This post provides an introduction to Bayesian Model Averaging and Model Averaged Marginal Effects with Stan, `{brms}`, and `{marginaleffects}`"categories: [R, Bayesian Statistics, Statistics, Marginal Effects, Stan, Logit, Model Averaging]license: All content in this post is made available for public use under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.image: figs/fig-model-averaged-ame-margins.svgtoc-location: leftpage-layout: fullcode-fold: true# Execution optionsexecute: freeze: true# Social Media Informationopen-graph: title: "An Introduction to Being Less Wrong" description: "Bayesian Hypothesis Testing and Model Averaged Marginal Effects for the Social Sciences" image: "figs/bmame-blog-post-preview.png"twitter-card: title: "An Introduction to Being Less Wrong" image: "figs/bmame-blog-post-preview.png" card-style: summary_large_image description: "Bayesian Hypothesis Testing and Model Averaged Marginal Effects for the Social Sciences" creator: "@adamjnafa" site: "@adamjnafa"---```{r setup, include=FALSE}knitr::opts_chunk$set(cache =FALSE,warning =FALSE,message =FALSE,comment =' ',fig.path ='figs/',fig.align ="center",fig.process =function(x) { x2 =sub('-\\d+([.][a-z]+)$', '\\1', x)if (file.rename(x, x2)) x2else x })# Load the necessary librariespacman::p_load("tidyverse", "arrow","data.table","brms", "furrr", "cmdstanr","patchwork","ggdist","marginaleffects",install =FALSE)# Custom theme for data visualizationsplot_theme <-function(title_size =NULL, xaxis_size =NULL, yaxis_size =NULL, strip_size =NULL, strip_face =NULL, caption.hjust =1, caption.vjust =0, x_axis_face =NULL, y_axis_face =NULL, transparent =FALSE, axis_text_size =NULL, legend_text_size =NULL,subtitle_size =NULL,caption_size =NULL,base_size =12, ...) { .theme <-theme_minimal(base_size = base_size) +theme(# Specify the default settings for the plot titleplot.title =element_text(size = title_size,face ="bold",family ="serif" ),# Specify the default settings for caption textplot.caption =element_text(size = caption_size,family ="serif",hjust = caption.hjust,vjust = caption.vjust ),# Specify the default settings for subtitle textplot.subtitle =element_text(size = subtitle_size,family ="serif" ),# Specify the default settings specific to the x axis titleaxis.title.y =element_text(size = yaxis_size, face = y_axis_face, family ="serif",margin =margin(r =10, l =-10) ),# Specify the default settings specific to the y axis titleaxis.title.x =element_text(size = xaxis_size, face = x_axis_face, family ="serif",margin =margin(t =10, b =-10) ),# Specify the default settings for x axis textaxis.text.x =element_text(size = axis_text_size,family ="serif",face = x_axis_face ),# Specify the default settings for y axis textaxis.text.y =element_text(size = axis_text_size,family ="serif",face = y_axis_face ),# Specify the default settings for legend titleslegend.title =element_text(size = legend_text_size,face ="bold",family ="serif" ),# Specify the default settings for legend textlegend.text =element_text(size = legend_text_size,family ="serif" ),# Set the strip background fill to blankstrip.background =element_blank(),# Adjust the strip text size settingsstrip.text =element_text(family ="serif", size = strip_size,face = strip_face ),# Additional Settings Passed to theme() ... 
)# Plot Transparencyif (transparent ==TRUE) { .theme <- .theme +theme(panel.background =element_rect(fill ="transparent", colour =NA),plot.background =element_rect(fill ="transparent", colour =NA),legend.background =element_rect(fill ="transparent", colour =NA),legend.key =element_rect(fill ="transparent", colour =NA) ) }return(.theme)}## Set the base theme settings for ggplot2theme_set(plot_theme(xaxis_size =24,yaxis_size =24,title_size =24,caption_size =24,axis_text_size =24,strip_face ="bold",y_axis_face ="bold",x_axis_face ="bold",plot.margin =margin(2, 2, 4, 5, "mm"),plot.caption.position ="plot",plot.title.position ="plot",strip_size =24,legend_text_size =24,legend.position ="top",caption.hjust =0, caption.vjust =-1,transparent =TRUE ))```# IntroductionGeorge E. P. Box famously described the problem of model development in statistics and related fields by noting, *all models are wrong, but some are useful* [-@Box1976]. That is, all models are necessarily simplifications of a much more complex reality and statistics is by no means an algorithmic truth generating process because no model can ever be true in the pure sense of the word. Yet, this at times underappreciated reality does not render all models useless but rather implies the goal of those who practice statistics in its various forms is to develop and identify models that are useful in a never ending quest to be less wrong. In this post I introduce an approach to this task that has been the subject of my own dissertation work on the use of Bayesian Model Averaging (BMA) to account for uncertainty in model specification when estimating average marginal effects in non-linear regression models. ::: a I outline here what I term a Bayesian Model Averaged Marginal Effect (BMAME) in the context of logistic regression models in political science and illustrate how to obtain BMAMEs in a straightforward manner using the `{marginaleffects}` R package [@ArelBundock2022a], which thanks to Vincent Arel-Bundock's assistance supports model averaged orstacked average marginal effects for any model fit with the Stan interface `{brms}`[@Burkner2017; @Burkner2018]. I also provide a more general demonstration in Stan that makes the method easily applicable in Python and high dimensional settings. Although I focus here on models that include only population-level effects, recent feature additions to `{marginaleffects}` now make it possible to obtainBMAMEs from more complex hierarchical models and I will cover such applications in a subsequent post.::: # Bayesian Model Averaging in Political ScienceAlthough Bayesian Model Averaging was introduced to political science more than two decades ago by @Bartels1997, cases of its use in applied work remain rare and are largely confined to the topic of political methodology [@Juhl2019; @Cranmer2015; @Montgomery2010].[^1] This presents something of a problem given the numerous studies demonstrating the dangers of tossing a large number of likely correlated predictors into a regression model [@Achen2005; @Clarke2005; @Clarke2009; @Montgomery2018]; the reality that if we wish to adjudicate between two or more competing theories, comparing coefficients in a single model is generally insufficient to accomplish such a task [@Imai2011; @Clarke2007; @Hollenbach2020]; and the difficulty of assessing what we truly known about political phenomenon that results from an obsession with "statistical significance" and precise answers to poorly defined questions [@Ward2010; @Schrodt2014; @Lundberg2021]. 
[^1]: For a recent application of BMA in the context of instrument selection see @Rozenas2019.::: a In minor contrast to the past advice of @Montgomery2010 who suggest "BMA is best used as a subsequent robustness check to show that our inferences are not overly sensitive to plausible variations in model specification" (266), I argue here model averaging can and should be used as far more than a robustness check if forno other reason than a model-wise mixture distribution for the paramter of interest is almost certainly more informative than a point estimate from a single model specification. Indeed, BMA providesone of the few natural and intuitive ways of resolving some of the issues outlined in the preceding paragraph, particularly as it pertains to the evaluation of non-nested theoretical models, while alsoexhibiting a lower false positive rate than alternative approaches [@Pluemper2018]. This section provides a brief explanation of BMA and its connection to model averaged marginal effects before proceeding to a discussion of how the procedure is implemented.::: ## Bayesian Model Comparison and Model AveragingConsider a simple case in which we have a set of $\mathcal{M}$ models, each of which characterizes a possible representation of some unknown data generation process. Although it remains common practice in applied political science to select a single model based on a largely arbitrary fit statistic (i.e., AIC, $\chi^{2}$, or whatever people are getting a p-value from this week), such an approach is problematic since it ignores the uncertainty inherent in the process of model specification. Rather than selecting a single model and consequently placing an implicit zero prior on **every other possible model specification you could have reasonably considered but did not**, Bayesian Model Averaging and its analogues provide a means of averaging across several possible, potentially non-nested, model specifications and accounting for the uncertainty associated with doing so.::: a Imagine we wish to adjudicate between two competing theoretical models in the set of candidate models $\mathcal{M}$. The posterior odds of model $i$ relative to an alternative $k$ can be expressed as:::$$\underbrace{\frac{\Pr(\mathcal{M}_{i} \, | \, y)}{\Pr(\mathcal{M}_{k} \, | \, y)}}_{\mathrm{Posterior~Odds}} = \underbrace{\frac{p(y \, | \, \mathcal{M}_{i})}{p(y \, | \, \mathcal{M}_{k})}}_{\mathrm{Bayes~Factor}} \times \underbrace{\frac{\pi(\mathcal{M}_{i})}{\pi(\mathcal{M}_{k})}}_{\mathrm{Model~Prior}}$$ {#eq-posterior-odds} where $\pi(\mathcal{M})$ is the prior probability of model $i$ over model $k$ and $p(y \,|\, \mathcal{M})$ is the marginal likelihood of the observed data under each of the candidate models such that$$\underbrace{p(y \,|\, \mathcal{M})}_{\mathrm{Marginal~Likelihood}} = \int\underbrace{p(y \,|\, \theta,\, \mathcal{M})}_{\mathrm{Likelihood}} \, \underbrace{\pi(\theta \,|\, \mathcal{M})}_{\mathrm{Prior}}d\theta$$ {#eq-marginal-likelihood}::: a Given our prior assumptions about how likely the observed data are to have arisen from each model in the set $\mathcal{M}$, the result of equation @eq-posterior-odds is the posterior odds of model $i$ compared to model $k$ and thus captures the relative probability that model $\mathcal{M_{i}}$ represents the true data generation process compared to the plausible alternative $\mathcal{M_{k}}$. 
We can easily extend this example to the setting in which $\mathcal{M_{k}}$ is a set of plausible competing models of size $k > 2$, in which case the posterior model probability of the $i^{th}$ model relative to each of the alternatives $k$ is:::$$\Pr(\mathcal{M}_{i} \,|\, y) = \frac{p(y \, | \, \mathcal{M}_{i}) \, \cdot \, \pi(\mathcal{M}_{i})}{\displaystyle\sum^{\mathcal{m}}_{k=1} p(y \, | \, \mathcal{M}_{k}) \, \cdot \, \pi(\mathcal{M}_{k})}$$ {#eq-posterior-probability} Applying equation @eq-posterior-probability to each model in the set under consideration yields a vector of relative posterior model probabilities of length $m$ which may be used as either posterior probability weights as in the case of Bayesian Model Averaging or for the estimation of posterior inclusion probabilities.::: a In the context of model averaging, we can take draws from the posterior predictive distribution of each model containing a given parameter of interest proportional to its posterior probability weight obtained from equation @eq-posterior-probability. This yields a model-wise mixture representing a weighted average of the posterior predictive distribution for some parameter such as a regression coefficient for a linear model or as I outline below, an average marginal effect or probability contrast for models that employ a non-linear link function. ::: ## Bayesian Model Averaged Marginal EffectsAverage marginal effects are ubquitous in the social sciences though, as [Andrew Heiss nicely illustrates](https://www.andrewheiss.com/blog/2022/05/20/marginalia/), the term average marginal effect is often used ambiguously and in the context of more complex models, the terms marginal andconditional [tend to be a source of additional confusion](https://www.andrewheiss.com/blog/2022/11/29/conditional-marginal-marginaleffects/). In the case of a simple population-level linear model with a Gaussian likelihood and identity link function, the model averaged marginal effect is equivalent to the posterior distribution of the model averaged coefficient $\beta$ which can be expressed as$$\mathbb{E}(\beta \, | \, y) = \sum_{i=1}^{m}\Pr(\mathcal{M}_{i} \, | \, y) \cdot \mathbb{E}(\beta\, |\, \mathcal{M}_{i}, \, y) \quad \forall \quad i \in \{1, 2,\dots, m\}$$ {#eq-bma-population} where $\Pr(\mathcal{M}_{i} ~ | ~ y)$ represent the posterior model probability and $\mathbb{E}(\beta ~|~ \mathcal{M}_{i}, ~ y)$ is the expected value of some parameter $\beta$ conditional on the observed data and some model specification for each model $i$ in the set of models under consideration $m$ [@Montgomery2010].::: aFor probability models such as logit, and other commonly used formulations for modelingdiscrete responses, this simple closed form solution does not exist nor is there any meaningfulway to interpret the coefficients of these models as efffect size estimates--a reality that unfortunately remains quite widely misunderstood [@Mood2009; @Daniel2020]. As a general principle, **any model morecomplex than a simple linear regression can only be meaningfully interpreted in terms of the predictionsit generates** [@Long2018], which in the case of logistic regression simply means applying the inverse logistic function to obtain predicted probabilities which we can then use to estimate contrasts or average marginal effects [@Norton2019]. 
Letting $\mathrm{Y}$ represent a binary outcome, $\mathrm{X}$ a continuous exposure or treatment of interest, and $\mathrm{Z}$ a matrix of measured confounders, we can express the average marginal effect as:::$$\mathrm{AME}\Delta_{j} = \int \frac{\mathbb{E}[\mathrm{Y}_{ij} ~ | ~ \mathrm{X}_{ij} = \mathrm{x}_{ij} + h, \mathrm{Z}_{ij}] ~-~ \mathbb{E}[\mathrm{Y}_{ij} ~ | ~ \mathrm{X}_{ij} = \mathrm{x}_{ij}, \mathrm{Z}_{ij}]d\mathrm{Z}}{h} \\$$ {#eq-logit-continuous-ame} which provided a sufficiently small value of $h$, yields a reasonable approximation of the partial derivative and provides a posterior distribution for the average effect of an instantaneous change in $\mathrm{X}$ on theprobability scale similar to what one obtains in linear regression.::: aSimilarly, for a binary treatment of interest we apply the Bayesian g-formula to obtain the average marginal effect of $\mathrm{X}$ on the probability of $\mathrm{Y}$ as shown in equation @eq-logit-binary-ame.:::$$\mathrm{AME}\Delta_{j} = \int \mathbb{E}[\mathrm{Y}_{ij} ~ | ~ \mathrm{X}_{ij} = 1, \mathrm{Z}_{ij}] ~-~ \mathbb{E}[\mathrm{Y}_{ij} ~ | ~ \mathrm{X}_{ij} = 0, \mathrm{Z}_{ij}]d\mathrm{Z} \\$$ {#eq-logit-binary-ame}Equations @eq-logit-continuous-ame and @eq-logit-binary-ame make clear we are averaging over the confounder distribution $\mathrm{Z}$ rather than holding it constant at some fixed value [@Oganisian2020; @Keil2017]. This point is an important one because in binary outcome models such as logit and probit, holding $\mathrm{Z}$ constant at some fixed value such as the mean corresponds to **an entirely different estimand** [@Hanmer2012].The AME is a population-averaged quantity, corresponding to the population average treatment effect and thesetwo quantities can produce very different results under fairly general conditions because **they do not answer the same question**.::: aFrom here it is straightforward to obtain a model averaged marginal effect estimate for a discrete outcome modelsuch as logit. 
Given marginal predictions for each model $k$ in the set $m$, we can apply equation @eq-logit-binary-bmame to obtain the posterior distribution of average marginal effects for each model.:::$$\begin{align}\mathrm{BMAME}\Delta_{j} &= \sum_{k=1}^{m} \Pr(\mathcal{M}_{k} | y)\cdot\int \mathbb{E}[\mathrm{Y}_{ij}, \mathcal{M}_{k} | \mathrm{X}_{ij} = 1, \mathrm{Z}_{ij}] ~-~ \mathbb{E}[\mathrm{Y}_{ij}, \mathcal{M}_{k} | \mathrm{X}_{ij} = 0, \mathrm{Z}_{ij}]d\mathrm{Z} \\&= \sum_{k=1}^{m} \Pr(\mathcal{M}_{k} | y)\cdot\mathrm{AME}\Delta_{k,j}\end{align}$$ {#eq-logit-binary-bmame} This yields a model-wise mixture distribution representing the weighted average of the marginal effectsestimates and has broad applicability to scenarios common in the social sciences, industry, and any other setting where a need arises to account for uncertainty in model specification.## Limitations and CaveatsWhile traditional marginal likelihood-based Bayesian model averaging is a powerful and principled technique for dealing with uncertainty in the process of model specification and selection, it is not without limitations and caveats of its own, some of which bear emphasizing before proceeding further.### Marginal Likelihood and Computational UncertaintyFirst, for all but the simplest of models attempting to derive the integral in equation @eq-marginal-likelihood analytically proves to be computationally intractable and we must instead rely on algorithmic approximations such as bridge sampling [@Gronau2017; @Gelman1998; @Wang2020].While the bridge sampling approximation tends to perform well compared to common alternatives,estimates of the marginal likelihood $p(y \,|\, \mathcal{M})$ may be highly variable upon repeated runs of the algorithm [@Schad2022]. This problem tends to arise in morecomplex hierarchical models with multiple varying effects and, perhaps unsurprisingly, suggeststhat by relying on approximations we introduce an additional source of computational uncertainty that may need to be accounted for.::: aFor the purposes of BMAMEs, I have elsewhere proposed addressing this additional source of uncertainty and incorporating it into the estimates by executing the bridge sampling algorithm $S$ times for each model to obtain a distribution of possible values for the approximate log marginal likelihood. Given a vector of estimates for $p(y ~|~ \mathcal{M})$ of length $s$ for each model $k \in \{1,2,\dots, m\}$, we can apply equation @eq-posterior-probability to obtain an $s \times m$ matrix of posterior model probabilities. It is then straightforward to extend equation @eq-logit-binary-bmame to average over the distribution of posterior probability weights rather than relying on single estimate or point estimate as shown in equation @eq-logit-bmame-reps below.[^2]$$\begin{align}\mathrm{BMAME}\Delta_{j} &=\frac{1}{S}\sum_{s=1}^{S} \left[\sum_{k=1}^{m} \Pr(\mathcal{M}_{k, s} | y)\cdot\int \mathbb{E}[\mathrm{Y}_{ij}, \mathcal{M}_{k} | \mathrm{X}_{ij} = 1, \mathrm{Z}_{ij}] ~-~ \mathbb{E}[\mathrm{Y}_{ij}, \mathcal{M}_{k} | \mathrm{X}_{ij} = 0, \mathrm{Z}_{ij}]d\mathrm{Z}\right]\\&= \frac{1}{S}\sum_{s=1}^{S}\left[\sum_{k=1}^{m} \Pr(\mathcal{M}_{k, s} | y)\cdot\mathrm{AME}\Delta_{k,j}\right]\end{align}$$ {#eq-logit-bmame-reps} Whether this approach is necessary depends largely on how variable the algorithm is on repeated runsand it may simply be adequate to instead take the median or mean of the marginal likelihood estimates foreach model. 
Although it is beyond the scope of this post, one could also use exhaustive leave-future-out cross-validation which is equivalent to the marginal likelihood given a logarithmic scoring rule [@Fong2020; @Buerkner2020], though this approach tends to be extremely expensive in terms of computation.:::[^2]: It is entirely possible this is wrong and does not actualy make sense so if someone has a better idea or suggestions for improvements I would love to hear them.### Bayes Factors and Prior SensitivityThe second caveat with respect to obtaining BMAMEs stems from a common criticism of Bayes Factors. Althoughas the amount of available data $n \longrightarrow \infty$, the likelihood will tend to dominate the prior in Bayesian estimation, this is not necessarily the case in terms of inference. Indeed, from equations @eq-marginal-likelihood and @eq-posterior-probability one can quite clearly see the the posterior probability $\Pr(\mathcal{M} ~| ~y)$ depends on the prior probability of the parameters, $\pi(\theta \,|\, \mathcal{M})$ in equation @eq-marginal-likelihood, and the respective prior probability $\pi(\mathcal{M})$ of each model in the set under consideration. Since our substantive conclusions may be sensitive to either of these prior specifications, these assumptions should be checked to verify the results do not vary wildly under reasonable alternatives. ::: aPerhaps more importantly, due to their dependence on Bayes Factors, posterior probability weights require atleast moderatlely informative priors on each and every parameter in a model and it is by this point common knowledge that specifying a flat or "uninformative" prior such as $\beta \sim \mathrm{Uniform(-\infty, \infty)}$ or even an excessively vague one tends to bias Bayes Factors--and by extension values that depend on them--violently in favor of the null model.[^3] This is a feature of applied Bayesian inference rather than abug and **if you make stupid assumptions, you will arrive at stupid conclusions**. The solution to this problem is to think carefully about the universe of possible effect sizes you might observe for a given parameter, assign reasonable priors that constrain the parameter space, and verify the results are robust to alternative distributional assumptions.[^3]: There is no such thing as an "uninformative prior"; the concept itself is not mathematically defined.:::::: {.callout-caution collapse="true"}### Likeihoods and PriorsThe degree to which a prior can be considered informative is heavily dependent on the likelihood and can often only be understood in that context [@Gelman2017].:::### Stacking, and the Open-$\mathcal{M}$ ProblemFinally, traditional BMA is flawed in the sense that its validity rests upon the closed $\mathcal{M}$ assumption implicit in equation @eq-posterior-probability. In the open $\mathcal{M}$ setting where the true model is not among those in the set of candidate models under consideration, as will generally be the case in any real world application, estimated effects based on posterior probability weights are likely to be biased [@Hollenbach2020; @Yao2018]. This problem arises from the fact that marginal-likelihood based posterior probability weights for each model $m \in \mathcal{M}$ correspond to the probability a given model captures the true data generation process and in most settings this is unlikely to ever be the case since, by definition, *all models are wrong*. 
::: aIn the open $\mathcal{M}$ setting, traditional BMA will still assign each model some probability but since by definition $\sum_{i=1}^{m}\Pr(\mathcal{M}_{i} ~|~ y) = 1$, these weights no longer have a valid interpretation in terms of the posterior probability a given model is true. Although this does not render marginal likelihood based model averaging useless and it may often make more sense to think about model specification and selection as a probabilistic process aimed at identifying the most likely model conditional on the set of models under consideration [@Hinne2020], it is advisable to assess whether our substantive conclusions are sensitive to potential violations of the closed $\mathcal{M}$ assumption. This once again underscores the fact that statistics is not a mechanistic truth generating process and *there is no magic*. It also underscores the need to think in more local terms--rather than asking *is this model true* we generally want to know *the probability $\mathrm{X}$ belongs to the true data generation process*.[^4][^4]: As @Pluemper2018 demonstrate, BMA performs quite well in terms of this latter objective.An alternative but still properly Bayesian approach aimed at addressing these potential flaws in traditional BMA is cross-validation based stacking [@Yao2018]. In contrast to traditional BMA, stacking and pseudo-BMA weights are constructed based on the relative expectation of the posterior predictive density [@Hollenbach2020; @Yao2018]. These approaches to estimating model weights do not rely upon the closed $\mathcal{M}$ assumption, and thus provide a way of either avoiding it altogether or relaxing it as a robustness check on the posterior probability weights, albeit while answering a fundamentally different question more appropriate for an open $\mathcal{M}$ world. Since the overall approach to calculating model averaged AMEs does not changeand we are merely substituting the posterior probability weights in equation @eq-logit-binary-bmame withstacking or pseudo-BMA weights, I direct readers to @Yao2018 for a formal discussion of these approaches.::: # Female Combatants and Post-Conflict Rebel PartiesTo illustrate the utility of BMAMEs in political science, I draw on an example adapted from the second chapter ofmy dissertation in which I examine how the active participation of women in rebel groups during wartime shapes the political calculus of former combatant groups at war's end. In short, I expect rebel groups where women participate in combat roles during wartime to be *more likely* to form political parties and participate in post-conflict elections. ## DataTo identify the universe of post-conflict successor parties to former combatant groups I rely on the Civil War Successor Party (CWSP) dataset which includes all civil wars in the UCDP Armed Conflicts Dataset (UCDP) that terminated between 1970 and 2015 [@Daly2020]. Since the primary distinction between political parties and other civil society organizations such as interest groups lies in the latter's participation in elections and fielding candidates for political office and a party that does not compete in elections is no party at all, I operationalize our dependent variable, *Rebel to Party Transition*, as a binary event which takes a value of 1 if a rebel group both forms a political party and directly participates in a country's first post-conflict elections and 0 otherwise. 
The resulting sample consists of 112 observations for 108 unique rebel groups across 56 countries, approximately 66% of which formed parties and participated in their country's first post-conflict election.::: aFollowing @Wood2017, I define female combatants as members of a rebel group who are "armed and participate in organized combat activities on behalf of a rebel organization" (38). I code instances of female participation in rebel groups by mapping version 1.4 of the Women in Armed Rebellion Data (WARD) to the CWSP and construct two versions of the main predictor of interest based on the best estimate for the presence of female combatants in a given rebel group, one of which includes cases in which women served only as suicide bombers while the other codes these groups as not having any women in combat positions. Each of these variables takes a value of 1 if women participated in combat in a given rebel group and 0 otherwise. Among the rebel groups in our analysis, female combatants have been active in some capacity in 52.67% of those who were involved in civil wars that have terminated since 1970 and 50.89% featured women in non-suicide bombing roles.I also adjust for several additional factors that may otherwise confound the estimate of the effect of female participation in combat on rebel successor party transitions. First, insurgent groups that attempt to perform functions typically characteristic of the state during wartime may be more likely to transition to political parties following the end of a conflict since, in doing so, they both develop organizational capacity and if successful, demonstrate that they are capable of governing to their future base of support. Moreover, as rebel groups build organizational capacity and accumulate institutional knowledge, their ability to sustain a successful insurgency and recruit combatants grows as well. To adjust for rebel organizational capacity I specify a Bayesian multidimensional item response model to construct a latent scale capturing the degree to which rebel groups engaged in functions of the state during wartime based on twenty-six items from Quasi-State Institutions dataset [@Albert2022].Second, I adjust for baseline characteristics of each rebel group that previous research suggests influences both where women rebel and post-conflict party transitions. Drawing on data from @Braithwaite2019, I construct a series of dichotomous indicator variables for whether a rebel group is affiliated with an existing political party [@Manning2016], nationalist and left-wing or communist ideological origins [@Wood2017], and whether a group has ties to a specific ethnic community since this may provide advantages in terms of mobilizing combatants and voters along specific cleavages. 
I also construct two additional indictators to capture whether the United Nations intervened in the conflict and whether a given rebel group was located in Africa.While it is, of course, impossible in the context of an observational setting such as this one to isolateevery possible source of confounding, nor is it generally advisable to attempt to do so, I focus on these baseline covariates since existing research provides a strong theoretical basis to suspect they impact the relationship of interest.[^5] For the purposes of this illustration, I assume this set of covariatesis sufficient to satisfy strong ignorability but such an assumption is likely unreasonable and the particular analysis here may ultimately best be taken as descriptive in nature.:::[^5]: Those familiar with the post-conflict rebel parties literature in political science are likely rolling their eyes at this sentence because in its present state it is not clear we know much of anything about causal relationships.```{r}# Load the required librariespacman::p_load("tidyverse", "arrow","data.table","brms", "furrr", "cmdstanr","patchwork","ggdist","marginaleffects",install =FALSE)# Read in the pre-processed datamodel_df <-read_parquet("data/model_data.gz.parquet")```## Model and PriorsTo estimate the effect of women's participation in combat on post-conflict rebel party transitions for this example, I specify a simple logit model as shown in equation @eq-logit-model. The bernoulli probability $\mu$ of a rebel group transitioning to a political party at war's end is modeled as a function of the population-level intercept $\alpha$ and a vector of coefficients $\beta_{k}$ corresponding to the matrix of inputs $X_{n}$.$$\begin{align}y_{i} &\sim \mathrm{Bernoulli}(\mu)\\\mathrm{logit}(\mu) &= \alpha + X_{n}\beta_{k}\\\text{with priors}\\\alpha &\sim \mathrm{Normal}(0, 1)\\\beta_{k} &\sim \mathrm{Normal}(0, 0.75)\end{align}$$ {#eq-logit-model}::: aSince we model $\mu$ via a logit link function, our priors on the intercept $\alpha$ and coefficients $\beta_{k}$ should correspond to the expected change in the log odds of the response for each predictor. Some may find the process of prior elicitation in logistic regression models difficult since reasoning on the logit scale may seem unintuitive. As figure @fig-logit-scaleillustrates, the the log odds of the response for models with that employ a logit link function is effectively bounded between approximately $-6$ and $6$. Values in excess of $|5|$ generally should not occur unless your data exhibits near complete or quasi-complete separation.:::```{r fig-logit-scale, echo = TRUE, dev='svg'}#| label: fig-logit-scale#| fig-width: 14#| fig-height: 12#| fig-align: center#| fig-dpi: 300#| fig-cap: "Scale of the Logit Link"#| ## Simulate logit probabilitieslogit_scale <-tibble(x =seq(-6.5, 6.5, 0.0001),y = brms::inv_logit_scaled(x))## Plot the logit scaleggplot(logit_scale, aes(x = x, y = y)) +## Add a line geomgeom_line(size =3, color ="firebrick") +## Add labelslabs(x ="Log Odds", y ="Probability") +## Adjust the x axis scale breaksscale_x_continuous(breaks =seq(-6, 6, 1)) +## Adjust the y axis scale breaksscale_y_continuous(breaks =seq(0, 1, 0.1))```::: aFramed in terms of effect sizes, a change of $\pm 3$ in the log odds of the response relative to the baseline spans half the logit scale in either direction. 
In the social sciences where data is largely random noise and effect sizes are typically small to moderate at best and while the road to hell is paved with *rules of thumb*, we can generally assume *a priori* that 1) any true effect of a given parameter $\beta$ in the population is likely modest in magnitude and 2) values closer zero are more likely than those further away. From these relatively straightforward assumptions, we can arrive at a prior for the coefficients that follows a normal distribution with mean 0 and scale $0.75$ as shown in equation @eq-logit-model. In practical terms, this assumes the universe possible effect sizes for each predictor is normally distributed in the range of approximately $-1.5$ and $1.5$ on the logit scale. For the population-level intercept $\alpha$ I assign a slightly more diffuse normal prior mean 0 and scale 1 or $\alpha \sim \text{Normal}(0,~1)$ which reflects a slightly greater degree of uncertainty about the baseline probability but is still moderately informative and constrains the bulk of the prior density in the range of realistic values. It is important to note here that as shown in Stan model below, we are specifying a prior on the intercept for the centered inputs rather than on the original scale. To more explicitly communicate our prior assumptions, we can use the tools for visualizing uncertainty provided in the `{ggdist}` package as shown in figure @fig-logit-priors below [@Kay2022].:::```{r logit-priors-code, echo = TRUE, eval = FALSE}# Simulate the prior probabilitieslogit_priors <-tibble(alpha_prior =rnorm(10e5, 0, 1),beta_prior =rnorm(10e5, 0, 0.75)) %>%## Pivot the simulated data to long formpivot_longer(everything())# Set parameter names for the facets----logit_prior_labels <-as_labeller(x =c("beta_prior"="bolditalic(beta[k] * phantom(.) * phantom() %~% phantom() * phantom(.) * Normal(0*list(,) * phantom(.) * 0.75))","alpha_prior"="bolditalic(alpha[k] * phantom(.) * phantom() %~% phantom() * phantom(.) * Normal(0*list(,) * phantom(.) 
* 1))" ), default = label_parsed)# Plot the prior distributionslogit_prior_plots <- logit_priors %>%# Initiate the ggplot objectggplot(aes(x = value, fill = name, shape = name)) +# Facet by the prior distributionsfacet_wrap(~ name, scales ="free", labeller = logit_prior_labels) +# Add a half eye geom for the prior distributionsstat_halfeye(aes(slab_alpha =stat(pdf)),.width =c(.68, .95),fill_type ="gradient",show.legend =FALSE ) +# Manually set the fill colorsscale_fill_manual(values =c("#208820", "#0060B0")) +# Manually set the shapesscale_shape_manual(values =c(22, 24)) +# Adjust the breaks on the y axisscale_x_continuous(breaks = scales::pretty_breaks(n =8)) +# Add labels to the plotlabs(y ="Prior Probability Density",x = latex2exp::TeX(r'(Log $\Pr(\theta_{prior})$)', bold = T) )# Plot back transformed probabilitiesprob_trans_plots <- logit_priors %>%# Initiate the ggplot objectggplot(aes(fill = name, shape = name)) +# Facet by the prior distributionsfacet_wrap(~ name, scales ="free", labeller = logit_prior_labels) +# Add a half eye geom for the prior distributionsstat_halfeye(aes(x =plogis(value), slab_alpha =stat(pdf)),.width =c(.68, .95),fill_type ="gradient",show.legend =FALSE ) +# Manually set the fill colorsscale_fill_manual(values =c("#208820", "#0060B0")) +# Manually set the shapesscale_shape_manual(values =c(22, 24)) +# Add labels to the plotlabs(y ="Prior Probability Density",title ="",x = latex2exp::TeX(r'($\Pr(\theta_{prior})$)', bold =TRUE) ) +# Adjust the breaks on the x axisscale_x_continuous(breaks = scales::pretty_breaks(n =8))## Print the combined plotlogit_prior_plots/prob_trans_plots``````{r, echo = FALSE}#| label: fig-logit-priors#| fig-align: center#| fig-cap: "Priors for the Logit Model Parameters"knitr::include_graphics("figs/Prior_Probs_S2.png")```::: aSince the model here is computationally quite efficient--on average, each model takes approximately 2.5 seconds to converge with 30,000 post-warmup draws--and in the interest of completeness, I search the full parameter space which consists of $2^{k}$ possible model specifications. Although it is possible to do this in this context, researchers will often instead need to place constrain the number of possible model configurations, generally by specifying an implicit prior model probability of 1 for certain parameters [@Montgomery2010; @Hinne2020]. Although `{brms}` provides an easy to use R interface to Stan that makes fitting a wide variety of Bayesian regression models easier than ever before and requires no direct knowledge of Stan, it is limited for my purposes here by the fact that it requires recompiling the underlying Stan model for each fit. This process ends up being extremely inefficient for a large number of model configurations since each model takes three times as long to compile as it does to fit.To circumvent this issue, I implement the model directly in Stan in a way designed to be directly compatible with `{brms}`. For models containing the parameter of interest, the model implments the bayesian g-formula to estimate the average marginal effect of women's participation in combat on the probability of a rebel-to-party transition during the estimation process in the generated quanitites block. As I previously demonstrated in [this post](../statistical-adjustment-interpretation/index.qmd), it is then possible to compile the Stan model once and use the same program for each dataset using `{cmdstanr}` directly. 
Further computational gains can be obtained by using the `{furrr}` package to fit multiple models in parallel, and I define a series of functions to streamline this process as shown below.

::: panel-tabset
### g-formula

```stan
// Bayesian Logit Model with Optional Bootstrapped AMEs via the Bayesian g-formula
data {
  int<lower=1> N; // N Observations
  array[N] int Y; // Response Variable
  int<lower=1> K; // Number of Population-Level Effects
  matrix[N, K] X; // Population-level Design Matrix
  int prior_only; // Should the likelihood be ignored?
  int treat_pos; // Data for the bootstrapped AMEs
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc; // centered version of X without an intercept
  vector[Kc] means_X; // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
  // Bootstrap Probabilities
  vector[N] boot_probs = rep_vector(1.0 / N, N);
  // Matrices for the bootstrapped predictions
  matrix[N, Kc] Xa;
  matrix[N, Kc] Xb;
  // Potential Outcome Y(X = 1, Z)
  Xa = X[, 2:K];
  Xa[, treat_pos - 1] = ones_vector(N);
  // Potential Outcome Y(X = 0, Z)
  Xb = X[, 2:K];
  Xb[, treat_pos - 1] = zeros_vector(N);
}
parameters {
  vector[Kc] b; // Regression Coefficients
  real Intercept; // Centered Intercept
}
transformed parameters {
  real lprior = 0; // Prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 0.75);
  lprior += normal_lpdf(Intercept | 0, 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += bernoulli_logit_glm_lpmf(Y | Xc, Intercept, b);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // Actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // Additionally sample draws from priors
  array[Kc] real prior_b = normal_rng(0, 0.75);
  real prior_Intercept = normal_rng(0, 1);
  // Row index to be sampled for bootstrap
  int row_i;
  // Calculate Average Marginal Effect in the Bootstrapped sample
  real AME = 0;
  array[N] real Y_X1; // Potential Outcome Y(X = 1, Z)
  array[N] real Y_X0; // Potential Outcome Y(X = 0, Z)
  for (n in 1:N) {
    // Sample the Baseline Covariates
    row_i = categorical_rng(boot_probs);
    // Sample Y(x) where x = 1 and x = 0
    Y_X1[n] = bernoulli_logit_rng(b_Intercept + Xa[row_i] * b);
    Y_X0[n] = bernoulli_logit_rng(b_Intercept + Xb[row_i] * b);
    // Add Contribution of the ith Observation to the Bootstrapped AME
    AME = AME + (Y_X1[n] - Y_X0[n]) / N;
  }
  // Take the mean of the posterior expectations
  real EYX1 = mean(Y_X1); // E[Y | X = 1, Z]
  real EYX0 = mean(Y_X0); // E[Y | X = 0, Z]
}
```

### Reference

```stan
// Bayesian Logit Model for Models without the Treatment Variable
data {
  int<lower=1> N; // total number of observations
  array[N] int Y; // response variable
  int<lower=1> K; // number of population-level effects
  matrix[N, K] X; // population-level design matrix
  int prior_only; // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc; // centered version of X without an intercept
  vector[Kc] means_X; // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b; // Regression Coefficients
  real Intercept; // Centered Intercept
}
transformed parameters {
  real lprior = 0; // Prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 0.75);
  lprior += normal_lpdf(Intercept | 0, 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += bernoulli_logit_glm_lpmf(Y | Xc, Intercept, b);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // Actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // Additionally sample draws from priors
  array[Kc] real prior_b = normal_rng(0, 0.75);
  real prior_Intercept = normal_rng(0, 1);
}
```

### Functions

```r
#' A Function for building combinations of formulas for model averaging
#' 
#' @importFrom glue glue_collapse
#' @importFrom purrr map_chr
#' @importFrom stringr str_replace
#'
#' @param terms A named list of terms to create all unique combinations of.
#' See usage example for direction on how to specify each term
#' 
#' @param resp A string specifying the name of the response vector
#' 
#' @param base_terms An optional argument indicating additional terms to
#' include in all formulas. This can be used to place implicit constraints on 
#' the parameter space to reduce computational burden
#' 
#' @param ... Reserved for future development but currently unused
#'
#' @return A character vector of model formulas containing all unique 
#' combinations of the elements in terms
#' 
#' @export bma_formulas
#' 
#' @examples
#' # Specify the input for the resp argument
#' response <- "fun"
#' 
#' # Specify the inputs for the terms argument
#' covariates <- list(
#'   age = c("", "age"),
#'   sex = c("", "sex"),
#'   rock = c("", "rock")
#' )
#' 
#' # Build the formulas
#' model_forms <- bma_formulas(terms = covariates, resp = response)
#' 
bma_formulas <- function(terms, resp, base_terms = NULL, ...) {
  
  ## Validate the formula arguments
  stopifnot(exprs = {
    is.list(terms)
    is.character(resp)
  })
  
  ## Make a list of all unique combinations of lhs ~ rhs
  formulas <- expand.grid(terms, stringsAsFactors = FALSE)
  
  ## Initialize a list to store the formulas in
  out <- list()
  
  ## Build a list of formulas for each pair
  for (i in 1:nrow(formulas)) {
    out[[i]] <- glue_collapse(
      formulas[i, !is.na(formulas[i, ])], 
      sep = " + "
    )
  }
  
  # Paste the response term
  out <- paste(resp, "~", out)
  
  # If base_terms is non-null, add additional terms to all formulas
  if (!is.null(base_terms)) {
    out <- paste(out, base_terms, sep = " + ")
  }
  
  # Replace empty right-hand sides with an intercept-only specification
  out <- map_chr(.x = out, ~ str_replace(.x, "character\\(0\\)", "1"))
  
  return(out)
}

#' Fitting Models in Parallel with brms via cmdstanr
#'
#' @param formula A `brmsformula` object for the model.
#' 
#' @param treat A string argument containing the name of the treatment
#' or a regular expression to match against the columns in `data`.
#' 
#' @param prior A `brmsprior` object containing the priors for the model.
#' 
#' @param data A data frame containing the response and inputs named in
#' `formula`.
#' 
#' @param cores The maximum number of MCMC chains to run in parallel.
#' 
#' @param chains The number of Markov chains to run.
#' 
#' @param warmup The number of warmup iterations to run per chain.
#' 
#' @param sampling The number of post-warmup iterations to run per chain.
#' 
#' @param stan_models The compiled Stan program(s) to be passed to 
#' `cmdstanr::sample`.
#' 
#' @param save_file The file path where the brmsfit object should be saved.
#' 
#' @param ... Additional arguments passed to `brms::brm`
#'
#' @return An object of class `brmsfit`
#' 
#' @export cmdstan_logit
#'
cmdstan_logit <- function(formula, treat, prior, data, cores, chains, 
                          warmup, sampling, stan_models, save_file, ...) {
  
  # Check if the file already exists
  file_check <- file.exists(save_file)
  
  if (isTRUE(file_check)) {
    # Read in the file if it already exists
    out <- readr::read_rds(file = save_file)
  } else {
    # Total iterations for brms
    niter <- sampling + warmup
    
    # Initialize an empty brmsfit object
    brms_obj <- brm(
      formula = formula,
      family = brmsfamily("bernoulli", link = "logit"),
      prior = prior,
      data = data,
      cores = cores,
      chains = chains,
      iter = niter,
      warmup = warmup,
      empty = TRUE,
      seed = 123456,
      sample_prior = "yes",
      ...
    )
    
    # Get the data for the stan model
    stan_data <- standata(brms_obj)
    
    # Check whether the model contains the treatment
    stan_data$treat_pos <- .match_treat_args(stan_data, treat)
    
    # Select the Stan model to use based on the data
    if (!is.null(stan_data$treat_pos)) {
      cmdstan_model <- stan_models[["gformula"]]
    } else {
      cmdstan_model <- stan_models[["reference"]]
    }
    
    # Fit the brms model
    out <- .fit_cmdstan_to_brms(cmdstan_model, brms_obj, stan_data, 
                                cores, chains, warmup, sampling)
    
    # Write the brmsfit object to an rds file
    readr::write_rds(out, file = save_file, "xz", compression = 9L)
  }
  
  # Return the model object
  return(out)
}

# Function for detecting the treatment vector
.match_treat_args <- function(stan_data, treat) {
  
  # Check if any of the columns in X match the treatment
  check_treat <- grepl(treat, colnames(stan_data$X))
  
  # Get the position of the treatment if present
  if (any(check_treat)) {
    treat_pos <- which(check_treat)
  } else {
    treat_pos <- NULL
  }
  
  return(treat_pos)
}

# Function to fit the models with cmdstanr and store them in the empty brmsfit
.fit_cmdstan_to_brms <- function(cmdstan_model, brms_obj, stan_data, 
                                 cores, chains, warmup, sampling) {
  
  # Set the initial number of draws to 0
  min_draws <- 0
  
  # Repeat the run if any of the chains stop unexpectedly
  while (min_draws < (sampling * chains)) {
    
    # Fit the Stan Model
    fit <- cmdstan_model$sample(
      data = stan_data,
      seed = 123456,
      refresh = 1000,
      sig_figs = 5,
      parallel_chains = cores,
      chains = chains,
      iter_warmup = warmup,
      iter_sampling = sampling,
      max_treedepth = 13
    )
    
    # Update the check
    min_draws <- posterior::ndraws(fit$draws())
  }
  
  # Convert the output to a stanfit object and add it to the brmsfit
  brms_obj$fit <- rstan::read_stan_csv(fit$output_files())
  
  # Rename the parameters
  brms_obj <- brms::rename_pars(brms_obj)
  
  # Store the expectations inside the brmsfit object
  if (!is.null(stan_data$treat_pos)) {
    # Extract the expectations and contrasts
    brms_obj$predictions <- fit$draws(
      variables = c("EYX0", "EYX1", "AME"), 
      format = "draws_df"
    )
  }
  
  # Store run times
  brms_obj$times <- fit$time()
  
  # Return the brmsfit object
  return(brms_obj)
}
```
:::

::: a
We can then generate a formula for each of the $2^{k}$ model combinations, specify priors, and compile the Stan programs to be passed to `cmdstan_logit` as shown below. I estimate each of the 383 candidate models, running six parallel Markov chains for 8,000 iterations each and discarding the first 3,000 from the initial warmup adaptation stage.[^6] This leaves 30,000 post-warmup draws per model for post-estimation inference. While this is well in excess of what is required for reliable estimation, the large number of samples is necessary to ensure stability of the bridge sampling approximation for the marginal likelihood [@Schad2022]. Fitting three models in parallel via `{furrr}`, estimation and approximate leave-one-out cross-validation take less than 10 minutes in total to complete.
Owing in part to the fact that I run the algorithm 250 times per model, however, estimating the marginal likelihood via bridge sampling takes roughly 9 hours in total.
:::

[^6]: Estimation is performed under Stan version 2.30.1 on a desktop computer with a Ryzen 9 5900X CPU and 128GB of DDR4 memory.

::: panel-tabset
### Preparation

```r
# Define the list of predictors, parameter space is 2^k
model_terms <- list(
  femrebels = c("femrebels", "femrebels_exs", NA),
  left_ideol = c("left_ideol", NA),
  nationalist = c("nationalist", NA),
  un_intervention = c("un_intervention", NA),
  party_affil = c("party_affil", NA),
  ethnic_group = c("ethnic_group", NA),
  africa = c("africa", NA),
  qsi = c("qsi_pol + qsi_econ + qsi_social", NA)
)

# Build the model formulas, results in 383 combinations
model_forms <- bma_formulas(terms = model_terms, resp = "elect_party") %>% 
  map(.x = ., ~ bf(.x))

# Specify priors for the model parameters
logit_priors <- prior(normal(0, 0.75), class = b) +
  prior(normal(0, 1), class = Intercept)

# Names of the files to write the models to
fits_files <- str_c(
  main_priors_dir,
  "Party_Transition_Logit_Main_",
  seq_along(model_forms),
  ".rds"
)

# Compile the Stan models
bayes_logit_mods <- list(
  "reference" = cmdstan_model(
    stan_file = str_c(
      main_priors_dir, 
      "Party_Transition_Logit_Main.stan"
    ),
    force_recompile = FALSE
  ),
  "gformula" = cmdstan_model(
    stan_file = str_c(
      main_priors_dir, 
      "Party_Transition_Logit_Main_gformula.stan"
    ),
    force_recompile = FALSE
  )
)
```

### Estimation

```r
# Planning for the future
plan(tweak(multisession, workers = 3))

# Fit the models in parallel via {future}
bayes_model_fits <- future_map2(
  .x = model_forms,
  .y = fits_files,
  .f = ~ cmdstan_logit(
    formula = .x,
    prior = logit_priors,
    treat = "femrebels|femrebels_exs",
    data = model_df,
    cores = 6,
    chains = 6,
    warmup = 3e3,
    sampling = 5e3,
    stan_models = bayes_logit_mods,
    save_file = .y,
    control = list(max_treedepth = 13),
    save_pars = save_pars(all = TRUE)
  ),
  .options = furrr_options(
    scheduling = 1,
    seed = TRUE,
    prefix = "prefix"
  ),
  .progress = TRUE
) # Takes about ~5 minutes in total
```

### Post-Estimation

```r
# Add PSIS LOO CV Criteria
bayes_model_fits <- future_map2(
  .x = bayes_model_fits,
  .y = fits_files,
  .f = ~ add_criterion(
    .x,
    criterion = "loo",
    cores = 4,
    file = stringr::str_remove(.y, ".rds")
  ),
  .options = furrr_options(
    scheduling = 1,
    seed = TRUE,
    prefix = "prefix"
  ),
  .progress = TRUE
) # Takes about ~4 minutes in total

# Add log marginal likelihood ----
bayes_model_fits <- future_map(
  .x = bayes_model_fits,
  .f = ~ add_criterion(
    .x,
    criterion = "marglik",
    max_iter = 5e3,
    repetitions = 250,
    cores = 8,
    silent = TRUE
  ),
  .options = furrr_options(
    scheduling = 1,
    seed = TRUE,
    prefix = "prefix"
  ),
  .progress = TRUE
) # Takes about ~9.5 hours in total

# Back from the future
plan(sequential)
```
:::

## Hypothesis Testing

In a Bayesian framework, model comparison is formalized by means of posterior model probabilities as shown in equation @eq-posterior-probability. Given $m$ possible candidate models and letting $i$ denote those containing the parameter of interest and $k$ those that do not, we can obtain posterior *inclusion* and *exclusion* probabilities for a given parameter.
The ratio of these two probabilities, shown in equation @eq-inclusion-prob, is an inclusion Bayes factor and represents the relative posterior odds of those models containing the parameter of interest compared to all of the candidate models that do not.

$$\mathrm{BF_{Inclusion}}~=~\frac{\sum_{i=1}\Pr(\mathcal{M}_{i} \, | \, y)}{\sum_{k=1}\Pr(\mathcal{M}_{k} \, | \, y)}$$ {#eq-inclusion-prob}

::: a
In terms of hypothesis testing, posterior inclusion probabilities provide the *epistemologically correct* way to answer the questions most social science studies are concerned with--given our modeling assumptions, how likely are those models containing the treatment $\mathrm{X}$ compared to those that do not. In other words, a posterior inclusion probability provides a continuous measure of evidence we can use to evaluate how probable it is our treatment or intervention of interest belongs to the true data generation process [@Hinne2020].[^7]

Although my interest here lies primarily in using the posterior model probabilities to obtain a model-wise mixture distribution of the substantive effect estimates, I briefly illustrate how one can formally test a hypothesis in a Bayesian framework and obtain a vague answer to a precise question. This may be useful in a sequential testing approach where we first wish to evaluate whether an effect exists before proceeding to interpretation, though one may also simply assume effects exist and focus on estimating their magnitude instead.

For the purposes of this illustration, I compare those models containing each of the two measures of women's participation in combat during civil wars to each of the possible model specifications that does not include either of the two measures. Although by convention it is common to take the prior model probabilities to be equally likely--in effect an implicit uniform prior--this is far from the only possible approach and in many cases, particularly when adjudicating between competing theories, it may make more sense to weight the hypotheses according to some external source such as existing research on the subject, meta-analyses, etc.
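To make the arithmetic behind equation @eq-inclusion-prob concrete before turning to the actual models, the minimal sketch below computes an inclusion Bayes factor from a vector of hypothetical posterior model probabilities; none of these numbers correspond to the models estimated above.

```r
# Hypothetical posterior probabilities for four candidate models, the first
# two of which contain the parameter of interest
post_probs <- c(0.50, 0.25, 0.15, 0.10)
contains_x <- c(TRUE, TRUE, FALSE, FALSE)

# Inclusion Bayes factor: posterior odds of the models containing the
# parameter relative to those that do not, assuming equal prior odds
bf_inclusion <- sum(post_probs[contains_x]) / sum(post_probs[!contains_x])
bf_inclusion # 0.75/0.25 = 3
```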
For simplicity here, I assume each of the models is equally likely a priori and obtain posterior probabilities and inclusion Bayes factors for the parameters of interest as shown below.[^8]
:::

[^7]: One might contrast this with frequentist p-values in the NHST framework common in the social sciences, which answer the question *under the simplifying assumption the null hypothesis is true, what is the probability of observing a result more extreme than the one we did?* Although still widely taught, this provides a very precise answer to a question few researchers are interested in, and rejecting the strawman null that the effect of $\mathrm{X}$ is exactly equal to zero--something no respectable researcher actually believes--tells you nothing about your vaguely defined alternative [@Gross2014; @Gill1999].

[^8]: This is only in part because I can't figure out the correct `brms::do_call` syntax to pass both a list of marglik objects *and* model priors to `bridgesampling::post_prob`.

::: panel-tabset
### Preparation

```r
# Extract the parameters of interest in each model
model_params <- map(
  bayes_model_fits, 
  ~ variables(.x) %>% 
    str_subset(., "^b_")
)

# Build a reference data frame with the positions of the predictors
params_pos_df <- tibble(
  # Model Names
  model = str_c("M", 1:length(model_params)),
  # Female Rebels (Including Suicide Bombers)
  femrebels_pos = map_lgl(
    model_params,
    ~ str_which(.x, "b_femrebels$") %>% length() == 1
  ),
  # Female Rebels (Excluding Suicide Bombers)
  femrebelsexs_pos = map_lgl(
    model_params, 
    ~ str_which(.x, "b_femrebels_exs") %>% length() == 1
  ),
  # No Female Rebels Predictor
  reference_pos = map_lgl(
    model_params, 
    ~ str_which(.x, "b_femrebel") %>% length() == 0
  )
)

# Apply names to the model fits list
names(bayes_model_fits) <- str_c("M", 1:length(bayes_model_fits))

# Extract the marginal likelihood object from each model
main_marglik <- map(bayes_model_fits, ~ .x$criteria$marglik)
```

### Female Rebels (Including SBs)

```r
# Posterior Probabilities for female rebel models
femrebels_marglik <- with(
  params_pos_df, 
  main_marglik[c(
    which(femrebels_pos == TRUE),
    which(reference_pos == TRUE)
  )]
)

# Matrix of Relative Posterior Probabilities
femrebels_post_probs <- do_call(
  "post_prob", 
  args = unname(femrebels_marglik), 
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_post_probs) <- names(femrebels_marglik) 

# Build a tibble for the comparisons
femrebels_vs_null_df <- tibble(
  prob_femrebels = rowSums(femrebels_post_probs[, 1:128]),
  prob_null = rowSums(femrebels_post_probs[, 129:255]),
  Inclusion_BF = prob_femrebels / prob_null
)
```

### Female Rebels (Excluding SBs)

```r
# Posterior Probabilities for female rebel models (excluding SBs)
femrebels_exs_marglik <- with(
  params_pos_df, 
  main_marglik[c(
    which(femrebelsexs_pos == TRUE),
    which(reference_pos == TRUE)
  )]
)

# Matrix of Relative Posterior Probabilities
femrebels_exs_post_probs <- do_call(
  "post_prob", 
  args = unname(femrebels_exs_marglik), 
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_exs_post_probs) <- names(femrebels_exs_marglik) 

# Build a tibble for the comparisons
femrebels_exs_vs_null_df <- tibble(
  prob_femrebels_exs = rowSums(femrebels_exs_post_probs[, 1:128]),
  prob_null = rowSums(femrebels_exs_post_probs[, 129:255]),
  Inclusion_BF = prob_femrebels_exs / prob_null
)
```
:::

```{r read-comparisons, echo=FALSE}
# Load the comparisons list from disk
comparisons <- read_rds("data/comparisons_moderate_priors_ls.rds")

# Posterior Probabilities for female rebel models (including SBs)
femrebels_post_probs <- comparisons$femrebels_post_probs

# Inclusion Probabilities for female rebel models (including SBs)
femrebels_vs_null_df <- comparisons$femrebels

# Posterior Probabilities for female rebel models (excluding SBs)
femrebels_exs_post_probs <- comparisons$femrebels_exs_post_probs

# Inclusion Probabilities for female rebel models (excluding SBs)
femrebels_exs_vs_null_df <- comparisons$femrebels_exs
```

```{r fig-bayes-comparisons, echo = TRUE}
#| label: fig-comparisons-bayes
#| fig-width: 18
#| fig-height: 12
#| fig-align: center
#| fig-dpi: 300
#| fig-cap: "Inclusion Bayes Factors for Female Combatants"

# Fancy ggplots because aesthetics
bf_femrebels_vs_null <- ggplot(data = femrebels_vs_null_df, aes(x = Inclusion_BF)) +
  # Slab stat for the raincloud plot
  stat_slab(
    aes(slab_alpha = stat(pdf)),
    slab_color = "#1BD0D5FF",
    fill_type = "segments",
    fill = "#2EF19CFF",
    show.legend = FALSE,
    scale = 1
  ) +
  # Dot interval stat for the raincloud plot
  stat_dotsinterval(
    point_interval = "median_qi",
    fill = "#99FE42FF",
    shape = 23,
    .width = c(0.68, 0.90),
    point_size = 6,
    stroke = 2,
    slab_color = "#1BD0D5FF",
    show.legend = FALSE,
    side = "bottom",
    layout = "weave",
    dotsize = 0.75
  ) +
  # Add axis labels
  labs(
    title = "Female Combatants vs. Null",
    x = parse(text = "bold(Inclusion~BF[ij])"),
    y = ""
  ) +
  # Suppress the breaks on the y axis
  scale_y_continuous(breaks = NULL) +
  # Adjust the breaks on the x axis
  scale_x_continuous(breaks = scales::pretty_breaks(n = 6))

# Fancy ggplots because aesthetics
bf_femrebels_exs_vs_null <- ggplot(data = femrebels_exs_vs_null_df, aes(x = Inclusion_BF)) +
  # Slab stat for the raincloud plot
  stat_slab(
    aes(slab_alpha = stat(pdf)),
    slab_color = "#AB82FF",
    fill_type = "segments",
    fill = "#9400D3",
    show.legend = FALSE,
    scale = 1
  ) +
  # Dot interval stat for the raincloud plot
  stat_dotsinterval(
    point_interval = "median_qi",
    fill = "#B23AEE",
    shape = 21,
    .width = c(0.68, 0.90),
    point_size = 6,
    stroke = 2,
    slab_color = "#AB82FF",
    show.legend = FALSE,
    side = "bottom",
    layout = "weave",
    dotsize = 0.75
  ) +
  # Add axis labels
  labs(
    title = "Female Combatants (Excluding SBs) vs. Null",
    x = parse(text = "bold(Inclusion~BF[ij])"),
    y = ""
  ) +
  # Suppress the breaks on the y axis
  scale_y_continuous(breaks = NULL) +
  # Adjust the breaks on the x axis
  scale_x_continuous(breaks = scales::pretty_breaks(n = 6))

bf_femrebels_vs_null | bf_femrebels_exs_vs_null
```

::: a
Given the prior assumptions about the universe of effect sizes possible and prior probabilities for the hypothesis that rebel groups in which women participate in combat during wartime are more likely to transition to political parties at war's end, the posterior inclusion probability for female combatants is approximately 0.841 when we include cases in which women only participated as suicide bombers and 0.936 if we code participation only in cases where women served in a combat capacity that was not explicitly limited to suicide bombing. Stated differently, as the inclusion Bayes factors in @fig-comparisons-bayes illustrate, the observed data is between 5.29 and 14.64 times more likely to have arisen from those models that include a predictor for female combatants, depending on whether we consider cases in which women only participated in combat as suicide bombers or not (these factors are simply the posterior inclusion probabilities expressed as odds). This suggests a reasonable degree of evidence in support of the expectation that women's participation in combat has a small to moderate effect on the probability of rebel to party transitions.
:::

## Model Averaged Marginal Effects

Turning to the substantive effect estimate, I illustrate here three different approaches to estimating model averaged marginal effects.
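All three rest on the same model-wise mixture logic: posterior draws of the quantity of interest are sampled from each candidate model in proportion to that model's weight, and the pooled draws approximate the model averaged posterior. The following minimal sketch illustrates the idea with made-up draws and weights rather than the actual model objects.

```r
# A minimal sketch of the model-wise mixture logic using hypothetical values:
# three candidate models, each with 1,000 posterior draws of the AME
set.seed(123)
ame_draws <- list(
  m1 = rnorm(1000, mean = 0.20, sd = 0.05),
  m2 = rnorm(1000, mean = 0.15, sd = 0.06),
  m3 = rnorm(1000, mean = 0.05, sd = 0.04)
)

# Hypothetical posterior model probabilities
post_probs <- c(m1 = 0.60, m2 = 0.30, m3 = 0.10)

# Sample draws from each model in proportion to its posterior probability
n_per_model <- round(post_probs * 1000)
bmame_draws <- unlist(Map(sample, ame_draws, n_per_model))

# The pooled draws approximate the model averaged posterior of the AME
quantile(bmame_draws, probs = c(0.055, 0.5, 0.945))
```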
When fitting each of the models in the Stan program above, we applied the Bayesian g-formula to estimate the population average treatment effect in each of the models containing the female combatants predictor, and we can retrieve those estimates directly from the `brmsfit` objects. The `model_averaged_ame` function shown below implements the approach of averaging over a matrix of posterior probability weights outlined in equation @eq-logit-bmame-reps along with the simpler approach from equation @eq-logit-binary-bmame.

::: panel-tabset
### Functions

```r
#' Estimating Model Averaged Marginal Effects and Contrasts for Bayesian Models
#' 
#' This function facilitates the estimation of model averaged and stacked 
#' marginal effects for Bayesian models and is designed to account for uncertainty 
#' in the process of model specification and selection when estimating average 
#' marginal effects and probability contrasts.
#'
#' @aliases model_averaged_ame
#' 
#' @import data.table
#'
#' @param x A datatable of posterior draws for the average marginal effects
#' or predictions such as that returned by the `posterior_contrasts` function.
#'
#' @param weights An \eqn{n \times m} matrix of posterior probability weights 
#' such as that returned by `bridgesampling::post_prob` or an m length vector
#' of stacking or pseudo-BMA weights.
#' 
#' @param ndraws The total number of posterior draws to sample across models.
#' 
#' @param summary A logical argument indicating whether to return the full 
#' draws for each row in `weights` or return the average for each model-draw 
#' pair. Defaults to `TRUE`
#'
#' @param ... Additional arguments for future development, currently unused.
#'
#' @return A datatable containing a weighted average of the posterior draws 
#' for the model averaged marginal effects.
#' 
#' @export model_averaged_ame
#' 
model_averaged_ame <- function(x, weights, ndraws, summary = TRUE, ...) {
  
  # Nest the draws data frame
  draws <- x[, list(x.yz = list(.SD)), by = model]
  
  # If weights are a matrix, calculate the draws by row
  if (is.matrix(weights)) {
    
    # Construct a matrix of draw weights
    weighted_draws <- apply(
      weights, 
      MARGIN = 1, 
      FUN = function(x) {
        brms:::round_largest_remainder(x * ndraws)
      }
    )
    
    # Initialize a list to store the draws in
    out <- list()
    
    # Loop over each column in the matrix
    for (i in 1:ncol(weighted_draws)) {
      
      # Randomly sample n rows for each model based on the weights
      draw_ids <- lapply(weighted_draws[, i], function(x) {
        sample(sum(weighted_draws[, i]), size = x)
      })
      
      # Randomly sample n draws proportional to the weights
      out[[i]] <- lapply(seq_along(draws[, x.yz]), function(id) {
        draws[, x.yz[[id]][draw_ids[[id]]]]
      })
      
      # Bind the draws into a single data table per repetition
      out[[i]] <- rbindlist(out[[i]], idcol = "model")
    }
    
    # Combine everything into a single data table
    out <- rbindlist(out, idcol = "bridge_rep")
    
    # Average over the bridge sampling repetitions by draw
    if (isTRUE(summary)) {
      out <- out[
        , lapply(.SD, mean), 
        keyby = .(.draw, model),
        .SDcols = patterns("EY|AME")
      ]
    }
    
  } else if (is.vector(weights)) {
    
    # Construct a vector of draw weights
    weighted_draws <- brms:::round_largest_remainder(weights * ndraws)
    
    # Randomly sample n rows for each model based on the weights
    draw_ids <- lapply(weighted_draws, function(x) {
      sample(sum(weighted_draws), size = x)
    })
    
    # Randomly sample n draws proportional to the weights
    out <- lapply(seq_along(draws[, x.yz]), function(id) {
      draws[, x.yz[[id]][draw_ids[[id]]]]
    })
    
    # Combine everything into a single data table
    out <- rbindlist(out, idcol = "model")
  }
  
  # Return the model averaged draws
  return(out)
}
```

### Female Rebels (Including SBs)

```r
# Set the seed here for reproducibility
set.seed(2023)

# Extract the AMEs from the models containing femrebels
femrebels_ame_boot <- map(
  .x = with(
    params_pos_df, 
    bayes_model_fits[which(femrebels_pos == TRUE)]
  ),
  ~ .x$predictions
) %>% 
  # Append the list of data tables into one
  rbindlist(., idcol = "model")

# Posterior Probabilities for female rebel models
femrebels_bma_marglik <- with(
  params_pos_df, 
  main_marglik[which(femrebels_pos == TRUE)]
)

# Matrix of Relative Posterior Probabilities for BMA
femrebels_bma_post_probs <- do_call(
  "post_prob", 
  args = unname(femrebels_bma_marglik), 
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_bma_post_probs) <- names(femrebels_bma_marglik) 

# Calculate the weighted draws for each model
femrebels_bmame <- model_averaged_ame(
  femrebels_ame_boot, 
  weights = femrebels_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_bmame)
```

```{r, echo=FALSE}
# Read in the bmames from disk
femrebels_bmame <- read_parquet("data/femrebels_bmame_bayesboot.gz.parquet")

# Print the first few rows
head(femrebels_bmame)
```

### Female Rebels (Excluding SBs)

```r
# Extract the AMEs from the models containing femrebels_exs (excluding SBs)
femrebels_exs_ame_boot <- map(
  .x = with(
    params_pos_df, 
    bayes_model_fits[which(femrebelsexs_pos == TRUE)]
  ),
  ~ .x$predictions
) %>% 
  # Append the list of data tables into one
  rbindlist(., idcol = "model")

# Posterior Probabilities for female rebel (excluding SBs) models
femrebels_exs_bma_marglik <- with(
  params_pos_df, 
  main_marglik[which(femrebelsexs_pos == TRUE)]
)

# Matrix of Relative Posterior Probabilities for BMA
femrebels_exs_bma_post_probs <- do_call(
  "post_prob", 
  args = unname(femrebels_exs_bma_marglik), 
  pkg = "bridgesampling"
)

# Apply model names to the matrix
colnames(femrebels_exs_bma_post_probs) <- names(femrebels_exs_bma_marglik) 

# Calculate the weighted draws for each model
femrebels_exs_bmame <- model_averaged_ame(
  femrebels_exs_ame_boot, 
  weights = femrebels_exs_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_exs_bmame)
```

```{r, echo=FALSE}
# Read in the bmames from disk
femrebels_exs_bmame <- read_parquet("data/femrebels_exs_bmame_bayesboot.gz.parquet")

# Print the first few rows
head(femrebels_exs_bmame)
```
:::

::: a
We can then plot the posterior distribution of the BMAMEs using `{ggdist}` as shown below, which allows us to nicely depict the uncertainty in the estimates. As @fig-model-averaged-ame-boot illustrates, the model averaged marginal effect estimates for the expected difference in the probability of a former combatant group transitioning to a post-conflict political party are positive for both measures. When considering only cases where women served in a combat capacity that was not explicitly limited to suicide bombing, we see a 98.38% probability of the effect being positive and in the expected direction, with a posterior median of 0.2232 and an 89% credible interval spanning the range 0.0625 to 0.3750. The results are largely identical for the other measure, which counts cases in which female combatants only served as suicide bombers as instances of participation, with a 96.32% probability of the effect being positive, a posterior median of 0.1875, and an 89% probability the parameter value falls between 0.0268 and 0.3482. Overall, the general conclusion here is that rebel groups where women participate in combat roles during wartime are more likely to form parties and compete in post-conflict elections.
:::

```{r fig-model-averaged-ame-boot, echo = TRUE, dev='svg'}
#| label: fig-model-averaged-ame-boot
#| fig-width: 14
#| fig-height: 12
#| fig-align: center
#| fig-dpi: 300
#| fig-cap: "Model Averaged Marginal Effects and Expectations for the Effect of Female Combatants on the Probability of Post-Conflict Party Transitions"

# Append these into a single dataframe for plotting
bmame_boot_ests <- rbindlist(
  list(
    "Female Combatants (Excluding SBs)" = femrebels_exs_bmame, 
    "Female Combatants (Including SBs)" = femrebels_bmame
  ), 
  idcol = "plot_row"
)

# Initialize the ggplot2 object for the model averaged AMEs
ggplot(bmame_boot_ests, aes(y = AME, x = plot_row)) +
  # Violin plot looking things
  stat_eye(
    aes(
      slab_alpha = after_stat(pdf), 
      slab_fill = stat(y > 0), 
      point_fill = plot_row, 
      shape = plot_row
    ),
    slab_size = 2,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    show.legend = FALSE, 
    scale = 0.7
  ) +
  # Adjust the y axis scale
  scale_y_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.4, 0.65)
  ) +
  # Adjust the slab fill scale
  scale_fill_manual(
    values = c("#23C3E4FF", "#C82803FF"), 
    aesthetics = "slab_fill"
  ) +
  # Adjust the point fill scale
  scale_fill_manual(
    values = c("#3AA2FCFF", "#FD8A26FF"), 
    aesthetics = "point_fill"
  ) +
  # Adjust shape scale
  scale_shape_manual(values = c(22, 21)) +
  # Add labels to the plot
  labs(
    y = parse(text = "bold('Model Averaged AME '*Delta)"),
    x = ""
  )
```

::: a
Although this is the ideal approach since it accounts for the additional source of computational uncertainty introduced by approximating the marginal likelihood, it can be expensive in terms of both computational time and memory requirements, in addition to requiring at least a minimal knowledge of Stan. We can sidestep Stan and focus exclusively on `{brms}` by defining a simple R function to calculate contrasts and model averaged marginal effects as shown below.
As @fig-model-averaged-ame-alt illustrates, the estimates here are virtually identical and do not differ substantively from those in @fig-model-averaged-ame-boot. This estimation process is reasonably straightforward and can be used for any model supported by `{brms}`.
:::

::: panel-tabset
### Functions

```r
#' Posterior Contrasts for Bayesian Models fit with {brms}
#'
#' @param models A named list of model objects of class `brmsfit` for which to 
#' generate predictions.
#' 
#' @param X A string indicating the name of the treatment in the models.
#' 
#' @param data Data used to fit the model objects.
#' 
#' @param contrasts Values of `X` at which to generate predictions. Defaults to
#' `c(0, 1)` but could also be set to factor levels.
#' 
#' @param cores Number of cores to use for parallel computation. If cores > 1,
#' future::multisession is used to speed computation via `{furrr}`. Requires
#' the `{furrr}` package.
#' 
#' @param ... Additional arguments passed down to `brms::posterior_predict`
#'
#' @return A list of posterior contrasts for each model in `models`
#' 
#' @export posterior_contrasts
#'
posterior_contrasts <- function(models, 
                                X, 
                                data,
                                contrasts = c(0, 1),
                                cores = 1, 
                                ...) {
  
  # Check that all elements of models are of class brmsfit
  for (i in seq_along(models)) {
    brms_check <- brms::is.brmsfit(models[[i]])
    stopifnot(
      "All elements in models must be of class brmsfit" = brms_check
    )
  }
  
  # Y_i(X = 0, Z)
  lo <- data |> mutate("{X}" := contrasts[1])
  
  # Y_i(X = 1, Z)
  hi <- data |> mutate("{X}" := contrasts[2])
  
  # Check if cores > 1
  if (cores > 1) {
    
    # Requires {furrr}
    require(furrr)
    
    # Fit models in parallel via future
    plan(tweak(multisession, workers = cores))
    
    # Posterior predictions for each data set
    predictions <- furrr::future_map(
      .x = models,
      ~ .get_posterior_predictions(hi, lo, model = .x, ...),
      .options = furrr::furrr_options(
        scheduling = 1,
        seed = TRUE,
        prefix = "prefix"
      ),
      .progress = TRUE
    )
    
    # Close the future session
    plan(sequential)
  }
  
  if (cores == 1) {
    # Posterior predictions for each data set
    predictions <- purrr::map(
      .x = models,
      ~ .get_posterior_predictions(hi, lo, model = .x, ...)
    )
  }
  
  # Return the list of predictions
  return(predictions)
}

# Helper function to get posterior prediction contrasts
.get_posterior_predictions <- function(hi, lo, model, ...) {
  
  # Predictions for Y_i(X = 1, Z)
  EYX1 <- brms::posterior_predict(model, newdata = hi, ...)
  
  # Predictions for Y_i(X = 0, Z)
  EYX0 <- brms::posterior_predict(model, newdata = lo, ...)
  
  # Average over the observations for each draw in the prediction matrix
  out <- data.table(
    EYX1 = rowMeans(EYX1),
    EYX0 = rowMeans(EYX0),
    AME = rowMeans(EYX1 - EYX0),
    .draw = 1:ndraws(model)
  )
  
  # Return just the average contrast
  return(out)
}
```

### Female Rebels (Including SBs)

```r
# Models containing femrebels
femrebels_mods <- with(
  params_pos_df, 
  bayes_model_fits[which(femrebels_pos == TRUE)]
)

# Generate posterior expectations for the AMEs
femrebels_ame_draws <- posterior_contrasts(
  models = femrebels_mods, 
  X = "femrebels", 
  data = femrebels_mods[[1]]$data,
  contrasts = c(0, 1),
  cores = 4
) %>% 
  # Append the list of data tables into one for model_averaged_ame
  rbindlist(idcol = "model")

# Calculate the weighted draws for each model
femrebels_bmame_draws <- model_averaged_ame(
  femrebels_ame_draws, 
  weights = femrebels_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_bmame_draws)
```

```{r, echo=FALSE}
# Read in the file from disk
femrebels_bmame_draws <- read_parquet("data/femrebels_bmame_direct_contrast.gz.parquet")

# Print the first few rows
head(femrebels_bmame_draws)
```

### Female Rebels (Excluding SBs)

```r
# Models containing femrebels_exs
femrebels_exs_mods <- with(
  params_pos_df, 
  bayes_model_fits[which(femrebelsexs_pos == TRUE)]
)

# Generate posterior expectations for the AMEs
femrebels_exs_ame_draws <- posterior_contrasts(
  models = femrebels_exs_mods, 
  X = "femrebels_exs", 
  data = femrebels_exs_mods$M112$data,
  contrasts = c(0, 1),
  cores = 4
) %>% 
  # Append the list of data tables into one for model_averaged_ame
  rbindlist(idcol = "model")

# Calculate the weighted draws for each model
femrebels_exs_bmame_draws <- model_averaged_ame(
  femrebels_exs_ame_draws, 
  weights = femrebels_exs_bma_post_probs,
  ndraws = 30e3,
  summary = TRUE
)

# Print the first few rows
head(femrebels_exs_bmame_draws)
```

```{r, echo=FALSE}
# Read in the file from disk
femrebels_exs_bmame_draws <- read_parquet("data/femrebels_exs_bmame_direct_contrast.gz.parquet")

# Print the first few rows
head(femrebels_exs_bmame_draws)
```
:::

```{r fig-model-averaged-ame-alt, echo = TRUE, dev='svg'}
#| label: fig-model-averaged-ame-alt
#| fig-width: 16
#| fig-height: 14
#| fig-align: center
#| fig-dpi: 300
#| fig-cap: "Model Averaged Marginal Effects and Expectations for the Effect of Female Combatants on the Probability of Post-Conflict Party Transitions, Direct Estimation"

femrebels_bmame_plot_alt <- femrebels_bmame_draws |> 
  # Initialize the ggplot2 object for the model averaged AME
  ggplot(aes(x = AME)) +
  # Add a slab interval to plot the weighted posterior
  stat_halfeye(
    aes(
      slab_alpha = after_stat(pdf),
      slab_fill = stat(x > 0)
    ),
    slab_size = 2,
    slab_colour = "#1BD0D5FF",
    shape = 22,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    point_fill = "#99FE42FF",
    show.legend = FALSE
  ) +
  # Add labels to the plot
  labs(
    x = parse(text = "bold('Model Averaged AME '*Delta)"),
    y = "Posterior Density",
    title = "Female Combatants (Including Suicide Bomber Roles)"
  ) +
  # Adjust the slab fill scale
  scale_fill_manual(
    values = c("#208820", "#0060B0"), 
    aesthetics = "slab_fill"
  ) +
  # Adjust the x axis scale
  scale_x_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.4, 0.68)
  ) +
  # Adjust the y axis scale
  scale_y_continuous(breaks = scales::pretty_breaks(n = 6))

femrebels_exs_bmame_plot_alt <- femrebels_exs_bmame_draws |> 
  # Initialize the ggplot2 object for the model averaged AME
  ggplot(aes(x = AME)) +
  # Add a slab interval to plot the weighted posterior
  stat_halfeye(
    aes(
      slab_alpha = after_stat(pdf), 
      slab_fill = stat(x > 0)
    ),
    slab_size = 2,
    slab_colour = "#1BD0D5FF",
    shape = 22,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    point_fill = "#99FE42FF",
    show.legend = FALSE
  ) +
  # Add labels to the plot
  labs(
    x = parse(text = "bold('Model Averaged AME '*Delta)"),
    y = "Posterior Density",
    title = "Female Combatants (Excluding Suicide Bomber Roles)"
  ) +
  # Adjust the slab fill scale
  scale_fill_manual(
    values = c("#208820", "#0060B0"), 
    aesthetics = "slab_fill"
  ) +
  # Adjust the x axis scale
  scale_x_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.4, 0.68)
  ) +
  # Adjust the y axis scale
  scale_y_continuous(breaks = scales::pretty_breaks(n = 6))

(femrebels_bmame_plot_alt / femrebels_exs_bmame_plot_alt)
```

## Illustration in {marginaleffects}

Finally, in the interest of encouraging applied researchers to take uncertainty seriously and to integrate sensitivity analysis into their work directly rather than treating it as something to undertake only when forced, I turn now to a straightforward illustration of BMAMEs using the `{marginaleffects}` package, which provides the necessary functionality to handle most of the steps above automatically thanks to the feature-rich support for various approaches to averaging across posterior distributions provided by `{brms}`.

::: a
To obtain the BMAME for a given parameter while accounting for the uncertainty in the model specifications, version 0.5.0 and higher of `{marginaleffects}` allows users to specify the argument `type = "average"` in their call to either `marginaleffects::marginaleffects` or `marginaleffects::comparisons` for objects of class `brmsfit`, along with any additional arguments to be passed down to `pp_average`, such as the type of weights to estimate or, alternatively, a numeric vector of pre-estimated weights. The latter is usually the more computationally efficient option and the approach I take in the example below.

Although up to this point I have been averaging over all possible combinations of models containing the parameter of interest, we will take a step back here and focus on the full model specifications, averaging over two models that differ only in which of the female combatants measures they contain.
I begin by specifying and estimating each of the models using `{brms}`.
:::

::: panel-tabset
### Specification

```r
# Prepare the data, we need the two variables to be named the same
model_data <- list(
  "femrebels" = model_df %>% mutate(femrebel = femrebels),
  "femrebels_exs" = model_df %>% mutate(femrebel = femrebels_exs)
)

# Model formula for the female rebels models
femrebel_bf <- bf(
  elect_party ~ femrebel + left_ideol + nationalist + un_intervention + 
    party_affil + ethnic_group + africa + qsi_pol + qsi_econ + qsi_social
)

# Specify priors for the model parameters
femrebel_logit_priors <- prior(normal(0, 0.75), class = b) +
  prior(normal(0, 1), class = Intercept)

# Define the custom stanvars for the bootstrapped g-formula for comparison purposes
gformula_vars <- stanvar(
  x = 1,
  name = "treat_pos",
  block = "data",
  scode = "int<lower = 1, upper = K> treat_pos; // Treatment Position"
) +
  # Code for the transformed data block
  stanvar(
    scode = "
  // Bootstrap Probabilities
  vector[N] boot_probs = rep_vector(1.0/N, N);

  // Matrices for the bootstrapped predictions
  matrix[N, Kc] Xa;
  matrix[N, Kc] Xb;

  // Potential Outcome Y(X = 1, Z)
  Xa = X[, 2:K];
  Xa[, treat_pos] = ones_vector(N);

  // Potential Outcome Y(X = 0, Z)
  Xb = X[, 2:K];
  Xb[, treat_pos] = zeros_vector(N);
  ",
    position = "end",
    block = "tdata"
  ) +
  # Code for the bootstrapped g-formula in generated quantities
  stanvar(
    scode = "
  // Row index to be sampled for bootstrap
  int row_i;

  // Calculate Average Marginal Effect in the Bootstrapped sample
  real AME = 0;
  array[N] real Y_X1; // Potential Outcome Y(X = 1, Z)
  array[N] real Y_X0; // Potential Outcome Y(X = 0, Z)

  for (n in 1:N) {
    // Sample the Baseline Covariates
    row_i = categorical_rng(boot_probs);

    // Sample Y(x) where x = 1 and x = 0
    Y_X1[n] = bernoulli_logit_rng(b_Intercept + Xa[row_i] * b);
    Y_X0[n] = bernoulli_logit_rng(b_Intercept + Xb[row_i] * b);

    // Add Contribution of the ith Observation to the Bootstrapped AME
    AME = AME + (Y_X1[n] - Y_X0[n])/N;
  }

  // Take the mean of the posterior expectations
  real EYX1 = mean(Y_X1); // E[Y | X = 1, Z]
  real EYX0 = mean(Y_X0); // E[Y | X = 0, Z]
  ",
    position = "end",
    block = "genquant"
  )
```

### Estimation

```r
# Fit each of the models (6 chains, 8k iterations), takes about 3 seconds
femrebels_logit_fits <- map2(
  .x = model_data,
  .y = str_c("Party_Transition_Logit_", names(model_data), ".rds"),
  .f = ~ brm(
    formula = femrebel_bf,
    family = brmsfamily("bernoulli", link = "logit"),
    prior = femrebel_logit_priors,
    data = .x,
    cores = 12, # Max number of cores to use for parallel chains
    chains = 6, # Number of chains, should be at least 4
    iter = 8000, # Total iterations = Warm-Up + Sampling
    warmup = 3000, # Warm-Up Iterations
    stanvars = gformula_vars, # Custom Stan code to include in the model
    save_pars = save_pars(all = TRUE),
    seed = 2023,
    backend = "cmdstanr",
    sample_prior = "yes",
    file = .y
  )
)

# Add PSIS LOO-CV and Marginal Likelihood
femrebels_logit_fits <- map(
  .x = femrebels_logit_fits,
  .f = ~ add_criterion(
    .x,
    criterion = c("loo", "marglik"),
    max_iter = 5e3,
    repetitions = 25,
    cores = 8,
    silent = TRUE
  )
)
```
:::

::: a
We can then use `bridgesampling::post_prob` to obtain the relative probability of each model based on the marginal likelihood estimates and `brms::loo_model_weights` to obtain cross-validation based stacking and pseudo-BMA weights.
The weights can then be passed to `comparisons` to estimate the model averaged probability contrasts.
:::

::: panel-tabset
### Model Weights

```r
# Get the posterior probability weights
postprob_weights <- bridgesampling::post_prob(
  femrebels_logit_fits[[1]]$criteria$marglik, # Full Model, Including Suicide Bombers
  femrebels_logit_fits[[2]]$criteria$marglik, # Full Model, Excluding Suicide Bombers
  prior_prob = c(0.5, 0.5), # This is the default, I'm just being explicit
  model_names = c("femrebels", "femrebels_exs")
)

# Calculate Pseudo-BMA+ Weights
pbma_weights <- loo_model_weights(
  femrebels_logit_fits[[1]], 
  femrebels_logit_fits[[2]],
  method = "pseudobma",
  model_names = names(femrebels_logit_fits),
  BB = TRUE, # Pseudo-BMA+ with Bayesian Bootstrap
  BB_n = 10000, # Use 10k replications for Bayesian Bootstrap
  cores = 8L
)

# Calculate Stacking Weights
stacking_weights <- loo_model_weights(
  femrebels_logit_fits[[1]], 
  femrebels_logit_fits[[2]],
  method = "stacking",
  model_names = names(femrebels_logit_fits),
  optim_control = list(reltol = 1e-10),
  cores = 8L
)

# Store all the weights in a named list
weights_ls <- list(
  "BMA" = apply(postprob_weights, 2, mean),
  "Stacking" = stacking_weights,
  "Pseudo-BMA+" = pbma_weights
)

# Print a comparison of the weights
print(weights_ls)
```

```{r, echo = FALSE}
# Read in the weights
weights_ls <- read_rds("data/example_model_weights.rds")
print(weights_ls)
```

### Comparisons

```r
# Calculate model averaged contrasts
comparisons_femrebels <- map(
  .x = weights_ls,
  ~ comparisons(
    femrebels_logit_fits[[1]],
    m1 = femrebels_logit_fits[[2]], # Subsequent models must be named arguments
    variables = "femrebel",
    type = "average",
    transform_pre = "differenceavg",
    weights = .x,
    method = "posterior_predict" # Accounts for observation-level uncertainty
  )
)

# Extract the posterior draws
comparisons_femrebels_draws <- map(
  .x = comparisons_femrebels,
  ~ posteriordraws(.x, shape = "long")
)

# Bind the draws frames into a data table, with the weighting method as an id
comparisons_femrebels_draws <- rbindlist(
  comparisons_femrebels_draws,
  idcol = "weights"
)
```

```{r, echo = FALSE}
# Read in the draws
comparisons_femrebels_draws <- read_parquet("data/comparisons_femrebels_draws.gz.parquet")
```
:::

::: a
@fig-model-averaged-ame-margins shows a side-by-side comparison of the weighted average of the contrasts based on the different weighting approaches. The results here are largely the same as those above, but we have combined the two measures into a single model-wise mixture distribution and can incorporate what is effectively measurement uncertainty directly into the quantity of interest.
The `{marginaleffects}` implementation is somewhat limited by its dependence on `brms::pp_average`, and it is on my list of things to do for the coming year to work out an approach that allows for the efficiency and flexibility of those illustrated in the preceding section.
:::

```{r fig-model-averaged-ame-margins, echo = TRUE, dev='svg'}
#| label: fig-model-averaged-ame-margins
#| fig-width: 14
#| fig-height: 12
#| fig-align: center
#| fig-dpi: 300
#| fig-cap: "Model Averaged Marginal Effects for the Effect of Female Combatants on the Probability of Post-Conflict Party Transitions, {marginaleffects} Implementation"

# Initialize the ggplot2 object for the model averaged AMEs
comparisons_femrebels_draws %>% 
  ggplot(aes(y = draw, x = weights)) +
  # Violin plot looking things
  stat_eye(
    aes(
      slab_alpha = after_stat(pdf), 
      slab_fill = weights, 
      point_fill = weights, 
      shape = weights
    ),
    slab_size = 2,
    fill_type = "gradient",
    point_interval = "median_qi",
    .width = c(0.68, 0.89),
    point_size = 6,
    stroke = 2,
    show.legend = FALSE, 
    scale = 0.8
  ) +
  # Adjust the y axis scale
  scale_y_continuous(
    breaks = scales::pretty_breaks(n = 10),
    limits = c(-0.19, 0.54)
  ) +
  # Adjust the slab fill scale
  scale_fill_manual(
    values = c("#23C3E4FF", "#C82803FF", "#2EF19CFF"), 
    aesthetics = "slab_fill"
  ) +
  # Adjust the point fill scale
  scale_fill_manual(
    values = c("#3AA2FCFF", "#FD8A26FF", "#99FE42FF"), 
    aesthetics = "point_fill"
  ) +
  # Adjust shape scale
  scale_shape_manual(values = c(22, 21, 24)) +
  # Add labels to the plot
  labs(
    y = parse(text = "bold('Model Averaged AME '*Delta)"),
    x = ""
  )
```

## Conclusion

This concludes what was hopefully a detailed illustration of model averaged marginal effects with `{brms}`, `{marginaleffects}`, and Stan. The general takeaway here is that selecting a single model is usually the wrong approach and there is seldom any need to do so. While no method of estimation is perfect, BMA and its variants provide a principled, relatively robust way to propagate uncertainty in the quantities typically of interest to researchers in the social sciences and can help us avoid unnecessary dichotomies that are, in most cases, false. By applying probability theory to average across posterior distributions when estimating average marginal effects, we can obtain interpretable model averaged estimates for both linear and non-linear regression models.

::: a
As far as subsequent developments are concerned, this is effectively the cornerstone of my dissertation work, so I am hoping to have at least one subsequent post illustrating and implementing BMAMEs for multilevel models that employ non-linear link functions in the coming weeks. In the meantime, however, remember that nothing good ever comes from mixing odds and ratios or exposing your relatives to unnecessary risks, and holding the random effects in a hierarchical model at their mean still gives you a conditional estimate that results in a weak connection between theory and data.

Finally, if you catch any mistakes--and I can almost guarantee there are many in a post of this length--either leave a comment or let me know some other way (e.g., Twitter).
:::

# References