A Fully Bayesian Analysis of Broman’s Socks

Bayesian Discrete Probability

When Karl Broman tweeted about his laundry he likely didn’t imagine that people would still be estimating how many socks he washed 7 years later. In this post my willingness to derive some exact formulae will enable a fully Bayesian, sampling-free, approach to laundry quantification.

I belatedly found my way to the puzzle of estimating exactly how many socks Karl Broman washed through Rasmus Bååth’s excellent blog post, which uses the problem to illustrate Approximate Bayesian Computation.

Rasmus wraps up the post by presenting three potential criticisms of his analysis, one of which is: why use approximate Bayesian computation at all? I’ll take up his challenge and derive an explicit formula for the likelihood function, enabling a Bayesian analysis without the need for sampling methods.

I’ll also propose an alternative model to take into account my personal belief that the Tweet was sent in the knowledge that the twelfth sock was going to break the run of distinct socks!

Assumptions

This post makes some assumptions about you: that you have some experience with (or a willingness to learn) statistics and discrete probability.

I’ll assume that you’re familiar with the concepts of Bayesian statistics: a model (the likelihood) describes our understanding of how some data is generated; this can be combined with initial assumptions about plausible parameter values (prior distributions), and combining these with observed data leads to refined assumptions (posterior distributions).

To follow the derivation of the likelihood you’ll need to know some common constructs from discrete probability/combinatorics: e.g. an understanding of binomial coefficients, and how these relate to counting problems.

Bååth’s Model

Our aim is to estimate the total number of socks that Karl Broman washed given his Tweet that the first 11 removed from the washing machine were distinct.

I’ll use the following notation throughout:

\[ \begin{align*} \bf{\text{Data}}\\ d & = \text{No. successive distinct socks observed before Tweeting}\\ \\ \bf{\text{Parameters}}\\ \rho & = \text{No. pairs of socks in the wash}\\ \sigma & = \text{No. singleton socks in the wash} \\ \end{align*} \]

In the case of the Tweet, \(d = 11\). Using two parameters \(\rho\) and \(\sigma\) allows us to handle the scenario that whilst most socks come in pairs, in some households stray singleton socks are not uncommon. I will however assume socks don’t come in multiples of more than two, ruling out the not-uncommon scenario in which two pairs of identical socks are washed.

The likelihood, \(L(d|\rho,\sigma)\), describes how the unknown parameters generate the observed data: assuming there were \(\rho\) pairs of socks and \(\sigma\) singletons, how likely is it that the first \(d\) socks are all distinct?

As a warm up for deriving a complete formula, I’ll consider some of the edge cases which have logical heuristics:

| Scenario | Heuristic | \(L(d \mid \rho,\sigma)\) |
|---|---|---|
| \(\rho < 0\) or \(\sigma < 0\) | Let’s not be silly: you can’t have negative socks. | 0 |
| \(d > 2\rho + \sigma\) | We can’t observe more socks than were washed. | 0 |
| \(d > \rho + \sigma\) | We can’t observe more distinct socks than the number of distinct socks that were washed. | 0 |
| \(\rho = 0, \, \sigma \geq d\) | If all the socks were different, then of course all the observed socks are different. | 1 |

With the edge cases handled, I’ll press on and handle the substantive problem.

Likelihood \[ L(d|\rho,\sigma) = \binom{2\rho + \sigma}{d}^{-1} \sum_{j=0}^{\sigma} 2^{d-j} \binom{\sigma}{j} \binom{\rho}{d-j} \]

Examples

As a soft check that the formula above is correct, let’s take a look at some specific cases.

Example: \(\bf{\rho = 1, \, \sigma = 1, \,d = 2}\).

This is the smallest non-trivial scenario, and one we can easily check by hand. Denoting the socks by {S, P1, P2}, where P1 and P2 are the two socks of the pair, there are three ways to choose two of them: {S,P1}, {S,P2} and {P1,P2}. In two of the three scenarios the socks are distinct, so the likelihood is 2/3.

Plugging the parameters/data into the formula above: \[ \begin{align} L(2|1,1) & = \binom{3}{2}^{-1} \left\{ 2^2 \binom{1}{0}\binom{1}{2} + 2^1 \binom{1}{1}\binom{1}{1}\right\} = 3^{-1}\left(0 + 2\right) = \frac23 \end{align} \]

Example: \(\bf{\rho=3,\,\sigma=4, \, d = 4}\)

Whilst this example doesn’t sound much more complex, crunching the numbers directly would be pretty tedious (if admittedly not tricky), as the denominator of the likelihood formula indicates there are 210 scenarios to check.

Evaluating the formula for these parameters indicates the likelihood is 129/210 ≈ 0.614.

To validate this claim, I’ll simulate drawing 4 socks from 3 pairs and 4 singletons, and calculate the proportion of these simulations that return distinct socks.

Show code
library(tidyverse)  # provides tibble and purrr::map_dbl

set.seed(1414214)

rho <- 3
sigma <- 4
d <- 4

# vector of all the socks
all_socks <- c(rep(paste0("P",1:rho), 2), paste0("S", 1:sigma))

# a function to sample d socks without replacement from all_socks, and return
# 1 if all socks are distinct, and 0 otherwise
sample_socks <- function(all_socks, d){
  sock_sample <- sample(x = all_socks, size = d, replace = FALSE)
  return( 1 * (length(sock_sample) == length(unique(sock_sample))) )
}

# draw samples 
draws <- tibble(draw = map_dbl(1:1e05, ~sample_socks(all_socks, d)))
Summary

| No. Draws | No. All Distinct | Prob. All Distinct |
|---|---|---|
| 100,000 | 61,360 | 0.614 |

Proof

In words, the likelihood is given by the following fraction

\[ \frac{\text{No. ways to choose d }{\bf{distinct}}\text{ socks from $\rho$ pairs and $\sigma$ singletons.}}{\text{No. ways to choose d socks from $\rho$ pairs and $\sigma$ singletons.}} \]

Starting with the denominator, this is none other than the total number of ways to choose \(d\) objects without replacement from a total of \(2 \rho + \sigma\) (the factor of two is because \(\rho\) pairs of socks equates to \(2 \rho\) individual socks). That is given by the binomial coefficient \(\binom{2\rho + \sigma}{d}\), and explains the leading term in the likelihood formula above.

Turning to the numerator, I’ll break this down by conditioning on the number of singleton socks, counting:

\[\text{No. of ways to choose $d$ distinct socks, given that $j$ of them are singletons.}\]

Starting with the singletons, there are \(\binom{\sigma}{j}\) ways to choose exactly \(j\) of these. The remaining \(d-j\) socks need to come from the pairs; there are \(\rho\) distinct sock types amongst the \(2\rho\) paired socks, so there are \(\binom{\rho}{d-j}\) ways to choose the types. But then for each of these \(d-j\) socks we need an additional factor of 2, as we could have chosen either of the two socks (left and right) of its pair. Bringing this together, we have:

\[2^{d-j}\binom{\rho}{d-j}\binom{\sigma}{j}.\]

The full formula for the numerator, and then the likelihood, follows by summing over the possible values of \(j = 0,\ldots,\sigma\).

For a similar proof, I previously posted an answer on Cross Validated to a slightly different sock-related problem.

In practice when evaluating the likelihood (and later the posterior distribution) there are some computational tricks we employ to avoid running into problems of numerical overflow; these are detailed in the end-notes.

In the plot below we visualise the likelihood for the case of interest, \(d = 11\).

Show code
# log likelihood for Baath's model; logSumExp comes from the matrixStats package
library(matrixStats)

baath_log_likelihood <- function(rho, sigma, d){
  
  # it is not possible to choose more than rho + sigma distinct socks
  if(d > rho + sigma) return(-Inf)
  
  # log summation terms, for the log-sum-exp trick (map_dbl returns the
  # numeric vector that logSumExp expects)
  log_summation_terms <- purrr::map_dbl(0:min(d, sigma), function(j){
    (d-j)*log(2) + lchoose(sigma,j) + lchoose(rho,d-j)
  })
  
  log_likelihood <- -lchoose(2*rho + sigma, d) + logSumExp(log_summation_terms)
  
  return(log_likelihood)
}

# compute grid of likelihood values, fix d = 11
# calculate_sock_grid defined in the Utility Functions appendix
baath_likelihood_grid <- 
  calculate_sock_grid(
    rho_max = 30, sigma_max = 20,
    log_likelihood = function(rho,sigma){baath_log_likelihood(rho,sigma,d=11)}
  )

# plot likelihood grid
# plot_sock_grid defined in the Utility Functions appendix
plot_sock_grid(baath_likelihood_grid, var = likelihood)

The plot demonstrates why a frequentist, maximum likelihood (MLE) approach to solving this problem is guaranteed to give unsatisfactory results: it insists that the most likely scenario is that all of the socks in the wash were unique, whilst being non-committal about how many there were in total (the likelihood is maximised simultaneously for all scenarios where \(\sigma \geq 11\) and \(\rho = 0\)).

Even introducing an assumption that there’s at least one pair is unsatisfactory: in this case the maximum is never achieved as the likelihood increases as \(\rho,\sigma \rightarrow \infty\).

From Prior to Posterior

To avoid getting stuck in trivial edge scenarios, I’ll introduce a prior distribution over \((\rho,\sigma)\), and conduct a Bayesian analysis. The prior captures our beliefs about the number of socks in the washing machine, in the absence of any data.

I’ll put separate priors on the number of singleton and pairs of socks, and then add a restriction that the total socks can’t exceed a fixed (large) amount. This differs from Rasmus’ approach, but produces formulae that are easier to work with: since Rasmus was using sampling, complexity in the formulae wasn’t an issue.

Capping the maximum number of socks has a natural interpretation as we know there are limits on how many socks could fit in a domestic washing machine. More importantly, the cap is essential for computing the normalising constant that crops up when using Bayes’ rule. To dig into that in more detail, I’ll write \(p(\rho,\sigma)\) for the prior distribution, \(p(\rho, \sigma | d)\) for the posterior distribution, and \(L(d|\rho,\sigma)\) for the likelihood as before; Bayes’ rule then gives

\[p(\rho,\sigma | d) = Z^{-1}L(d | \rho, \sigma) p(\rho,\sigma), \qquad \text{where } Z = \sum_{\rho,\sigma} L(d | \rho, \sigma) p(\rho,\sigma).\]

For arbitrary choices of priors, we cannot expect to be able to derive a formula for \(Z\) directly - and therefore we need to evaluate it computationally. By restricting \(\rho, \,\sigma\) to a finite range we can calculate \(Z\) without having to make estimates of the tail behaviour.

The prior distributions on \(\rho\) and \(\sigma\) should be discrete distributions on the non-negative integers. Like Rasmus I’ll use Negative Binomial distributions, which are a flexible generalisation of the Poisson distribution, allowing us to control the mean and variance separately.

I’ll choose parameters for the Negative Binomial distributions so that the marginal distributions of the product prior closely match those of Rasmus’ prior, which he derived based on assumptions about the plausible amount of washing a family might produce. This leads to me using the priors:

\[\rho \sim \text{NegBinom}\left(3 \frac{1}{4}, \frac{1}{5}\right), \qquad \sigma \sim \text{NegBinom}\left(2, \frac{1}{3}\right).\]

I’ll cap the maximum number of socks that could plausibly have been washed at 300. To justify this: Googling the average weight of a sock returns a range of estimates, with the low end at around 50g. 300 socks would therefore come in at 15kg, which is the largest drum size of any domestic washing machine I could find on Amazon!

Writing out the prior explicitly gives:

\[p(\rho,\sigma) = \textstyle \mathbf{\large 1}_{2\rho + \sigma \leq 300} \, \times \, \binom{\rho + 2\frac{1}{4}}{\rho} \left(\frac{1}{5}\right)^{3 \frac{1}{4}} \left(1 - \frac{1}{5}\right)^{\rho} \,\,\times\,\, \binom{\sigma + 1}{\sigma} \left(\frac{1}{3}\right)^{2} \left(1 - \frac{1}{3}\right)^{\sigma}\]

The plots/table below compare samples from the product prior and the prior Rasmus uses.

Marginals

| Measure | Prior | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|
| Total socks | Bååth | 13 | 19 | 28 | 38 | 50 |
| | Product | 12 | 18 | 27 | 39 | 52 |
| Pairs of socks | Bååth | 5 | 8 | 12 | 17 | 22 |
| | Product | 4 | 7 | 12 | 17 | 24 |
| Singleton socks | Bååth | 1 | 2 | 3 | 5 | 8 |
| | Product | 0 | 1 | 3 | 6 | 9 |
| Proportion single | Bååth | 0.03 | 0.06 | 0.1 | 0.16 | 0.22 |
| | Product | 0 | 0.05 | 0.12 | 0.23 | 0.37 |

Density

The marginal distributions align closely, with the exception that the product prior has heavier tails for the proportion of singleton socks. This is visible in the (empirical) density plot, where Rasmus’ prior clearly has a sharper peak, and is less dispersed along the \(\sigma\)-axis.

The density plot also shows that the product prior is smoother over the parameter space - this is to be expected as defining Rasmus’ prior requires a rounding/floor calculation (which is itself not smooth) to separate the total socks into pairs/singletons.

Finally I will go ahead and derive the posterior using Bayes’ rule. The animation below shows how the posterior distribution changes with the observation of each additional distinct sock, up to \(d=11\).

Show code
# define the log product prior
log_product_prior <- function(rho, sigma, r_rho, p_rho, r_sigma, p_sigma, cap){

  if(2*rho + sigma > cap){
    log_prior <- -Inf
  }
  else{
    log_prior <- dnbinom(rho, size = r_rho, prob = p_rho, log = TRUE) +
      dnbinom(sigma, size = r_sigma, prob = p_sigma, log = TRUE)
  }
  
  return(log_prior)
}

# Negative Binomial prior parameters, matching the priors defined above
r_rho <- 3.25;  p_rho <- 1/5
r_sigma <- 2;   p_sigma <- 1/3

# calculate posterior grid for each of d = 0,...,11 for animation
iterated_posterior_grid <- map(0:11, .f = function(d){
  
  if(d == 0){
    posterior_grid <- calculate_sock_grid(150, 300,
        log_likelihood = NULL,
        log_prior = function(rho,sigma){
          log_product_prior(rho,sigma, r_rho, p_rho, r_sigma, p_sigma, cap = 300)
          }
      ) %>%
      filter(2*rho+sigma <= 300) %>%
      mutate(posterior = prior) %>%
      add_column(d = d, .before = 0)
  } else {
    posterior_grid <- calculate_sock_grid(100, 100,
        log_likelihood = function(rho,sigma){baath_log_likelihood(rho,sigma,d=d)},
        log_prior = function(rho,sigma){
          log_product_prior(rho,sigma, r_rho, p_rho, r_sigma, p_sigma, cap = 300)
          }
      ) %>%
      filter(2*rho+sigma <= 300) %>%
      add_column(d = d, .before = 0)
  }
  
  return(posterior_grid)
}) %>% bind_rows()

# actual posterior grid
posterior_grid <- iterated_posterior_grid %>% filter(d==11)

Total Socks

Posterior Grid

| | Mode | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|
| Prior | 22 | 11 | 17 | 27 | 38 | 52 |
| Bååth’s Posterior | 38 | 27 | 34 | 44 | 56 | 69 |

For completeness I’ll replicate Rasmus’ ABC methodology (with his original prior), to compare the posterior distributions:

| | Mode | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|
| ABC Posterior | 35 | 29 | 35 | 43 | 54 | 65 |

The posterior mode (or maximum a posteriori, MAP, estimate) is the most likely number of socks under the posterior distribution, which we find to be 38 socks, with a central 80% credible interval of [27, 69] (the 10% and 90% quantiles).

Fortunately for us there is no mystery about the true number of socks that were washed: Karl Broman confirmed it to be 45, which is consistent with the posterior estimates.

Stopping Time Model

Rasmus’ model assumes that having drawn the eleventh sock, Karl Broman took to Twitter without checking whether or not the twelfth sock was also distinct.

Call me a sceptic, but I can’t help but think that in fact the twelfth sock was observed, seen to break the run, and the Tweet was made in this knowledge. To finish off this post I’ll derive a likelihood that takes into account this additional information, and see how this changes our posterior estimates.

For my model rather than working with the data \(d\), I’ll introduce

\[ \begin{align*} \bf{\text{Data}}\\ f & = \text{No. socks prior to the first duplicate} \end{align*} \]

The parameters of the model remain \(\rho, \, \sigma\) as before. To be clear, in this notation the first \(f\) socks are distinct, with sock \(f+1\) matching one of the socks already drawn. So in the case of the Tweet, \(f = 11\).

Likelihood \[ L(f|\rho,\sigma) = \left\{(2\rho + \sigma -f)\binom{2\rho + \sigma}{f}\right\}^{-1} \sum_{j=0}^{\sigma} 2^{f-j} \binom{\sigma}{j} \binom{\rho}{f-j} \left(f-j\right) \]

Examples

Example: \(\bf{\rho=3,\,\sigma=4, \, f = 4}\)

As before I will check the formula by comparing it to an estimate derived by sampling.

For the parameter choices set out above, the likelihood formula evaluates to give 0.210.

Show code
set.seed(1414214)

rho <- 3
sigma <- 4
f <- 4

# vector of all the socks
all_socks <- c(rep(paste0("P",1:rho), 2), paste0("S", 1:sigma))

# a function to sample f+1 socks without replacement from all_socks, and return
# 1 if the first f socks are distinct and sock f+1 is a duplicate, 0 otherwise
sample_socks <- function(all_socks, f){
  # draw one more sock than f, as the stopping time is checked at f+1
  sock_sample <- sample(x = all_socks, size = f + 1, replace = FALSE)

  # is f+1 the stopping time? by exchangeability, element 1 of the sample can
  # play the role of the final draw, with elements 2,...,f+1 the first f draws
  stopping_time <-  1 *
    # first f socks are distinct
    (length(sock_sample[-1]) == length(unique(sock_sample[-1]))) *
    # last sock is a duplicate
    (sock_sample[1] %in% sock_sample[-1])

  return(stopping_time)
}

# draw samples 
draws <- tibble(draw = map_dbl(1:1e05, ~sample_socks(all_socks, f)))
Summary

| No. Draws | No. Stopping Time | Prob. Stopping Time |
|---|---|---|
| 100,000 | 20,851 | 0.209 |

Proof

As with my derivation of Bååth’s likelihood, the key will be to condition on the number, \(j\), of singleton socks amongst the first \(f\) drawn. In particular we’re looking to calculate

\[\text{No. ways to choose $f$ distinct socks, followed by a duplicate,}\\ \text{given that $j$ of the first $f$ are singletons.}\]

We know from the previous proof that the number of ways to choose \(f\) distinct socks, with \(j\) singletons is

\[2^{f-j} \binom{\sigma}{j}\binom{\rho}{f-j},\]

to this we need only multiply by the number of ways to pick one further sock that matches one of the first \(f\). Only \(f-j\) of the socks chosen so far have a pair still in the machine (the others being singletons), so there are \(f-j\) socks we could pick on the next draw that would be a duplicate. Putting this together gives the summation part of the likelihood.

It remains to derive the denominator: in our previous calculation this was the number of ways to pick unordered socks, and was given by a binomial coefficient. This time we want to pick \(f\) socks, for which the order does not matter, and then one additional sock that is singled out as the last sock. That last sock can be chosen from amongst the \(2\rho + \sigma - f\) remaining socks, from which we get the denominator:

\[\binom{2\rho + \sigma}{f}(2\rho + \sigma - f).\]

Recall that Rasmus’ likelihood preferred configurations that either had only singleton socks, or infinitely many socks. This is not the case for the stopping time model: in particular the requirement that a duplicate is drawn rules out the scenario of all singleton socks, and in the limit \(\rho \rightarrow \infty\) the probability of drawing a duplicate becomes negligible.

This latter observation gives us some intuition that the posterior distribution under the stopping time model should concentrate on configurations with fewer socks than under Rasmus’ model. This is confirmed in the posterior summaries below.

Show code
# log likelihood for the stopping time model
st_log_likelihood <- function(rho, sigma, f){
  
  # it is not possible to choose more than rho + sigma distinct socks
  if(f > rho + sigma) return(-Inf)
  
  # it is not possible to get a duplicate if all socks are distinct
  if(rho == 0) return(-Inf)
  
  # it is not possible to pick more socks than were washed
  if(f+1 > 2*rho + sigma) return(-Inf)
  
  # log summation terms, for the log-sum-exp trick.
  log_summation_terms <- purrr::map_dbl(0:min(f, sigma), function(j){
    (f-j)*log(2) + lchoose(sigma,j) + lchoose(rho,f-j) + log(f-j)
  })
  
  log_likelihood <- -lchoose(2*rho + sigma, f) - log(2*rho + sigma - f) +
    logSumExp(log_summation_terms)
  
  return(log_likelihood)
}

st_posterior_grid <- calculate_sock_grid(100, 100,
        log_likelihood = function(rho,sigma){st_log_likelihood(rho,sigma,f=11)},
        log_prior = function(rho,sigma){log_product_prior(rho,sigma, r_rho, p_rho, r_sigma, p_sigma, cap = 300)}
      ) %>%
      filter(2*rho + sigma <= 300)

Total Socks

Posterior Grid

# A tibble: 10,201 x 8
# Rowwise: 
     rho sigma log_likelihood log_prior log_posterior likelihood
   <int> <int>          <dbl>     <dbl>         <dbl>      <dbl>
 1     0     0           -Inf     -7.43          -Inf          0
 2     0     1           -Inf     -7.14          -Inf          0
 3     0     2           -Inf     -7.14          -Inf          0
 4     0     3           -Inf     -7.26          -Inf          0
 5     0     4           -Inf     -7.44          -Inf          0
 6     0     5           -Inf     -7.66          -Inf          0
 7     0     6           -Inf     -7.91          -Inf          0
 8     0     7           -Inf     -8.19          -Inf          0
 9     0     8           -Inf     -8.47          -Inf          0
10     0     9           -Inf     -8.77          -Inf          0
# … with 10,191 more rows, and 2 more variables: prior <dbl>,
#   posterior <dbl>
| | Mode | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|
| Prior | 22 | 11 | 17 | 27 | 38 | 52 |
| Bååth’s Posterior | 38 | 27 | 34 | 44 | 56 | 69 |
| Stopping Time Posterior | 34 | 24 | 30 | 38 | 49 | 61 |

Although the posterior mode is lower for the stopping time model, the true answer of 45 socks is still comfortably within the central 50% credible interval.


Acknowledgments

This post was inspired by Karl Broman’s tweet, and Rasmus Bååth’s analysis.

Computation has been done using R.

Computational Considerations

In practice we use a couple of tricks when computing the likelihood function, to avoid the risk of numerical overflow when working with large sums of binomial coefficients.

The first trick is probably familiar: rather than working with the likelihood, we’ll use the log-likelihood. The log-likelihood for Bååth’s model is:

\[l(d|\rho,\sigma) = -\log \textstyle{\binom{2\rho + \sigma}{d}} + \log \bigg(\textstyle \sum_{j=0}^{\sigma} 2^{d-j} \binom{\sigma}{j} \binom{\rho}{d-j}\bigg)\]

If we naively go ahead and calculate that sum and then take logs, then we won’t avoid the risk of overflow at all: we’d still be calculating a sum of large numbers, with the logarithm an afterthought.

What we really want to do is take the logarithm of the terms inside the summation - which will return values well within the computational comfort zone. That is, we want to turn a computation of the form \(\log \left( x_1 + \cdots + x_n\right)\) into a calculation in \(\log(x_1),\ldots, \log(x_n)\).

This is where trick number two comes in: the log-sum-exp trick.

Log-Sum-Exp Trick Given \(x_1,\ldots,x_n > 0\), let \(l_k = \log(x_k)\) and \(l^* = \max_k l_k\). Then

\[ \log(x_1 + \cdots + x_n) = l^* + \log \bigg ( \exp(l_1 - l^*) + \cdots + \exp(l_n - l^*)\bigg) \]

Example

Example: \(\bf{\rho = 30, \, \sigma = 2, \,d = 11}\).

This particular set of parameters admits a relatively concise likelihood, which will help us to write the complete formulae, starting with the log-likelihood:

\[ l(11 | 30,2) = -\log \textstyle \binom{62}{11} + \log \bigg(2^{11} \binom{2}{0}\binom{30}{11} + 2^{10} \binom{2}{1}\binom{30}{10} + 2^{9} \binom{2}{2}\binom{30}{9} \bigg). \]

If we do this by brute force, using R to evaluate each of the powers, and binomial coefficients this becomes:

\[ \begin{align} l(11|30,2) & = -\log 508271323092 \\ & \qquad + \log \big ( 111876710400 + 61532190720 + 7325260800 \big) \\ & = - 26.95 + 25.92 \\ & = -1.034 \end{align}\]

And back on the natural scale, that implies a likelihood of \(L(11|30,2) \approx 0.36\).

The bit that feels uncomfortable there is the sum of all those 10+ digit numbers - in real cases we could be summing more terms, each with more digits. So let’s see how that looks using the log-sum-exp trick.

As per the statement above, we’ll first calculate the log of each of the terms (and include the denominator term, denoted \(A\) here, which we’ll need for the full calculation):

\[ \begin{align} \textstyle \log \binom{62}{11} & = A \approx 26.9543 \\ \textstyle \log 2^{11} \binom{2}{0}\binom{30}{11} &= l_1 \approx 25.4407 \\ \textstyle \log 2^{10} \binom{2}{1}\binom{30}{10} &= l_2 \approx 24.8428 \\ \textstyle \log 2^{9} \binom{2}{2}\binom{30}{9} &= l_3 \approx 22.7146 \\ \end{align} \]

The maximum is \(l^* = l_1\), and we can go ahead and apply the trick:

\[ \begin{align} l(11|30,2) & = - A + l^* + \log \big ( e^{l_1 - l^*} + e^{l_2 - l^*} + e^{l_3 - l^*}\big) \\ & = -1.514 + \log \big( e^{0} + e^{-0.5978} + e^{-2.726} \big) \\ & = -1.514 + \log \big( 1+ {0.55} + {0.06548} \big) \\ & = -1.514 + 0.4796 \\ & = -1.034 \end{align} \]

As expected we get exactly the same result - with the benefit that the largest value we used in any of the sums was \(A \approx 26.95\).

Proof

The trick itself follows from some fairly routine manipulation of exponentials. In what follows, we have \(x_k,\,l_k\) defined as above, and let \(A\) be any constant:

\[ \begin{align*} x_1 + \cdots + x_n & = \exp( \log x_1) + \cdots + \exp( \log x_n) \\ & = \exp(l_1) + \cdots + \exp(l_n) \\ & = \exp(A) \bigg(\exp(l_1 - A) + \cdots + \exp (l_n - A) \bigg) \end{align*} \] Taking logarithms of both sides

\[\log(x_1 + \cdots + x_n) = A + \log \bigg ( \exp(l_1 - A) + \cdots + \exp(l_n - A)\bigg),\] and the trick follows by choosing \(A = l^* = \max_k l_k\).

Importantly, subtracting the maximum guarantees that each of the terms to be exponentiated satisfies \(l_k - l^* \leq 0\), and hence each exponential is bounded above by 1, avoiding any calculations at the limits of machine precision.

Bååth’s Prior

Rasmus defined his prior on \(\rho,\,\sigma\) through the following steps:

  1. Suppose the total number of socks, \(\eta\), follows a negative binomial distribution: \(\eta \sim \text{NBinom}(\mu, \tau)\).

  2. Suppose that the proportion of socks that are in pairs, \(\theta\), follows a beta distribution: \(\theta \sim \text{Beta}(\alpha,\beta)\).

  3. Calculate the number of pairs of socks: \(\rho = \left[ \lfloor \eta/2 \rfloor \theta \right]\), where \(\lfloor \, \cdot \, \rfloor\) denotes the floor, and \(\left[ \, \cdot \, \right]\) denotes rounding.

  4. Calculate the number of singleton socks: \(\sigma = \eta - 2\rho\).

For Rasmus this choice doesn’t create any issues, as it’s perfectly easy to sample from this prior, and hence to conduct ABC. However, as we are looking to do exact Bayesian inference, we have to derive an explicit formula for the prior - and that isn’t trivial. It is doable, but leads to a formula that is not particularly intuitive, which was my motivation for working with the product prior.

Bååth’s Prior Let \(P_{\mu,\tau} \sim \text{NBinom}(\mu,\,\tau)\) and \(P_{\alpha,\beta} \sim \text{Beta}(\alpha,\beta)\) then
\[ P(\rho, \sigma) = P_{\mu, \tau}(2\rho + \sigma) \left\{ F_{\alpha,\beta}\left( \frac{\rho + \textstyle \frac12}{\lfloor \rho + \sigma/2\rfloor}\right) - F_{\alpha,\beta}\left( \frac{\rho - \textstyle \frac12}{\lfloor \rho + \sigma/2\rfloor}\right) \right\}\] where \(F_{\alpha,\beta}(t) = P_{\alpha,\beta}(\theta < t)\) is the CDF of the Beta distribution.

Proof

An application of the law of total probability gives us:

\[ \begin{align} P(\rho, \sigma) & = \sum_{\eta} P(\rho,\sigma | \eta) P(\eta) \\ & = P(\rho, \sigma | \eta = 2\rho + \sigma) P(\eta = 2\rho + \sigma) \\ & = P_{\mu,\tau}(2\rho + \sigma) P(\rho,\sigma|\eta = 2\rho + \sigma). \end{align} \] The first term is as in the formula above, so now we focus on the conditional term. For this we note

\[ \begin{align} P(\rho,\sigma | \eta) & = P\big( \big[ \lfloor \eta/2 \rfloor \theta \big] = \rho\big) \\ & = P_{\alpha,\beta}\big( \rho - \textstyle \frac12 < \lfloor \eta/2 \rfloor \theta < \rho + \textstyle \frac12 \big) \\ & = P_{\alpha,\beta}\left( \frac{\rho - \textstyle \frac12}{\lfloor \eta/2 \rfloor} < \theta < \frac{\rho + \textstyle \frac12}{\lfloor \eta/2 \rfloor} \right) \\ & = P_{\alpha,\beta}\left( \frac{\rho - \textstyle \frac12}{\lfloor \rho + \sigma/2 \rfloor} < \theta < \frac{\rho + \textstyle \frac12}{\lfloor \rho + \sigma/2 \rfloor} \right) \\ \end{align} \] The final statement above follows since the CDF satisfies \(P_{\alpha,\beta}(a < \theta < b) = F_{\alpha,\beta}(b) - F_{\alpha,\beta}(a)\).

Utility Functions

Show code
# function to calculate a grid of log likelihood / prior / posterior values.
#   - if log_prior = NULL, only the likelihood is populated,
#   - if log_likelihood = NULL, only the prior distribution is populated
calculate_sock_grid <- function(rho_max, sigma_max, log_likelihood = NULL, log_prior = NULL){
    
  if(is.null(log_likelihood) & is.null(log_prior)){
    stop("At least one of log_likelihood or log_prior must be provided")
  }

  # initialise grid
  grid <- crossing(
      rho = 0:rho_max,
      sigma = 0:sigma_max,
      log_likelihood = NA_real_,
      log_prior = NA_real_,
      log_posterior = NA_real_
    ) %>% rowwise()
  

  # populate prior/likelihood
  if(!is.null(log_likelihood)){
    grid <- grid %>% mutate(log_likelihood = log_likelihood(rho,sigma))
  }
  
  if(!is.null(log_prior)){
    grid <- grid %>% mutate(log_prior = log_prior(rho,sigma))
  }
  
  # populate the posterior: first calculate the unnormalised log posterior,
  # then evaluate the normalising constant log(Z) and subtract it
  grid <- grid %>% mutate(prop_log_posterior = log_likelihood + log_prior)

  log_Z <- logSumExp(grid$prop_log_posterior)
    
  grid <- grid %>% 
    mutate(log_posterior = -log_Z + prop_log_posterior) %>%
    select(-prop_log_posterior)
    
  # calculate non-log terms
  grid <- grid %>%
    mutate(across(starts_with("log_"), ~exp(.), .names = "exp_{col}")) %>%
    rename_with(~str_remove(.,"exp_log_"))
  
  return(grid)
}

plot_sock_grid <- function(grid, var){
  
  var_name <- rlang::as_name(enquo(var))
  
  if(str_detect(var_name, "log")){
    
    filled_grid <- grid %>% filter(!is.infinite({{var}}))
    empty_grid <- grid %>% filter(is.infinite({{var}}))
  } else {
    filled_grid <- grid %>% filter(({{var}} != 0))
    empty_grid <- grid %>% filter(({{var}} == 0))
  }
  
  p <- ggplot() +
    geom_point(data = filled_grid, aes(x = rho, y = sigma, color = {{var}}), size = 4) +
    geom_point(data = empty_grid, aes(x = rho, y = sigma), color = "grey", shape = 1, size = 4) +
    scale_x_continuous(breaks = seq(0,max(grid$rho),by=10)) +
    scale_y_continuous(breaks = seq(0,max(grid$sigma),by=10)) +
    scale_color_gradientn(
      colours = wesanderson::wes_palette("Zissou1", 1000, type = "continuous"),
      name = "") +
    coord_fixed() + 
    labs(x = "ρ", y = "σ") +
    theme(
      axis.title.y = element_text(angle = 0, vjust = 0.5),
      axis.line = element_blank(), legend.key.width = unit(1.4, "cm")
    )
 
 return(p)
}