Category: Simulation

Using Simulation to Determine Sample Sizes for a Study of Store Sales

Suppose a client wants to estimate the total sales value of widgets in a large number of stores. To do this, they will survey a sample of that population of stores. You need to provide the client with advice on choosing a suitable sample size.

Unfortunately, the client has little information to help you. They know that there are 1,000 stores that sell widgets. But they have no idea what the average store sales might be. All they know from previous studies is that the sales tend to be very right skew: Most stores sell very few widgets and very few stores sell a lot of widgets.

This is a fairly typical situation. We deal with a lot of sales data at AlignAlytics and typically find sales volumes and values (allowing for sale price to vary between sellers) to be very right-skew. Sales volumes are often well described by a Poisson distribution; A Pareto or a chi-square distribution often works well for sales values.

So, let’s suppose the client tells us that they expect the sales value per store to be distributed something like this:

Histogram of Sales

That looks very much like a chi-square distribution with two degrees of freedom. So we run the following R code:

# Create the distribution function.
r_dist_fn <- function(n) rchisq(n, 2)

# Get the dataframe of confidence intervals.
df_ci <- estimate_ci_from_sample(r_dist_fn, pop_size=c(1000), min_sample=10, max_sample=50, n_simulations=100000, confidence=c(50, 80, 90, 95))

That gives us a dataframe with rows that look like this:

Confidence Intervals Dataframe

This tells us, for instance, that if we use a sample of 20 stores, there is a 90% chance that the total sales in the population of stores is between approximately 72% and 150% of the estimate based on the sample.

Here's a bit of code that graphs that dataframe of confidence intervals:

for (pop in sort(unique(df_ci$population_size))){
   # Subset df_ci by population size and get the confidence intervals calculated for that subset.
   df_ci_sub <- df_ci[df_ci$population_size==pop,]
   confidence_interval <- sort(unique(df_ci_sub$confidence))

   # Create an empty plot of the required size.
   plot(x=c(min(df_ci_sub$sample_size), max(df_ci_sub$sample_size)), 
        y=c(min(df_ci_sub$pop_total_div_estimated_total_ci_lower), max(df_ci_sub$pop_total_div_estimated_total_ci_upper)), 
        main=paste0("Confidence Intervals (", paste(confidence_interval, collapse="%, "), "%) for Population Total / Sample Total (Population: ", pop, ")"),
        type='n', xlab="Sample Size", ylab="Population Total / Sample Total")
   # Loop across the confidence intervals.
   for (ci in confidence_interval){

      # Graph a confidence interval.
      df_ci_sub_sub <- df_ci_sub[df_ci_sub$confidence==ci,]   
      polygon(c(df_ci_sub_sub$sample_size,                            rev(df_ci_sub_sub$sample_size)), 
              c(df_ci_sub_sub$pop_total_div_estimated_total_ci_lower, rev(df_ci_sub_sub$pop_total_div_estimated_total_ci_upper)), 
              col=rgb(0, 0, 1, 0.2), border=TRUE)

   # Draw a horizontal line at y=1.
   lines(y=c(1, 1), x=c(min(df_ci_sub_sub$sample_size), max(df_ci_sub_sub$sample_size)))   

And here's the output:

Confidence bands for Population Total / Sample Total

In the above graph, the widest confidence interval is the 95% interval, the thinnest (closest to the horizontal line at y=1) is the 50% confidence interval.

So, as before, there is a 90% chance that the total sales in the population of stores is between approximately 72% and 150% of the estimate based on the sample:

Confidence bands for Population Total / Sample Total with lines showing 90% interval for a 20 sample size

Using the above graph and the dataframe of confidence intervals, the client should be able to choose a sensible sample size. This will involve balancing the cost of increasing the sample size against the accuracy improvement achieved by doing so.

Finally, here's the estimate_ci_from_sample function used above:

estimate_ci_from_sample <- function(r_dist_fn, pop_size, min_sample, max_sample, n_simulations=1000, confidence=c(50, 80, 90, 95)){
   # Returns a dataframe of confidence intervals for the sum of a population of real numbers (values given by r_dist_fn) divided by
   # the sum of a sample from that population.
   # r_dist_fn:     A function taking one parameter, n, and returning n random samples from a distribution.
   # pop_size:      A vector of population sizes, e.g. c(100, 1000, 2000).
   # min_sample:    If min_sample is in (0, 1) the minimum sample size is a fraction of the population size. If min_sample is a
   #                positive integer, the minimum sample size is a fixed number (= min_sample).
   # max_sample:    If max_sample is in (0, 1) the maximum sample size is a fraction of the population size. If max_sample is a
   #                positive integer, the maximum sample size is a fixed number (= max_sample).
   # confidence:    A vector of the required confidence intervals, e.g. c(50, 80, 90, 95).
   # n_simulations: The number of simulations to run per population size + sample size combination. The higher this is, the more
   #                accurate the results but the slower the calculation. 
   # Useful functions.
   is_int <- function(x) x %% 1 == 0
   sample_int <- function(spl, pop_size) ifelse(is_int(spl), min(spl, pop_size), round(spl * pop_size))
   # Check the min_sample and max_sample parameters.
   if (min_sample <= 0 || (min_sample > 1 && !is_int(min_sample))) stop("min_sample must be in (0, 1) or be an integer in [1, inf).")
   if (max_sample <= 0 || (max_sample > 1 && !is_int(max_sample))) stop("max_sample must be in (0, 1) or be an integer in [1, inf).")
   if (is_int(min_sample) == is_int(max_sample) && max_sample < min_sample) stop("max_sample should be greater than or equal to min_sample.")
   # Create the dataframe to hold the results.
   df_ci <- data.frame()

   for (population_size in pop_size){

      # Determine the sample size range.
      sample_int_min <- sample_int(min_sample, population_size)
      sample_int_max <- sample_int(max_sample, population_size)
      # Yes, it can happen that sample_int_min > sample_int_max, despite the parameter checks, above.
      if (sample_int_min <= sample_int_max){
         for (sample_size in seq(sample_int_min, sample_int_max)){
            cat(paste0("\nCalculating ", n_simulations, " ", sample_size, "-size samples for population size ", population_size, "."))
            # Calculate the pop_total_div_estimated_total vector.
            pop_total_div_estimated_total <- c(NA, n_simulations)
            for (i_sim in 1:n_simulations){
               population <- r_dist_fn(population_size)
               sample_from_pop <- sample(population, sample_size)
               pop_total_div_estimated_total[i_sim] <- sum(population) / (population_size * mean(sample_from_pop))
            # Loop across the required confidence levels.
            for (conf in confidence){
               # Calculate the confidence interval.
               alpha <- (100 - conf) / 100
               ci <- quantile(pop_total_div_estimated_total, probs=c(alpha / 2, 1 - alpha / 2))
               # Add a row to the dataframe.
               df_ci_row <- data.frame(population_size                        = population_size, 
                                       sample_size                            = sample_size, 
                                       confidence                             = conf, 
                                       pop_total_div_estimated_total_ci_lower = ci[1], 
                                       pop_total_div_estimated_total_ci_upper = ci[2])
               df_ci <- rbind(df_ci, df_ci_row) 
         } # Ends sample_size for loop.

      } # Ends if.

   } # Ends population_size for loop.

Posted on May 26, 2016 by Danielle Mosimann