
Using Simulation to Determine Sample Sizes for a Study of Store Sales

Suppose a client wants to estimate the total sales value of widgets in a large number of stores. To do this, they will survey a sample of that population of stores. You need to provide the client with advice on choosing a suitable sample size.

Unfortunately, the client has little information to help you. They know that there are 1,000 stores that sell widgets. But they have no idea what the average store's sales might be. All they know from previous studies is that sales tend to be strongly right-skewed: most stores sell very few widgets, and only a few stores sell a lot of them.

This is a fairly typical situation. We deal with a lot of sales data at AlignAlytics and typically find sales volumes and values (allowing for sale price to vary between sellers) to be strongly right-skewed. Sales volumes are often well described by a Poisson distribution; a Pareto or a chi-square distribution often works well for sales values.
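
For intuition, here's a quick sketch of what such right-skewed shapes look like. The parameters are arbitrary, and the Pareto sampler is a hand-rolled inverse-CDF helper, since base R has no rpareto:

# Eyeball some typical right-skewed sales shapes (arbitrary parameters).
set.seed(1)

# Hand-rolled Pareto sampler via the inverse CDF.
rpareto <- function(n, shape, scale) scale / runif(n)^(1 / shape)

par(mfrow=c(1, 3))
hist(rpois(10000, 2),      main="Poisson(2): sales volumes",   xlab="Volume")
hist(rchisq(10000, 2),     main="Chi-square(2): sales values", xlab="Value")
hist(rpareto(10000, 3, 1), main="Pareto(3, 1): sales values",  xlab="Value")
par(mfrow=c(1, 1))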

So, let’s suppose the client tells us that they expect the sales value per store to be distributed something like this:

Histogram of Sales

That looks very much like a chi-square distribution with two degrees of freedom. So we run the following R code:

 
# Create the distribution function.
r_dist_fn <- function(n) rchisq(n, 2)

# Get the dataframe of confidence intervals.
df_ci <- estimate_ci_from_sample(r_dist_fn, pop_size=c(1000), min_sample=10, max_sample=50,
                                 n_simulations=100000, confidence=c(50, 80, 90, 95))

That gives us a dataframe with rows that look like this:

Confidence Intervals Dataframe

This tells us, for instance, that if we use a sample of 20 stores, there is a 90% chance that the total sales in the population of stores is between approximately 72% and 150% of the estimate based on the sample.
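
To pull out that particular row yourself, here's a minimal sketch (assuming the df_ci dataframe created above):

# The 90% interval for a sample of 20 stores from a population of 1,000.
df_ci[df_ci$population_size == 1000 &
      df_ci$sample_size     == 20   &
      df_ci$confidence      == 90, ]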

Here's a bit of code that graphs that dataframe of confidence intervals:

 
par(ask=TRUE)
for (pop in sort(unique(df_ci$population_size))){
   
   # Subset df_ci by population size and get the confidence intervals calculated for that subset.
   df_ci_sub <- df_ci[df_ci$population_size==pop,]
   confidence_interval <- sort(unique(df_ci_sub$confidence))

   # Create an empty plot of the required size.
   plot(x=c(min(df_ci_sub$sample_size), max(df_ci_sub$sample_size)), 
        y=c(min(df_ci_sub$pop_total_div_estimated_total_ci_lower), max(df_ci_sub$pop_total_div_estimated_total_ci_upper)), 
        main=paste0("Confidence Intervals (", paste(confidence_interval, collapse="%, "), "%) for Population Total / Sample Total (Population: ", pop, ")"),
        type='n', xlab="Sample Size", ylab="Population Total / Sample Total")
   
   # Loop across the confidence intervals.
   for (ci in confidence_interval){

      # Graph a confidence interval.
      df_ci_sub_sub <- df_ci_sub[df_ci_sub$confidence==ci,]   
      polygon(c(df_ci_sub_sub$sample_size,                            rev(df_ci_sub_sub$sample_size)), 
              c(df_ci_sub_sub$pop_total_div_estimated_total_ci_lower, rev(df_ci_sub_sub$pop_total_div_estimated_total_ci_upper)), 
              col=rgb(0, 0, 1, 0.2), border=TRUE)
      
   }

   # Draw a horizontal line at y=1.
   lines(y=c(1, 1), x=c(min(df_ci_sub$sample_size), max(df_ci_sub$sample_size)))
}
par(ask=FALSE)

And here's the output:

Confidence bands for Population Total / Sample Total

In the above graph, the widest confidence interval is the 95% interval; the narrowest (closest to the horizontal line at y=1) is the 50% interval.

So, as before, with a sample of 20 stores there is a 90% chance that the total sales in the population of stores lies between approximately 72% and 150% of the estimate based on the sample:

Confidence bands for Population Total / Sample Total, with lines showing the 90% interval for a sample size of 20

Using the above graph and the dataframe of confidence intervals, the client should be able to choose a sensible sample size. This will involve balancing the cost of increasing the sample size against the accuracy improvement achieved by doing so.
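
One simple way to quantify that trade-off is to look at how quickly a given confidence interval narrows as the sample size grows. A minimal sketch, again assuming the df_ci dataframe from above:

# Width of the 90% interval (upper minus lower) for each sample size.
df_90 <- df_ci[df_ci$confidence == 90 & df_ci$population_size == 1000, ]
ci_width <- df_90$pop_total_div_estimated_total_ci_upper -
            df_90$pop_total_div_estimated_total_ci_lower
plot(df_90$sample_size, ci_width, type="b",
     xlab="Sample Size", ylab="Width of 90% Confidence Interval")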

Finally, here's the estimate_ci_from_sample function used above:

 
estimate_ci_from_sample <- function(r_dist_fn, pop_size, min_sample, max_sample, n_simulations=1000, confidence=c(50, 80, 90, 95)){
   # Returns a dataframe of confidence intervals for the sum of a population of real numbers (values given by r_dist_fn) divided by
   # the sum of a sample from that population.
   #
   # r_dist_fn:     A function taking one parameter, n, and returning n random samples from a distribution.
   # pop_size:      A vector of population sizes, e.g. c(100, 1000, 2000).
   # min_sample:    If min_sample is in (0, 1) the minimum sample size is a fraction of the population size. If min_sample is a
   #                positive integer, the minimum sample size is a fixed number (= min_sample).
   # max_sample:    If max_sample is in (0, 1) the maximum sample size is a fraction of the population size. If max_sample is a
   #                positive integer, the maximum sample size is a fixed number (= max_sample).
   # confidence:    A vector of the required confidence intervals, e.g. c(50, 80, 90, 95).
   # n_simulations: The number of simulations to run per population size + sample size combination. The higher this is, the more
   #                accurate the results but the slower the calculation. 
   
   # Useful functions.
   is_int <- function(x) x %% 1 == 0
   sample_int <- function(spl, pop_size) ifelse(is_int(spl), min(spl, pop_size), round(spl * pop_size))
   
   # Check the min_sample and max_sample parameters.
   if (min_sample <= 0 || (min_sample > 1 && !is_int(min_sample))) stop("min_sample must be in (0, 1) or be an integer in [1, inf).")
   if (max_sample <= 0 || (max_sample > 1 && !is_int(max_sample))) stop("max_sample must be in (0, 1) or be an integer in [1, inf).")
   if (is_int(min_sample) == is_int(max_sample) && max_sample < min_sample) stop("max_sample should be greater than or equal to min_sample.")
   
   # Create the dataframe to hold the results.
   df_ci <- data.frame()

   for (population_size in pop_size){

      # Determine the sample size range.
      sample_int_min <- sample_int(min_sample, population_size)
      sample_int_max <- sample_int(max_sample, population_size)
      
      # Yes, it can happen that sample_int_min > sample_int_max, despite the parameter checks, above.
      if (sample_int_min <= sample_int_max){
      
         for (sample_size in seq(sample_int_min, sample_int_max)){
            
            cat(paste0("\nCalculating ", n_simulations, " ", sample_size, "-size samples for population size ", population_size, "."))
            
            # Calculate the pop_total_div_estimated_total vector.
            pop_total_div_estimated_total <- rep(NA, n_simulations)
            for (i_sim in 1:n_simulations){
               population <- r_dist_fn(population_size)
               sample_from_pop <- sample(population, sample_size)
               pop_total_div_estimated_total[i_sim] <- sum(population) / (population_size * mean(sample_from_pop))
            }
            
            # Loop across the required confidence levels.
            for (conf in confidence){
               # Calculate the confidence interval.
               alpha <- (100 - conf) / 100
               ci <- quantile(pop_total_div_estimated_total, probs=c(alpha / 2, 1 - alpha / 2))
               
               # Add a row to the dataframe.
               df_ci_row <- data.frame(population_size                        = population_size, 
                                       sample_size                            = sample_size, 
                                       confidence                             = conf, 
                                       pop_total_div_estimated_total_ci_lower = ci[1], 
                                       pop_total_div_estimated_total_ci_upper = ci[2])
               df_ci <- rbind(df_ci, df_ci_row) 
            }
            
         } # Ends sample_size for loop.

      } # Ends if.

   } # Ends population_size for loop.

   return(df_ci)
}
Posted on May 26, 2016 by Danielle Mosimann

Drawing a Grid of Plots in R — Regression Lines, Loess Curves and More

We provide here an R function that draws a grid of plots, revealing relationships between the variables in a dataset and a given target variable.

Scatterplots in the grid include regression lines, loess curves and the adjusted R-squared statistic.

Boxplots have points indicating the group means. Box widths are proportional to the square-root of the number of observations in the relevant group. The p-value is shown for an F-test: p < 0.05 indicates a significant difference between the means of the groups. But don't take this p-value on faith: Be sure to check the assumptions of the one-way ANOVA model.
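
For instance, here is one quick way to eyeball those assumptions (roughly normal residuals and roughly equal group variances). This is only a sketch, using the built-in mtcars dataset rather than anything produced by multiplot:

# Rough checks of the one-way ANOVA assumptions on a toy example.
fit <- aov(disp ~ as.factor(cyl), data=mtcars)
par(mfrow=c(1, 2))
plot(fit, which=1)   # residuals vs fitted values: look for roughly equal spread
plot(fit, which=2)   # normal Q-Q plot of the residuals
par(mfrow=c(1, 1))
bartlett.test(disp ~ as.factor(cyl), data=mtcars)   # formal test of equal variances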

Mosaic plots include the p-value of a chi-square test of independence: p < 0.05 indicates that there is a significant relationship between the two variables under consideration. The number of plot cells with a count under five is shown; if this is greater than zero, the chi-square test may be invalid.

Here's an example using a continuous target variable:

 
mtcars2 <- mtcars
mtcars2$cyl  <- as.factor(mtcars2$cyl)
mtcars2$vs   <- as.factor(mtcars2$vs)
mtcars2$am   <- as.factor(mtcars2$am)
mtcars2$gear <- as.factor(mtcars2$gear)
mtcars2$carb <- as.factor(mtcars2$carb)
multiplot(mtcars2, 'disp', c(2, 5))

Drawing a Grid of Plots in R
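
To see where the scatterplot subtitles and lines come from, here is a minimal standalone sketch for a single panel, regressing the target disp on one predictor (wt, chosen purely for illustration):

# Reproduce one scatterplot panel by hand: regression line, lowess curve and
# the adjusted R-squared shown in the subtitle.
fit <- lm(disp ~ wt, data=mtcars2)
summary(fit)$adj.r.squared
plot(disp ~ wt, data=mtcars2, pch=19, col=rgb(0, 0, 0, 0.2))
abline(fit, col="red", lwd=2)
lines(lowess(mtcars2$wt, mtcars2$disp), col="blue", lwd=2)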

This example has a categorical target variable:

 
multiplot(mtcars2, 'gear', c(2, 5))

Drawing a Grid of Plots in R
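
Similarly, the mosaic-plot subtitles can be reproduced by hand. A small sketch, assuming the mtcars2 dataframe created above:

# The chi-square test of independence and the count of sparse cells that
# multiplot reports for a pair of factors.
tbl <- table(mtcars2$cyl, mtcars2$gear)
chisq.test(tbl)$p.value   # R may warn that the approximation is poor...
sum(tbl < 5)              # ...which is exactly what this count of cells under five flags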

Finally, here’s the multiplot function:

 
multiplot <- function(df_data, y_column, mfrow=NULL){
   #
   # Plots the data in column y_column of df_data against every other column in df_data, a dataframe.
   # By default the plots are drawn next to each other (i.e. in a row). Use mfrow to override this. E.g. mfrow=c(2, 3).
   #
   
   # Set the layout
   if (is.null(mfrow)) mfrow <- c(1, ncol(df_data) - 1)
   op <- par(mfrow=mfrow, mar=c(5.1, 4.1, 1.1, 1.1), mgp = c(2.2, 1, 0))
   on.exit(par(op))

   for (icol in which(names(df_data) != y_column)){
      x_column <- names(df_data)[icol]
      y_x_formula <- as.formula(paste(y_column, "~", x_column))
      x_y_formula <- as.formula(paste(x_column, "~", y_column))
      x <- df_data[[x_column]]
      y <- df_data[[y_column]]
      subtitle <- ""
      
      if (is.factor(x)){
         if (is.factor(y)){
            # Mosaic plot.
            tbl <- table(x, y)
            chi_square_test_p <- chisq.test(tbl)$p.value
            problem_cell_count <- sum(tbl < 5)
            subtitle <- paste("Chi-Sq. Test P:", round(chi_square_test_p, 3)," (< 5 in ", problem_cell_count, " cells.)")
            plot(y_x_formula, data=df_data)
         } else {
            # Vertical boxplot.
            fit <- aov(y_x_formula, data=df_data)
            f_test_p <- summary(fit)[[1]][["Pr(>F)"]][[1]]
            subtitle <- paste("F-Test P:", round(f_test_p, 3))
            boxplot(y_x_formula, data=df_data, horizontal=FALSE, varwidth=TRUE)
            means <- tapply(y, x, function(z){mean(z, na.rm=TRUE)})
            points(x=means, col="red", pch=18)
         }
      } else {
         if (is.factor(y)){
            # Horizontal boxplot.
            fit <- aov(x_y_formula, data=df_data)
            f_test_p <- summary(fit)[[1]][["Pr(>F)"]][[1]]
            subtitle <- paste("F-Test P:", round(f_test_p, 3))
            boxplot(x_y_formula, data=df_data, horizontal=TRUE, varwidth=TRUE)
            means <- tapply(x, y, function(z){mean(z, na.rm=TRUE)})
            points(x=means, y=1:length(levels(y)), col="red", pch=18)
         } else {
            # Scatterplot with straight-line regression and lowess line.
            adj_r_squared <- summary(lm(y_x_formula, df_data))$adj.r.squared
            subtitle <- paste("Adj. R Squared:", round(adj_r_squared, 3))
            plot(y_x_formula, data=df_data, pch=19, col=rgb(0, 0, 0, 0.2))
            abline(lm(y_x_formula, data=df_data), col="red", lwd=2)
            lines(lowess(x=x, y=y), col="blue", lwd=2) 
         }
      }
      title(sub=subtitle, xlab=x_column, ylab=y_column)
   }
}
Posted on March 21, 2016 by Danielle Mosimann

Seasonal Decomposition of Time Series by Loess—An Experiment

Let’s run a simple experiment to see how well the stl() function of the R statistical programming language decomposes time-series data.

An Example

First, we plot some sales data:

 
sales<-c(39,  73,  41,  76,  75,  47,   4,  53,  40,  47,  31,  33,
         58,  85,  61,  98,  90,  59,  34,  74,  78,  74,  56,  55,
         91, 125,  96, 135, 131, 103,  86, 116, 117, 128, 113, 123)
time.series <- ts(data=sales, frequency = 12, start=c(2000, 1), end=c(2002, 12))
plot(time.series, xlab="Time", ylab="Sales (USD)", main="Widget Sales Over Time")

Observe the annual seasonality in the data:

A time series of widget sales

We apply R's stl() function ("seasonal and trend decomposition using Loess") to the sales data:

 
decomposed  <- stl(time.series, s.window="periodic")
plot(decomposed)

This decomposes the sales data as the sum of a seasonal, a trend and a noise/remainder time-series:

A time series of widget sales decomposed into seasonal, trend and noise/remainder components

We may easily extract the component time series:

 
decomposed <- stl(time.series, s.window="periodic")
seasonal   <- decomposed$time.series[, 1]
trend      <- decomposed$time.series[, 2]
remainder  <- decomposed$time.series[, 3]

This allows us to plot the seasonally-adjusted sales:

 
plot(trend + remainder,
     main="Widget Sales over Time, Seasonally Adjusted",
     ylab="Sales (USD)")

A time series of seasonally-adjusted widget sales

An Experiment

How well does stl() extract trend and seasonality from data? We run three simple graphical investigations.

Case 1: Strong seasonality and low, normally-distributed homoskedastic noise

An experiment in decomposition by Loess of a time series showing strong seasonality and low, normally-distributed homoskedastic noise

The left side of each of the above images shows, from top to bottom:

  1. Generated sales data.
  2. The trend component from which the data was generated.
  3. The seasonal component from which the data was generated.
  4. The noise/remainder component from which the data was generated.

The right side shows:

  1. Generated sales data.
  2. The trend component identified by stl().
  3. The seasonal component identified by stl().
  4. The noise/remainder component identified by stl().

Note the close match between the two trend components and between the two seasonal components. This indicates that stl() works well in this instance.

Case 2: Weak seasonality and high, normally-distributed homoskedastic noise

An experiment in decomposition by Loess of a time series showing weak seasonality and high, normally-distributed homoskedastic noise

Again, stl() appears to work quite well.

Case 3: Weak seasonality and high, normally-distributed heteroskedastic noise

An experiment in decomposition by Loess of a time series showing weak seasonality and high, normally-distributed heteroskedastic noise

And stl() still seems to work fairly well. This is heartening, as it's common for the variance in a time series to increase as its mean rises—as is the case here.

How stl() Works

When calling stl() with s.window="periodic", the seasonal component for January is simply the mean of all January values. Similarly, the seasonal component for February is simply the mean of all February
values, etc. Otherwise, the seasonal component is calculated using loess smoothing (discussed below).

Having calculated the seasonal component, the seasonally-adjusted data (the original data minus the seasonal component) is loess-smoothed to determine the trend.

The remainder/noise is then the original data minus the seasonal and trend components.
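
One property worth knowing, easy to verify on the example above (assuming the time.series and decomposed objects already exist): the three components always add back up to the original series.

# The seasonal, trend and remainder components sum to the original data.
all.equal(as.numeric(time.series),
          as.numeric(rowSums(decomposed$time.series)))   # expected: TRUE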

The stl() function is quite flexible (see the sketch after this list):

  • The seasonality does not have to run across a year. Any period may be used for this.
  • The decomposition process can accommodate seasonality that changes over time.
  • A robust decomposition process is available that is less affected by outliers than is the default.
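
A minimal sketch of those options, reusing the time.series object from the example above (the s.window value here is arbitrary):

# Let the seasonality evolve over time: s.window is the loess span (in lags)
# used for seasonal extraction; it should be odd and at least 7. robust=TRUE
# switches on the robust fitting that is less affected by outliers.
decomposed_flex <- stl(time.series, s.window=13, robust=TRUE)
plot(decomposed_flex)

# A different seasonal period is set when the ts object is created, e.g.
# ts(daily_sales, frequency=7) for weekly seasonality in a hypothetical
# daily_sales vector.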

An Introduction to Loess Smoothing

Loess ("locally-weighted scatterplot smoothing") uses local regression to remove "jaggedness" from data.

  1. A window of a specified width is placed over the data. The wider the window, the smoother the resulting loess curve.
  2. A regression line (or curve) is fitted to the observations that fall within the window, the points closest to the centre of the window being weighted to have the greatest effect on the calculation of the regression line.
  3. The weighting is reduced on those points within the window that are furthest from the regression line. The regression is re-run and weights are again re-calculated. This process is repeated several times.
  4. We thereby obtain a point on the loess curve. This is the point on the regression line at the centre of the window.
  5. The loess curve is calculated by moving the window across the data. Each point on the resulting loess curve is the intersection of a regression line and a vertical line at the centre of such a window.

To calculate a loess curve using R:

 
plot(cars$speed, cars$dist, main="Car Speed and Stopping Distance", xlab="Speed (mph)", ylab="Stopping Distance (ft)")
lines(lowess(cars$speed, cars$dist), col="red")

A scatterplot example of Loess fitting
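
The window width from step 1 corresponds to the f argument of lowess(): the fraction of the points included in each window, with a default of 2/3. A quick sketch of its effect:

# Narrower windows follow the data more closely; wider windows smooth more.
plot(cars$speed, cars$dist, main="Car Speed and Stopping Distance",
     xlab="Speed (mph)", ylab="Stopping Distance (ft)")
lines(lowess(cars$speed, cars$dist, f=0.2), col="red")    # narrow window: wigglier
lines(lowess(cars$speed, cars$dist, f=0.9), col="blue")   # wide window: smoother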

Generating Test Data

Here, for completeness, is the code I used to generate the graphs in my tests:

 
# Parameters
start.ym <- c(2000, 1)
end.ym   <- c(2012,12)
n.years  <- 13

# Set the seed for the randomisation
set.seed(5)

# Create the 2nd derivative of the sales trend
ddtrend.sales <- qnorm(runif(12*n.years, 0.1, 0.90), mean=0, sd=0.4)

# Create the 1st derivative of the sales trend from the 2nd derivative
dtrend.sales    <- rep(NA, 12*n.years)
dtrend.sales[1] <- 0
for (i in 2:(12*n.years)) dtrend.sales[i] <- dtrend.sales[i-1] + ddtrend.sales[i]

# Create the sales trend from the 1st derivative
trend.sales    <- rep(NA, 12*n.years)
trend.sales[1] <- 30
for (i in 2:(12*n.years)){
   trend.sales[i] <- trend.sales[i-1] + dtrend.sales[i]
   if (trend.sales[i] < 0) trend.sales[i] = 0
}

# Create the seasonality
seasonality <- rep(c(10, 30, 22, 32, 26, 14, 2, -15, -14, -13, -16, -2), 13)

# Create the random noise, normally distributed
noise <- qnorm(runif(12*n.years, 0.01, 0.99), mean=0, sd=18)

# To make the noise heteroskedastic, uncomment the following line
# noise <- noise * seq(1, 10, (10-1)/(12*n.years-1))

# Create the sales
sales <- trend.sales + seasonality + noise

# Put everything into a data frame
df.sales <- data.frame(sales, trend.sales, dtrend.sales, ddtrend.sales, seasonality, noise)

# Set graphical parameters and the layout
par(mar = c(0, 4, 0, 2)) # bottom, left, top, right
layout(matrix(c(1, 5, 2, 6, 3, 7, 4, 8), 4, 2, byrow = TRUE))

# Plot sales
tseries <- ts(data=df.sales$sales, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Sales (USD, 1000's)", main="", xaxt="n")

# Plot the trend
tseries <- ts(data=df.sales$trend.sales, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Actual Sales Trend (USD, 1000's)", main="", xaxt="n")

# Plot the seasonality
tseries <- ts(data=df.sales$seasonality, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Actual Sales Seasonality (USD, 1000's)", main="", xaxt="n")

# Plot the noise
tseries <- ts(data=df.sales$noise, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Actual Sales Noise (USD, 1000's)", main="", xaxt="n")

# Decompose the sales time series
undecomposed   <- ts(data=df.sales$sales, frequency = 12, start=start.ym, end=end.ym)
decomposed     <- stl(undecomposed, s.window="periodic")
seasonal    <- decomposed$time.series[, 1]
trend       <- decomposed$time.series[, 2]
remainder   <- decomposed$time.series[, 3]

# Plot sales
tseries <- ts(data=df.sales$sales, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Sales (USD, 1000's)", main="", xaxt="n")

# Plot the decomposed trend
tseries <- ts(data=trend, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Est. Sales Trend (USD, 1000's)", main="", xaxt="n")

# Plot the decomposed seasonality
tseries <- ts(data=seasonal, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Est. Sales Seasonality (USD, 1000's)", main="", xaxt="n")

# Plot the decomposed noise
tseries <- ts(data=remainder, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Est. Sales Noise (USD, 1000's)", main="", xaxt="n")

Author: Peter Rosenmai

Posted on June 14, 2014 by Danielle Mosimann

Outlier Reporting and Benefits from Unit Testing in R

A recent AlignAlytics analysis project, reliant on Big Data processing and storage, required complex outlier reporting using the R statistical-programming language. This open-source software, combined with in-house statistical skills, allowed the team to quickly produce reports that are now the foundation of an on-going strategic analysis programme.

Unit testing is one part of that story, and we hope Peter Rosenmai will continue to share more with us.

Getting started with unit testing in R

Unit testing is an essential means of creating robust code. The basic idea is simple: You write tests that the functions you code are required to fulfil; whenever you thereafter make changes to your code, you can run the tests to ensure that your functions all still work.

Such future-proofing is obviously useful, but unit testing brings other benefits. It forces you to break your code down into discrete, testable units. And the tests provide excellent examples of how your functions should be called. That can be really useful, especially when code commenting is shoddy or out of date.

Here’s an example of unit testing in the R statistical-programming language using the RUnit package. We have a file main.r in our current working directory. That file contains main(), our top-level function:

 
# main.r

# Load in the unit testing package
require(RUnit)

# Load in source files
source("string-utils.r")

# Function to run all unit tests (functions named test.*) in all
# R files in the current working directory
runUnitTests <- function(){
   cat("Running all unit tests (being functions that begin with 'test.')
        in all R files in the current working directory.")

   tests <- defineTestSuite("Tests", dirs=getwd(),
                            testFileRegexp = "^.*\\.[rR]$",
                            testFuncRegexp = "^test\\..+")

   test.results <- runTestSuite(tests)

   cat(paste(test.results$Tests$nTestFunc,    " test(s) run.\n",
             test.results$Tests$nDeactivated, " test(s) deactivated.\n",
             test.results$Tests$nFail,        " test(s) failed.\n",
             test.results$Tests$nErr,         " errors reported.\n",
             sep=""))

   if((test.results$Tests$nFail > 0) || (test.results$Tests$nErr > 0)){
      stop("Execution halted following unit testing. Fix the
	        above problem(s)!")
   }
}

main <- function(run.unit.tests=TRUE){
   if (run.unit.tests) runUnitTests()

   # Your code here...
}

The above code loads in from our current working directory the file string-utils.r:

 
# string-utils.r

# Load the unit testing package
require(RUnit)

# Function to trim a string
trim <- function(str){
   if (!is.character(str)){
      stop(paste("trim passed a non string:", str))
   }

   return(gsub("^\\s+|\\s+$", "", str))
}
test.trim <- function(){
   checkTrue(trim("  abc ") == "abc")
   checkTrue(trim("a b ")   == "a b")
   checkTrue(trim(" a b")   == "a b")
   checkTrue(trim("")       == "")
   checkException(trim(3), silent=TRUE)
}

We run our top-level function using:

rm(list=ls(all=TRUE)); source("main.r", echo=FALSE); main()

That line removes all variables from the workspace, creates the functions in the above blocks of code and calls main(). The first thing main() does is call runUnitTests() to run all functions with names that start with "test." in all R files in the current working directory. Those are our unit tests.

For example, one of those unit test functions is test.trim(), the function shown above that checks that trim() is working as it should. Note how test.trim() not only checks expected return values but makes sure that trim() throws exceptions when it should. And what does trim() do? The examples in the test code should make it clear—which is why I like to keep the unit tests together with the functions that they test.
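
To extend the pattern, here is a hypothetical second helper and its test. Neither is part of the original project; checkEquals is simply another RUnit assertion:

# A hypothetical helper, kept together with its unit test.
pad_left <- function(str, width, pad=" "){
   if (!is.character(str)) stop(paste("pad_left passed a non string:", str))
   n <- width - nchar(str)
   if (n <= 0) return(str)
   paste0(paste(rep(pad, n), collapse=""), str)
}
test.pad_left <- function(){
   checkEquals(pad_left("7",   3, "0"), "007")
   checkEquals(pad_left("abc", 2),      "abc")
   checkException(pad_left(7, 3), silent=TRUE)
}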

The above is the briefest of introductions to a huge topic. I could say a lot more about, for instance, test-driven development, refactoring and code coverage. But my aim here is not that ambitious. If you’re an analyst or a statistician, chances are you haven’t previously heard of unit testing. If that’s the case, I merely wish to suggest that you give the above a try the next time you find yourself coding in R. Unit testing really is worth the effort.

Author: Peter Rosenmai

Posted on January 19, 2014 by Danielle Mosimann