Author: Danielle Mosimann

Make or Buy? Organic growth or M&A?

A tale of two value creation opportunities 


On 29 December 2016, and again on 7 February 2017, the Financial Times wrote about an M&A boom: “The M&A boom will carry on…Many companies face poor organic growth prospects, forcing them to consider buying rivals or expanding in new territories…” Deloitte reports that 75% of executives expect deals to increase in 2017, while according to Moody’s, “A ‘major theme’ of recent activity was positioning for the future through the acquisition of technology.”

Does this strong appetite for acquisitions create economic value?

Depending on the industry and the profile of the acquisition target, bigger may well be better. But if both the acquirer and the acquired were struggling to grow before the acquisition, are we not simply moving the problem to the future?

Acquisitions instantly increase revenues and usually earnings per share. From this perspective, deals that expand a business’s geographic footprint and improve its competitive position can be an attractive approach, especially in mature markets. Focusing on managing the integration of the two companies, improving economic performance and cost efficiency, and leveraging greater market share will certainly create some economic value. However, after a few years these effects taper off and executives will need to decide what’s next.

Furthermore, the current high share prices or, perhaps better said, the high market valuations put pressure on executives to quickly deliver synergies. This is usually shorthand for cutting costs. According to a Bain study from late 2014, 70% of companies announce synergies that are higher than the scale curve suggests. Unsurprisingly, most companies will be disappointed by the actual synergies created.

Against that background, organic growth can be an attractive alternative, and executives often underestimate its power. Clearly, organic growth takes more effort and more time to manifest itself, but our research shows that it typically generates up to one third more economic value.

This is hardly surprising: the upfront investment for organic growth is lower, while an acquisition price usually includes a takeover premium. Over time, therefore, ROIC and ROE are higher for organic growth than for acquisitions.

This is a good reason to look hard at creating internal growth opportunities and leave the acquisitions to competitors that have run out of ideas.

Karel Leeflang and Roland Mosimann, StrategyPod powered by AlignAlytics

Posted on March 10, 2017 by Danielle Mosimann

Segmenting your way to pricing profits!

For many, segmentation is the single critical factor that can drive a differentiated pricing agenda and therefore an accelerated route to profit growth. While this truth may be obvious to most, why is segmentation so difficult to implement in practice? Time and time again, the gap between the theory of using segmentation for pricing excellence and the implemented reality proves to be very wide.

So why is there such a gap between the theory and reality of good segmentation? A number of practical hurdles present themselves when it comes to segmentation:

  • There is the Data Segmentation Mountain to climb, which is hard work!
  • There is segmentation, segmentation and segmentation! What level of sophistication are you able and willing to implement, and how does it relate to perceived customer value?
  • The organisational and functional bias of the business will influence the segmentation. Is the segmentation drive more finance, sales or marketing led and how are these interconnecting influences integrated?
  • Segmentation “velocity” or its propensity to change is a dynamic that is often underestimated. If the segmentation cannot be kept up to date the whole framework for differentiated pricing quickly deteriorates.

1. The Data Segmentation Mountain

Data Segmentation Mountain 

The issue here is the complexity and size of the data that needs to be segmented. Specifically, this is a function of your channels or routes to market, the number of product items, your existing and prospective customers, and your market segments. Even with a limited number of product items, customers and channels the picture quickly gets complex, e.g. 100 items x 100 customers x 4 channels = 40,000 elements to be segmented.

This can be further complicated if the business works across more than one ERP system, where products and customers need to be aligned between those systems. Master data management is a big theme within IT, ensuring that customer records and product hierarchies are correctly mapped across the various ERP systems. However, these mappings frequently do not include a more commercial, market-specific segmentation. A market-led segmentation would make sense to incorporate at the same time, but unfortunately the organisational silos of the typical business often hinder such an outcome.

Techniques for segmenting and categorising your data mountain exist and typically require a blending of business, data, analytical and IT capabilities. This blend allows for a systematic approach that can then process the segments even where the complexity and data load is daunting.

 

2. Segmentation vs segmentation vs segmentation

 

Segmentation will vary in sophistication partially because it is difficult to implement but also because business models work differently across industries and markets.

Segmentation & Ability to Price

The more finely a business can segment its customers into value categories, the more of that value it can extract through differentiated pricing. Since value relates to the whole business proposition and often involves intangibles (e.g. service and relationship), it is in fact the perceived customer value that counts. But this value segmentation needs to take account of the competitive or alternative options a customer is being offered before you can determine whether the value segmentation is real.

The mechanism to define value segments can take multiple forms depending on the industry and the ability to identify value drivers. Some of these may be behavioural (e.g. consumer behaviour for different occasions), others are linked to your contribution to a customer’s own value proposition, and then there are various market conditions that can also impact the perception of value.

At its simplest, segmentation is typically about customer account size: a business will give better pricing terms to a large customer than to a small one. A more sophisticated approach will look more strategically at a customer’s potential and at whether a market segment is attractive or not. At the most sophisticated extreme, a business has systematically worked out a value-based pricing segmentation: it fully understands its value to its customers and can defend its position against competitive alternatives.

 

3. Organisation or Functional Bias

 

Are the organisational influencers of pricing coming more from Finance, Sales or Marketing? The pricing agenda, and therefore any implemented segmentation, will typically be affected. From a Finance perspective I will be looking, for example, at my gross margin and how the price relates to my costs. I may also be pushing for increased prices to improve margin, without necessarily fully understanding the market context and pressures. Within the Sales function there is a tendency to look for aggressive pricing to close deals, and this often pushes the organisation towards a customer-account-size segmentation. Ideally, the Marketing function is able to look beyond the more short-term influences that can come from Sales and Finance and drive through a segmentation that takes account of customer and market value differentiators as well as the product life cycle.

 

4. Segmentation Velocity

 

Segmentation Velocity 

In practice, making the segmentation a purely manual exercise is likely to end in failure. The best approach is to frame the exercise with a logical hierarchy, from which various algorithms can then drive the detail of the segmentation. For some businesses this will be market and customer led; for others it may be product based. So yes, part of the work is manual, but only at a high level; the detail is then allocated based on a number of business rules. Ultimately, a rules-based segmentation is the best way to keep the segmentation framework up to date and relevant, as the sketch below illustrates.
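As a minimal, hypothetical sketch of the idea in R (the columns, thresholds and segment names below are invented purely for illustration, not a prescribed framework):

# Hypothetical rules-based allocation: each record is assigned to a high-level
# segment by explicit business rules that can simply be re-run as the data changes.
customers <- data.frame(
  customer_id      = 1:6,
  annual_revenue   = c(1.2e6, 3.5e4, 8.0e5, 5.0e3, 2.1e5, 9.9e4),
  channel          = c("Direct", "Distributor", "Direct", "Online", "Distributor", "Online"),
  stringsAsFactors = FALSE
)

assign_segment <- function(revenue, channel) {
  if (channel == "Online") return("Online standard pricing")
  if (revenue >= 5e5)      return("Strategic account")
  if (revenue >= 1e5)      return("Core account")
  "Tail account"
}

customers$segment <- mapply(assign_segment, customers$annual_revenue, customers$channel)
customers

Because the allocation is a function of the current data, the segmentation stays up to date by re-running the rules rather than by manual re-classification.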

 

Author: Patrick Mosimann

 

Posted on February 18, 2017 by Danielle Mosimann

Using Simulation to Determine Sample Sizes for a Study of Store Sales

Suppose a client wants to estimate the total sales value of widgets in a large number of stores. To do this, they will survey a sample of that population of stores. You need to provide the client with advice on choosing a suitable sample size.

Unfortunately, the client has little information to help you. They know that there are 1,000 stores that sell widgets, but they have no idea what the average store sales might be. All they know from previous studies is that the sales tend to be very right-skewed: most stores sell very few widgets and very few stores sell a lot of widgets.

This is a fairly typical situation. We deal with a lot of sales data at AlignAlytics and typically find sales volumes and values (allowing for sale price to vary between sellers) to be very right-skewed. Sales volumes are often well described by a Poisson distribution; a Pareto or a chi-square distribution often works well for sales values.

So, let’s suppose the client tells us that they expect the sales value per store to be distributed something like this:

Histogram of Sales

That looks very much like a chi-square distribution with two degrees of freedom. So we run the following R code:

 
# Create the distribution function.
r_dist_fn <- function(n) rchisq(n, 2)

# Get the dataframe of confidence intervals.
df_ci <- estimate_ci_from_sample(r_dist_fn, pop_size=c(1000), min_sample=10, max_sample=50,
                                 n_simulations=100000, confidence=c(50, 80, 90, 95))

That gives us a dataframe with rows that look like this:

Confidence Intervals Dataframe

This tells us, for instance, that if we use a sample of 20 stores, there is a 90% chance that the total sales in the population of stores is between approximately 72% and 150% of the estimate based on the sample.
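For example, the row behind that statement can be pulled straight from the dataframe (using the column names produced by estimate_ci_from_sample, listed later in this post):

# The 90% interval for a sample of 20 stores drawn from a population of 1,000 stores.
subset(df_ci, population_size == 1000 & sample_size == 20 & confidence == 90)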

Here's a bit of code that graphs that dataframe of confidence intervals:

 
par(ask=TRUE)
for (pop in sort(unique(df_ci$population_size))){
   
   # Subset df_ci by population size and get the confidence intervals calculated for that subset.
   df_ci_sub <- df_ci[df_ci$population_size==pop,]
   confidence_interval <- sort(unique(df_ci_sub$confidence))

   # Create an empty plot of the required size.
   plot(x=c(min(df_ci_sub$sample_size), max(df_ci_sub$sample_size)), 
        y=c(min(df_ci_sub$pop_total_div_estimated_total_ci_lower), max(df_ci_sub$pop_total_div_estimated_total_ci_upper)), 
        main=paste0("Confidence Intervals (", paste(confidence_interval, collapse="%, "), "%) for Population Total / Sample Total (Population: ", pop, ")"),
        type='n', xlab="Sample Size", ylab="Population Total / Sample Total")
   
   # Loop across the confidence intervals.
   for (ci in confidence_interval){

      # Graph a confidence interval.
      df_ci_sub_sub <- df_ci_sub[df_ci_sub$confidence==ci,]   
      polygon(c(df_ci_sub_sub$sample_size,                            rev(df_ci_sub_sub$sample_size)), 
              c(df_ci_sub_sub$pop_total_div_estimated_total_ci_lower, rev(df_ci_sub_sub$pop_total_div_estimated_total_ci_upper)), 
              col=rgb(0, 0, 1, 0.2), border=TRUE)
      
   }

   # Draw a horizontal line at y=1.
   lines(y=c(1, 1), x=c(min(df_ci_sub$sample_size), max(df_ci_sub$sample_size)))
}
par(ask=FALSE)

And here's the output:

Confidence bands for Population Total / Sample Total

In the above graph, the widest confidence interval is the 95% interval; the narrowest (closest to the horizontal line at y=1) is the 50% interval.

So, as before, there is a 90% chance that the total sales in the population of stores is between approximately 72% and 150% of the estimate based on the sample:

Confidence bands for Population Total / Sample Total with lines showing 90% interval for a 20 sample size

Using the above graph and the dataframe of confidence intervals, the client should be able to choose a sensible sample size. This will involve balancing the cost of increasing the sample size against the accuracy improvement achieved by doing so.

Finally, here's the estimate_ci_from_sample function used above:

 
estimate_ci_from_sample <- function(r_dist_fn, pop_size, min_sample, max_sample, n_simulations=1000, confidence=c(50, 80, 90, 95)){
   # Returns a dataframe of confidence intervals for the sum of a population of real numbers (values given by r_dist_fn) divided by
   # the sum of a sample from that population.
   #
   # r_dist_fn:     A function taking one parameter, n, and returning n random samples from a distribution.
   # pop_size:      A vector of population sizes, e.g. c(100, 1000, 2000).
   # min_sample:    If min_sample is in (0, 1) the minimum sample size is a fraction of the population size. If min_sample is a
   #                positive integer, the minimum sample size is a fixed number (= min_sample).
   # max_sample:    If max_sample is in (0, 1) the maximum sample size is a fraction of the population size. If max_sample is a
   #                positive integer, the maximum sample size is a fixed number (= max_sample).
   # confidence:    A vector of the required confidence intervals, e.g. c(50, 80, 90, 95).
   # n_simulations: The number of simulations to run per population size + sample size combination. The higher this is, the more
   #                accurate the results but the slower the calculation. 
   
   # Useful functions.
   is_int <- function(x) x %% 1 == 0
   sample_int <- function(spl, pop_size) ifelse(is_int(spl), min(spl, pop_size), round(spl * pop_size))
   
   # Check the min_sample and max_sample parameters.
   if (min_sample <= 0 || (min_sample > 1 && !is_int(min_sample))) stop("min_sample must be in (0, 1) or be an integer in [1, inf).")
   if (max_sample <= 0 || (max_sample > 1 && !is_int(max_sample))) stop("max_sample must be in (0, 1) or be an integer in [1, inf).")
   if (is_int(min_sample) == is_int(max_sample) && max_sample < min_sample) stop("max_sample should be greater than or equal to min_sample.")
   
   # Create the dataframe to hold the results.
   df_ci <- data.frame()

   for (population_size in pop_size){

      # Determine the sample size range.
      sample_int_min <- sample_int(min_sample, population_size)
      sample_int_max <- sample_int(max_sample, population_size)
      
      # Yes, it can happen that sample_int_min > sample_int_max, despite the parameter checks, above.
      if (sample_int_min <= sample_int_max){
      
         for (sample_size in seq(sample_int_min, sample_int_max)){
            
            cat(paste0("\nCalculating ", n_simulations, " ", sample_size, "-size samples for population size ", population_size, "."))
            
            # Calculate the pop_total_div_estimated_total vector.
            pop_total_div_estimated_total <- rep(NA, n_simulations)
            for (i_sim in 1:n_simulations){
               population <- r_dist_fn(population_size)
               sample_from_pop <- sample(population, sample_size)
               pop_total_div_estimated_total[i_sim] <- sum(population) / (population_size * mean(sample_from_pop))
            }
            
            # Loop across the required confidence levels.
            for (conf in confidence){
               # Calculate the confidence interval.
               alpha <- (100 - conf) / 100
               ci <- quantile(pop_total_div_estimated_total, probs=c(alpha / 2, 1 - alpha / 2))
               
               # Add a row to the dataframe.
               df_ci_row <- data.frame(population_size                        = population_size, 
                                       sample_size                            = sample_size, 
                                       confidence                             = conf, 
                                       pop_total_div_estimated_total_ci_lower = ci[1], 
                                       pop_total_div_estimated_total_ci_upper = ci[2])
               df_ci <- rbind(df_ci, df_ci_row) 
            }
            
         } # Ends sample_size for loop.

      } # Ends if.

   } # Ends population_size for loop.

   return(df_ci)
}
Posted on May 26, 2016 by Danielle Mosimann

Drawing a Grid of Plots in R — Regression Lines, Loess Curves and More

We provide here an R function that draws a grid of plots, revealing relationships between the variables in a dataset and a given target variable.

Scatterplots in the grid include regression lines, loess curves and the adjusted R-squared statistic.

Boxplots have points indicating the group means. Box widths are proportional to the square root of the number of observations in the relevant group. The p-value is shown for an F-test: p < 0.05 indicates a significant difference between the means of the groups. But don't take this p-value on faith: be sure to check the assumptions of the one-way ANOVA model (a quick sketch of such a check follows the examples below).

Mosaic plots include the p-value of a chi-square test of independence: p < 0.05 indicates that there is a significant relationship between the two variables under consideration. The number of plot cells with a count under five is shown; if this is greater than zero, the chi-square test may be invalid. Here's an example using a continuous target variable:

 
mtcars2 <- mtcars
mtcars2$cyl  <- as.factor(mtcars2$cyl)
mtcars2$vs   <- as.factor(mtcars2$vs)
mtcars2$am   <- as.factor(mtcars2$am)
mtcars2$gear <- as.factor(mtcars2$gear)
mtcars2$carb <- as.factor(mtcars2$carb)
multiplot(mtcars2, 'disp', c(2, 5))

Drawing a Grid of Plots in R

This example has a categorical target variable:

 
multiplot(mtcars2, 'gear', c(2, 5))

Drawing a Grid of Plots in R
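As noted above for the boxplots, the F-test p-values shown by multiplot rest on the usual one-way ANOVA assumptions. Here is a minimal sketch of how one might check them for the disp ~ cyl panel of the first example; the choice of tests is a common default, not something multiplot does for you:

# Check the ANOVA assumptions behind the disp ~ cyl boxplot in the first example.
fit <- aov(disp ~ cyl, data=mtcars2)
shapiro.test(residuals(fit))             # normality of the residuals
bartlett.test(disp ~ cyl, data=mtcars2)  # homogeneity of variance across groups

If either test raises doubts, a non-parametric alternative such as kruskal.test() is a common fallback.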

Finally, here’s the multiplot function:

 
multiplot <- function(df_data, y_column, mfrow=NULL){
   #
   # Plots the data in column y_column of df_data against every other column in df_data, a dataframe.
   # By default the plots are drawn next to each other (i.e. in a row). Use mfrow to override this, e.g. mfrow=c(2, 3).
   #
   
   # Set the layout
   if (is.null(mfrow)) mfrow <- c(1, ncol(df_data) - 1)
   op <- par(mfrow=mfrow, mar=c(5.1, 4.1, 1.1, 1.1), mgp = c(2.2, 1, 0))
   on.exit(par(op))

   for (icol in which(names(df_data) != y_column)){
      x_column <- names(df_data)[icol]
      y_x_formula <- as.formula(paste(y_column, "~", x_column))
      x_y_formula <- as.formula(paste(x_column, "~", y_column))
      x <- df_data[[x_column]]
      y <- df_data[[y_column]]
      subtitle <- ""
      
      if (is.factor(x)){
         if (is.factor(y)){
            # Mosaic plot.
            tbl <- table(x, y)
            chi_square_test_p <- chisq.test(tbl)$p.value
            problem_cell_count <- sum(tbl < 5)
            subtitle <- paste("Chi-Sq. Test P:", round(chi_square_test_p, 3)," (< 5 in ", problem_cell_count, " cells.)")
            plot(y_x_formula, data=df_data)
         } else {
            # Vertical boxplot.
            fit <- aov(y_x_formula, data=df_data)
            f_test_p <- summary(fit)[[1]][["Pr(>F)"]][[1]]
            subtitle <- paste("F-Test P:", round(f_test_p, 3))
            boxplot(y_x_formula, data=df_data, horizontal=FALSE, varwidth=TRUE)
            means <- tapply(y, x, function(z){mean(z, na.rm=TRUE)})
            points(x=means, col="red", pch=18)
         }
      } else {
         if (is.factor(y)){
            # Horizontal boxplot.
            fit <- aov(x_y_formula, data=df_data)
            f_test_p <- summary(fit)[[1]][["Pr(>F)"]][[1]]
            subtitle <- paste("F-Test P:", round(f_test_p, 3))
            boxplot(x_y_formula, data=df_data, horizontal=TRUE, varwidth=TRUE)
            means <- tapply(x, y, function(z){mean(z, na.rm=TRUE)})
            points(x=means, y=1:length(levels(y)), col="red", pch=18)
         } else {
            # Scatterplot with straight-line regression and lowess line.
            adj_r_squared <- summary(lm(y_x_formula, df_data))$adj.r.squared
            subtitle <- paste("Adj. R Squared:", round(adj_r_squared, 3))
            plot(y_x_formula, data=df_data, pch=19, col=rgb(0, 0, 0, 0.2))
            abline(lm(y_x_formula, data=df_data), col="red", lwd=2)
            lines(lowess(x=x, y=y), col="blue", lwd=2) 
         }
      }
      title(sub=subtitle, xlab=x_column, ylab=y_column)
   }
}
Posted on March 21, 2016 by Danielle Mosimann

The Changing Face of Vendor Analytics

Our most recent vendor project was an interesting change in direction compared to several vendor related projects we have previously worked on. We were asked to build out a vendor reporting capability that went beyond simple spend analytics and also brought in data from online sources such as Twitter, Google, Bloomberg, Reuters and Facebook.

This project brought forward interesting trends not just in the area of vendor analytics but also in how datasets that underpin traditional reporting areas such as sales and budgeting are likely to expand in scope to include more and more data from online sources.

Some companies are constantly meeting vendors and they need to make sure that they are asking the right questions and signing off the correct deals in these meetings. For this they need their staff to understand more than just the historical spending with the vendor. They need to know how that company is represented in the news, what key events the vendor has been involved in, such as mergers or financial results, and what people are saying about them.

Historically, BI solutions have focused on summarizing and visualizing the internal data side of the business – sales, spending, CRM… Users would then supplement this with their own knowledge and research on customers, competitors and suppliers to build up an understanding of their environment. Recently, however, our aim was to improve how users gather information from research. To achieve this, a BI solution needs to capture a wide variety of data sources, which are then analysed, aggregated and presented back to the user in an easily understood way. By automatically combining this with spend data, you can also allow users to better understand the relationships between the datasets.

New role of BI Solutions

In order to pull all this together, there are 4 key areas that need to be built out:

    • Text mining – you need a way of summarizing the large amounts of unstructured content brought in from online data sources – after reviewing several options we went with AlchemyAPI.
    • Unstructured data storage – a flexible NoSQL store is needed for the large volumes of unstructured online data – for this we used ElasticSearch (see the findings below).
    • Data mashing – a more traditional database layer is needed to combine summary results from the unstructured data with internal vendor spend data – for this we stuck with SQL Server.
    • Reporting layer – to deliver the solution we used Tableau to create a series of reports that allowed users to interact with the combined data.

 

Our final architecture looked something like this:

Data flow architecture

 

This project has led to several useful findings:

    • Overall, the area of vendor analytics is enhanced by blending the spend data with online data sources. Events such as a vendor being acquired by another company, a successful project collaboration or a sales event need to be visible in the tool’s output.
    • The ability of AlchemyAPI to mine insights from text content is critical. This includes sentiment analysis but also tackles entity extraction – the process of relating people, places, companies and events to articles.
    • With AlchemyAPI you don’t have to store the content of every article (which is also why we chose it as the best tool for text analytics). You can simply send AlchemyAPI the URL of the relevant article and it analyses the content – other solutions require you to capture the full content of an article and send it to their applications.
    • ElasticSearch delivers what is needed from a NoSQL database with its flexibility to store and analyse large scale unstructured data from multiple sources. Its ability to allow multiple processes to collate and analyze data, simultaneously in real time, gives it significant advantages over other data storage solutions.
    • Having built several solutions in Tableau, we are aware of its traditional strengths. However, for this kind of project it is the ability to store web links in a dashboard, which users can then access, that is particularly useful. So if a spike in negative sentiment occurs for a supplier, a user can quickly navigate from a trend chart in Tableau to a summary of the article’s content, also stored in Tableau, and ultimately to the most useful articles online.

 

In conclusion, we found that vendor analytics can be enhanced by combining traditional spend data with online content. Combining unstructured online data with spend and sales data is likely to become the norm in future BI developments as companies seek to fill the gaps that internal data cannot fill on its own.

 

Author: Angus Urquhart

Posted on March 1, 2016 by Danielle Mosimann

Is there ever a good time to do a price increase?



The short answer: NEVER and ALWAYS!

The longer answer:

A price increase is always difficult to achieve successfully and yet doing nothing is a gradual recipe for financial disaster. Why? Your costs are never static, so within 5 years your profit margin could easily be ZERO.

Cost Inflation Impact on Profit without Price Increase
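As a rough illustration of that arithmetic (the revenue, cost base and inflation rate below are assumptions for the example, not client figures):

# Assumed for illustration: prices held flat, an 85% cost base, 3% annual cost inflation.
revenue <- 100
costs   <- 85 * 1.03 ^ (0:5)                  # cost base over the next five years
margin  <- round((revenue - costs) / revenue * 100, 1)
data.frame(year = 0:5, margin_pct = margin)   # the margin erodes towards zero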



Of course good cost and supplier management can counteract this trend and is a very common and effective strategy BUT, ultimately you cannot “cut” your way out of a profit gap without damaging the long term viability of the business.

 

Price Increase Cartoon

 

Price increases are difficult because no one believes they are easy to implement, acceptable to the customer or competitively achievable.

It is certainly not easy to analyse and think through the various implications of a price increase. OK, 10 customers and 10 products might be easy enough (10×10 = 100 customer-price combinations), but since most businesses are dealing with hundreds of thousands of price combinations, the implementation complexity is usually very significant.

Also, the taboo of communicating a price rise usually provokes strong emotions and outright fear among sales reps: “What will the client say?”, “I’m going to be crucified by procurement”, “I’m going to lose the contract (and my bonus)”…

The internal resistance, especially by the front line, is therefore understandable and real, often leading to internal political lobbying to neutralize or partially counteract a price initiative. All this can become quite heated and damaging if left unresolved.

Of course, both camps of the argument, to increase prices or not, are partially right. The competitor angle especially tends to be the key argument against any price increase – “we are going to lose share”, “the competitive alternative is better value at that price”. And, while this is obviously very true, it is also a common and dangerous excuse. This rational argument about a competitive threat can become a slogan around which all the other objections and resisting stakeholders rally, nailing their defensive flag to the “no price increase” mast.

As always, the issue is to find the right balance and truly understand where the price increase opportunities lie. The high-volume product items (SKUs) are candidates that need extreme caution, since they are typically the headline products on which most clients and competitors can compare and undercut. But in that complex mire of product detail there are typically interesting opportunities and pricing nuggets.

Looking at the price increase challenge from the point of view of mining the data complexity usually offers up some interesting and tactically defensible opportunities. These will vary in rationale, for example: legacy products, servicing opportunities, supply chain performance, product or customer tail, volume commitments etc. Thinking through a range of business rules and criteria and positioning these correctly will offer a number of ongoing reasons for increasing prices that are both defensible and sustainable.

Taking such a granular and targeted approach to price increases will offer a multitude of small incremental options. It typically leverages multiple layers of segmentation and can quite easily add 2-3% of margin to the bottom line. By managing a tailored, adaptive and structured approach to pricing that uses complexity to its advantage, you can avoid many of the pitfalls of reckless price increase initiatives.

In short, the best approach to pricing is finding opportunities in the difficulties and constraints that others prefer to avoid. A business’s capability to navigate complexity will also avoid the trap of a top down reactive edict that compels the business to increase prices or drive effort towards premium segments without a sound approach and rationale.

So yes, there is NEVER a good time to do a price increase, but equally you ALWAYS need to consider doing so, despite the difficulties and the reasons not to!

Author: Patrick Mosimann

Posted on January 12, 2016 by Danielle Mosimann

An Elasticsearch Journey

Over the last three to five years, big data technologies have become an increasingly important factor in analytics. A few years ago our company recognised that, as an advanced analytics provider, our skills had to include working with big data. To achieve our best analysis we had to move beyond traditional SQL to unstructured data, and the team explored different platforms that would enable us to do this. Elasticsearch stood out initially because it is structure agnostic and can store ALL types of data, and so we embarked on a journey of learning with this application. By investing our time and skills in Elasticsearch, our company was investing in its long-term abilities, as success in the ever-changing analytics landscape requires growth alongside your tools.

Over the years of using Elasticsearch, both our company and Elasticsearch itself have vastly improved their capabilities, so this piece covers the key projects in our journey and how Elasticsearch facilitated them. Upcoming blog posts will explore each of these big data projects, and the benefits and features of Elasticsearch that enabled them, in more detail.

Below, “ES” is used as shorthand for Elasticsearch.

Elasticsearch as a Data Store (2012)

Until a big data proof of concept for a travel industry client, we had mostly used SQL. However, SQL has significant cost implications when scaling up, and we needed a highly concurrent application that could write quickly to a data store without costing an arm and a leg.

CouchDB could get the data in but it wasn’t easy to extract the data in a meaningful way. Although exporting the data from ES (version ~0.10) was a challenge, as a data store it was beneficial in a number of ways:

  • Easy to write data
  • Brilliant search functionality
  • Open source

Elasticsearch as a Reporting Store (2013)

In an analytics and audit project for another online travel agency, we analysed web server logs to work out conversions and success rates from an enormous amount of online data. We needed not only to write data, but to import, analyse and report on 250 million records of unstructured data, once again without costing an arm and a leg.

From using ES (~0.19) as a reporting store, the team discovered that ES:

  • Allows for a high rate of ingestion
  • Keeps data compact and therefore storage costs low
  • Only enables analysis of data by means of “facets” – an aggregated level of data based on a search query

With limited means of analysis within ES, we moved the relevant data to Redshift in order to analyse and segment it. This was a high-cost option, however, making up 10%-15% of our total project cost. When it came to reporting we returned to our beloved platform, using both ES and its dashboard, Kibana, to provide high-standard reporting for the project.

Elasticsearch as an Analytical Store (2014)

We needed to find a better way to analyse the necessary data within ES, so the team developed two Python libraries for internal analysis using ES, Python and Pandas: Pylastic, a Python wrapper around ES, and Pandastic, which wraps Pylastic for use with Pandas. We gained:

  • A unified data layer
  • Simplified querying. SQL instead of JSON for our data scientists.
  • Bespoke terms to better fit our company, taking away Elasticsearch jargon.
  • Easy data extraction.

Pylastic Example
On the left, a sample raw Elasticsearch JSON query. On the right, sample Pylastic code that writes the same query.

All our data writing, storing, reporting and analysis for big data projects could now be done in ES, which enabled us to deliver results to clients faster, as we didn’t have to move data around as much as in the past.

Elasticsearch and GeoSpatial Analysis

Another big data project was for a client who wanted a database for their salesforce, to provide knowledge of where to distribute their products in Nigeria. The project involved mapping all outlets in the country. Interviewers in the field carried handheld devices that recorded surveys, geolocations and images, and their area coverage was targeted. The team had to store and analyse this vast amount of recorded data in order to put the database together. Once again ES was chosen over SQL due to:

  • Presence of multiple data sources, including survey data, log data and images. ES can store ALL types of data.
  • A requirement for a large number of fields/columns. (>1000)
  • Geospatial features which supported the necessary geospatial queries. For example, we were able to specify geopoint & polygon data types in ES.
  • Evolving data. ES had no problem with the changing survey data, such as adding new fields, throughout the project.
  • ES could support “live reporting” and fast real time querying.
  • We expected the reporting data to be BIG and ES was well placed to manage this volume. Currently there are over 300 million records and we have only ~1/5 of the country covered.

Interviewer Summary Report
Interviewer Summary Report: shows the path and the recorded stores for each interviewer by day.

Elasticsearch Benchmarking

As ES had become integral to our data and analysis work, we were constantly looking for ways to improve performance and so deliver valuable insights to our clients more rapidly. The team engaged in an ES benchmarking exercise on Bigstep’s Full Metal Cloud infrastructure, running ES queries on 10 million documents (approximately 4GB of compressed data). We knew the metal cloud would perform better than traditional cloud-based dedicated servers, but the results were remarkable: consistently 100-200% better than our existing infrastructure. As the queries became more complex, such as the geo-distance calculations for our geospatial analysis, the performance difference grew even larger.

The main factors behind Bigstep’s superior performance:

  • Wire-speed network
  • Hand-picked components
  • All-SSD storage based on enterprise drives

An Elasticsearch Future

Reviewing our journey with ES shows that it has delivered on a breadth of projects and empowered us to vastly improve our big data and unstructured data analysis knowledge and skills. It has become a crucial tool for our team and will continue to be so as our capabilities become even more advanced. Although we’re not really the type of company to make bold proclamations, we would probably call ourselves Elasticsearch champions, as our team can’t seem to recommend it enough.

Author: Danielle Mosimann

Posted on June 6, 2015 by Danielle Mosimann

Big Data Journey: A few battle scars later

We interviewed Amit and Adam from our Advanced Analytics Innovation Lab, two of our data science leaders who have been involved in our Big Data Journey. They discuss battle scars and what they have learned, from the 4 Vs, data capture, storage and security to Hadoop, Redshift and Elasticsearch.

A Big Data Interview

Q: What do you think has been the biggest challenge? Has it been the integration and unification of some of that data? And the different variety of data we’re dealing with? Is it being able to handle it fast? Is it being able to store it?

I think the first challenge was actually getting our head around where the best place to start was. There’s quite a lot of buzz words and quite a lot of different ideas which are all related to big data. Each project is different. It took us a long time to get to understand all the different components that make up that ecosystem and decide which bits are used first. So when you go into a certain type of project should you start with Hadoop or should you start with Elasticsearch? They do different things. It’s getting yourself up that learning curve of what each of those niches are and where you use them for different things. That was a real learning curve for us. After two years we now feel that we’re there and we’ve done a lot with different systems and feel a lot more comfortable to choose the right tools for the job.

Would you choose Hadoop? Would you choose Redshift? Or if you’re doing visualisation maybe you choose to base it on Elasticsearch because of the quickness? The speed of return of the queries and things like that.

“Ecosystem” is a great description for working with big data because it’s not a linear process. It’s an ecosystem you have to grow. And there isn’t one tool that solves everything. As we’ve been evaluating different tools they’ve been on a journey too. Like Elasticsearch, they’ve come a long way from when we first started using them. They’ve learnt from our use cases in which they saw our speed issue, our storage issues and in terms of some of the adhoc querying as well. But then in terms of ecosystem, we have a lot of SQL analysis tools here as well. People who know SQL. So we have played quite a bit with Redshift as well.

There are different stages in projects. We often start with getting things going really quickly and trying to understand things. And there are certain tools that are a lot quicker to get going with. Tools that are used are based on the client need and the problem we need to solve.

Q: So all of this is constantly changing – we’re constantly evaluating. But how has it been with clients? In my experience it’s been trying to get clients to understand that their data policies need to change. Changing their perception of storage and retention so they are able to defensively delete certain aspects that they don’t need. But also being able to capture things that they might not have captured in the past so as to give a rich story, what they want to read from their analysis. How has your experience been from a technical perspective?

Even beyond a technical perspective it’s about understanding what data you need to capture and how long you need to store it for. We found that actually a lot of companies need to go through the process of trying to understand what they need to get the best value from their data. What they need to keep, how they best store it and the security they need to put in place. Deciding the teams as well, to work with it. With the technical aspects, things like security are not as hard as deciding policies in the beginning and getting that structure in place. We found that going through that process of understanding what you need can be the most difficult thing.

The questions come. Why do we have to do certain things? Why do you want that much data? How are you actually going to transfer that data? So all 4 Vs come into play, not just for us to understand but for the client to understand.

Tackling how you use cloud computing if you have sensitive data. Do you mask the data? What actually is or what isn’t sensitive? The IT infrastructure of a company will have their policies but how do they evaluate if something is secure or not? It’s been an issue we’ve had to work through a number of times.

How you move these massive sums of data between different cloud solutions or even within regions within cloud solutions can often be a problem if you have terabytes and terabytes of data. It can be pretty difficult when keeping security in mind and ensuring that process is secure. We came across some products and tools which can move absolute bucketloads of data in seconds and with security in place which we wouldn’t have come across if we weren’t in this space. That sort of thing’s only possible once you get in there and go on that journey.

Q: How has the leap from traditional SQL to dealing with big data been?

We’ve been through a bit of a transition here. There has been a lot of work around different technologies which aren’t directly related to the big data space, which other people have been working on: JavaScript visualisation, for example, which definitely complements our work. There’s a lot of different work going on in various areas that all comes together in the same sort of field. We’ve progressed as a company in massive leaps and bounds.

Q: Are statistics capabilities easier with the big data ecosystem?

We have a number of stats guys and they’re getting into that role of doing things not just in small scale but on large scale and using the whole set available rather than sampling. Definitely advances in that area as well.

In terms of Big Data, there have been those who say that everyone’s talking about it but nobody really knows how to do it. I like to think we’ve graduated.

In the last 2 years we’ve gone from talking conceptually about big data and all these tools to ticking a lot of different boxes in terms of use cases. The fact that we have different teams working on different projects means we get a large variety of use cases. In that sense we almost know what we’re talking about. The reason I say almost is because it’s an ongoing journey. There are certain things we haven’t touched upon yet and we can always do more!

Posted on March 18, 2015 by Danielle Mosimann

Goldilocks and the D3 Bears

As open-source software goes from strength-to-strength, one area particularly close to my heart is that of charting libraries and APIs.  Like many areas of software, the open-source community has revolutionised the way that data is presented on the web.

Arguably the most significant open-source visualisation project is D3 (data-driven documents); however, many an inexperienced JavaScript developer’s hopes have been dashed against the rocks of D3. It is NOT a charting API. It’s a library which helps a developer map data to elements on a web page (think jQuery for data, if you are technically minded). That means you CAN create charts; in fact I’d say you can create any chart you could ever imagine, and much more easily than with raw JavaScript. However, the spectrum of JavaScript programmers is broad, and I would argue that D3 still requires a fairly high level of skill if you want to do something from scratch. In the hands of an expert the results are magnificent, but sadly the majority of D3 implementations I have seen appear to have been built by taking an example and hacking at it until it fits the required data.

To address this, a number of open-source projects have emerged with the specific goal of drawing charts in D3. Their restricted scope leads to greater simplicity and opens the door to many more users. However, when I came to look for an API which met the needs of our analysts, many of whom come from an Excel rather than a JavaScript background, we couldn’t find that crucial Goldilocks zone between complexity and limitation. This is what spurred us to create our own. The result is dimple, a JavaScript library which allows you to build charts using a handful of commands. The commands can be combined in myriad ways to create all sorts of charts, and the results can be manipulated with D3 if you need to do something really unusual. The main limitation is that it only supports charts with axes for now (pie charts are in the works), but it works in a way which ought to be easily understood by anybody with some basic programming knowledge.

Dimple Price Range Chart

The example above is from the advanced section of the site, but still has fewer than 20 lines of JavaScript. To get started with a simpler example, copy and paste the code below into Notepad, save it as MyChart.html, open it in your favourite browser and then sit back and admire your first bar chart.

   <!-- Save this whole snippet as MyChart.html. The two script tags below load D3 and
        dimple from their project sites; the dimple version in the URL is an assumption
        and may need updating to the latest release. -->
   <script src="http://d3js.org/d3.v3.min.js"></script>
   <script src="http://dimplejs.org/dist/dimple.v2.3.0.min.js"></script>
   <script>
     // Draw an 800x600 SVG on the page and plot a two-bar chart from inline data.
     var svg = dimple.newSvg("body", 800, 600);
     var data = [
       { "Word":"Hello", "Awesomeness":2000 },
       { "Word":"World", "Awesomeness":3000 }
     ];
     var chart = new dimple.chart(svg, data);
     chart.addCategoryAxis("x", "Word");
     chart.addMeasureAxis("y", "Awesomeness");
     chart.addSeries(null, dimple.plot.bar);
     chart.draw();
   </script>

The brevity is good but it’s the flexibility and readability which we were really shooting for.  So try switching the letters “x” and “y” on the add axis lines and you get a horizontal bar chart, change “bar” in the “addSeries” line to “bubble”, “line” or “area” and you’ll get those respective chart types.  Or better still copy the “addSeries” line and change the plot to get a multiple series chart.  You can go on to add multiple axes, different axis types, storyboards (for animation), legends and more.  For ideas see the examples, or if you are feeling brave the advanced examples which I try to update regularly.

Author: John Kiernander

Posted on January 31, 2015 by Danielle Mosimann

A Single Version of the Truth – Is it Just a Myth?

Why do we hear companies talking about a “single version of the truth”? Because of the frustration they have experienced when multiple people argue about which numbers are correct rather than focusing on what the metrics mean. Finding out what the metrics really mean would allow them to improve operational performance and business results. They want data consistency so they can understand trends, variances, causes and effects. They want easy and quick access to information they can trust. They do not want to wait days for data from IT or an overworked analyst, data they may not even be able to rely on.

In many companies existing data warehouses and reporting systems are so fragmented and widely dispersed that it’s impossible to determine the most accurate version of information across an enterprise. A proper information strategy with a solid MDM and infrastructure often takes years to develop and requires substantial upfront investment. In the meantime, companies’ departments are left to develop their own short-term solutions resulting in too many data sources reporting different information. This information is incomplete, lacks structure and is sometimes even misleading.

Single version of the truth
A single version of the truth – is it just a myth?

Imagine a mid-sized, fast-growing international business with a strong portfolio of brands. They know very well how to manufacture a good-quality product and pitch it effectively to consumers. However, they struggle with large amounts of data sitting in multiple Excel spreadsheets and legacy systems, without access to analytics and consistent reporting. Central management often does not have much of an in-depth view of what is happening in local markets. It takes a long time, and some frustration, to get a simple market-share data point, not to mention the time and frustration needed to gain insight into competitor and product performance on a regular basis.

Would it not be great to have a central, single source of reliable and consistent information, enabling quick and easy access and reporting, reducing manual work and delivering performance results faster? It is possible. You don’t have to ‘boil the ocean’ and try to incorporate all existing data at once. Start with market or sales data to get a consistent and accurate view of the critical KPIs, to improve segmentation and to gain insight into key areas before adding on more…

With new technology, new methodologies for data unification and the emergence of data visualisation tools that are revolutionising decision making, that panacea of the “single version” isn’t just a dream. Whether through your legacy vendors or open-source options, this new wave of technology enables delivery of the right information and analysis to the right person, at the right time and on a regular basis. It can help to overcome the need for large infrastructure investment while you develop your metrics, stakeholder strategy and reporting requirements.

It’s not an impossible dream, and although the range of options, methodologies and technologies might feel like an overwhelming wave, you can ride the swell towards an optimal solution.

Author: Nadya Chernusheva

Posted on November 24, 2014 by Danielle Mosimann

Analytics, Decision Making & Wine

As our society and economy have evolved, we’ve become accustomed to having an abundance of options in just about any decision we must make. However, it’s the excess of alternatives we are constantly confronted with that often complicates and delays decision making in our personal and professional lives. For example, I went out to dinner the other night and wanted to have a glass of wine with my meal. The waiter handed me a book an inch and a half thick containing their vast array of wine selections. Instead of wading through the pages, I quickly came up with a set of criteria to help me focus and determine my selection.

wine-and-food-pairing-chart

To start, white and rosé wines were immediately eliminated. I only drink white wine if I’m eating fish. Since I knew that I wasn’t going to order fish, it was simple for me to eliminate the whites (I ordered a pasta appetizer and beef entrée in case anyone is interested). Rosé isn’t really my thing unless I’m at an outdoor party in the summer and it’s mixed with fruit (à la homemade sangria).

I then narrowed my selection according to the type of taste and texture I wanted to experience on this particular night: I was in the mood for a smooth, evenly balanced, medium-bodied, but not too fruity taste. This criterion narrowed my quest to the great varietals of Pinot Noir and Chianti. Because I had ordered a pasta-based appetizer, my search led me to select a glass of Chianti (this also went great with the wood-fired Tuscan-style bread and homemade olive oil).

Finally, I assessed the value and cost (this is often where most people start every decision, particularly in business). I selected a $13 glass which was about middle of the road for the Chianti price point range.  Boom! I just solved my wine selection problem in less than a minute using simple qualitative analytics and all I had to do was establish a core set of criteria that fit my personal needs.

Businesses should approach decision making in a similar fashion. By establishing a list of factors that matter to your organization today and that will also matter in the future, it will allow you to differentiate yourself amongst competitors and result in continuous growth.  Begin collecting data surrounding these factors, constantly evaluate the outcomes of your decisions and modify/tweak your approaches.  Let’s put this into some context.

Say for instance you’re an executive at a multinational manufacturer and part of your strategy is to strive for continual efficiency through operations.  You may decide to invest in multiple Business Intelligence (BI) tools in order to meet this strategic initiative. The question then becomes who, where, and how should your dollars be invested to maximize the greatest return? Again, an abundance of alternatives exist.

In order to solve this problem, the organization may decide to embark on creating a BI roadmap and assess factors that will determine the analytical capabilities of the current operation and where they should go in the future. For instance, the manufacturer may want to assess the availability and timeliness of information: is the information delivered to users when they need it to do an effective job? Drilling down further, you may then assess the information’s relevancy. Does what I receive even matter in the context of my operating unit? If not, why would I continue to receive such information, and what solutions are out there to resolve this issue? These are typical follow-up questions upon further evaluation. Asking simple yes/no questions such as “does the current technology allow me to view information in real time?” can be just as insightful, particularly for a manufacturing production facility.

Decision making doesn’t have to be challenging or scary. If you take the time to set up a repeatable model that fits your needs, subject to regular evaluation and refinement, you can begin to solve simple (i.e. what am I going to eat for dinner tonight?) or complex (i.e. what new markets should we be competing in during the next 1, 3, or 5 years?) issues with greater speed and accuracy.

So, now that you’ve decided that analytical decision making is vital to your personal and professional success, let’s toast over a glass of wine (red preferably)!

Author: Gabe Tribuiani 

Posted on September 4, 2014 by Danielle Mosimann

Seasonal Decomposition of Time Series by Loess—An Experiment

Let’s run a simple experiment to see how well the stl() function of the R statistical programming language decomposes time-series data.

An Example

First, we plot some sales data:

 
sales<-c(39,  73,  41,  76,  75,  47,   4,  53,  40,  47,  31,  33,
         58,  85,  61,  98,  90,  59,  34,  74,  78,  74,  56,  55,
         91, 125,  96, 135, 131, 103,  86, 116, 117, 128, 113, 123)
time.series <- ts(data=sales, frequency = 12, start=c(2000, 1), end=c(2002, 12))
plot(time.series, xlab="Time", ylab="Sales (USD)", main="Widget Sales Over Time")

Observe the annual seasonality in the data:

A time series of widget sales

We apply R's stl() function ("seasonal and trend decomposition using Loess") to the sales data:

 
decomposed  <- stl(time.series, s.window="periodic")
plot(decomposed)

This decomposes the sales data as the sum of a seasonal, a trend and a noise/remainder time-series:

A time series of widget sales decomposed into seasonal, trend and noise/remainder components

We may easily extract the component time series:

 
decomposed <- stl(time.series, s.window="periodic")
seasonal   <- decomposed$time.series[,1]
trend	   <- decomposed$time.series[,2]
remainder  <- decomposed$time.series[,3]

This allows us to plot the seasonally-adjusted sales:

 
plot(trend+remainder,
main="Widget Sales over Time, Seasonally Adjusted",
ylab="Sales (USD)")

A time series of seasonally-adjusted widget sales

An Experiment

How well does stl() extract trend and seasonality from data? We run three simple graphical investigations.

Case 1: Strong seasonality and low, normally-distributed homoskedastic noise

An experiment in decomposition by Loess of a time series showing strong seasonality and low, normally-distributed homoskedastic noise

The left side of each of the above images shows, from top to bottom:

  1. Generated sales data.
  2. The trend component from which the data was generated.
  3. The seasonal component from which the data was generated.
  4. The noise/remainder component from which the data was generated.

The right side shows:

  1. Generated sales data.
  2. The trend component identified by stl().
  3. The seasonal component identified by stl().
  4. The noise/remainder component identified by stl().

Note the close match between the two trend components and between the two seasonal components. This indicates that stl() works well in this instance.

Case 2: Weak seasonality and high, normally-distributed homoskedastic noise

An experiment in decomposition by Loess of a time series showing weak seasonality and high, normally-distributed homoskedastic noise

Again, stl() appears to work quite well.

Case 3: Weak seasonality and high, normally-distributed heteroskedastic noise

An experiment in decomposition by Loess of a time series showing weak seasonality and high, normally-distributed heteroskedastic noise

And stl() still seems to work fairly well. This is heartening, as it's common for the variance in a time series to increase as its mean rises—as is the case here.

How stl() Works

When calling stl() with s.window="periodic", the seasonal component for January is simply the mean of all January values, the seasonal component for February is the mean of all February values, and so on. Otherwise, the seasonal component is calculated using loess smoothing (discussed below).

Having calculated the seasonal component, the seasonally-adjusted data (the original data minus the seasonal component) is loess-smoothed to determine the trend.

The remainder/noise is then the original data minus the seasonal and trend components.
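
As a quick sanity check (our addition, not from the original post), the three extracted components should sum back to the original series:

 
# The columns of decomposed$time.series are seasonal, trend and remainder;
# summed row by row they should reproduce the original data
decomposed <- stl(time.series, s.window="periodic")
recomposed <- rowSums(decomposed$time.series)
all.equal(as.numeric(time.series), as.numeric(recomposed))  # expect TRUE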

The stl() function is quite flexible; a short illustration of the corresponding arguments follows this list:

  • The seasonality does not have to run across a year. Any period may be used for this.
  • The decomposition process can accommodate seasonality that changes over time.
  • A robust decomposition process is available that is less affected by outliers than is the default.
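
For illustration only (the data and argument values below are ours, not from the original post), these options map onto stl()'s arguments as follows:

 
# The seasonal period comes from the frequency of the ts object, so it need
# not be annual: here, a weekly cycle on made-up data
weekly.series <- ts(rnorm(7*20, mean=100, sd=10), frequency=7)

# A numeric, odd s.window lets the seasonal pattern evolve over time;
# robust=TRUE enables the outlier-resistant fitting loop
decomposed.weekly <- stl(weekly.series, s.window=13, robust=TRUE)
plot(decomposed.weekly)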

An Introduction to Loess Smoothing

Loess ("locally-weighted scatterplot smoothing") uses local regression to remove "jaggedness" from data.

  1. A window of a specified width is placed over the data. The wider the window, the smoother the resulting loess curve.
  2. A regression line (or curve) is fitted to the observations that fall within the window, the points closest to the centre of the window being weighted to have the greatest effect on the calculation of the regression line.
  3. The weighting is reduced on those points within the window that are furthest from the regression line. The regression is re-run and weights are again re-calculated. This process is repeated several times.
  4. We thereby obtain a point on the loess curve. This is the point on the regression line at the centre of the window.
  5. The loess curve is calculated by moving the window across the data. Each point on the resulting loess curve is the intersection of a regression line and a vertical line at the centre of such a window.

To calculate a loess curve using R:

 
# Scatterplot of the built-in cars dataset: speed vs stopping distance
plot(cars$speed, cars$dist, main="Car Speed and Stopping Distance", xlab="Speed (mph)", ylab="Stopping Distance (ft)")
# Overlay a lowess curve (lowess() is an older interface to the same idea)
lines(lowess(cars$speed, cars$dist), col="red")

A scatterplot example of Loess fitting
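
To see how the window width in step 1 affects the result (our own sketch, using R's loess() function rather than lowess()), the span argument sets the fraction of the data covered by each window:

 
# Wider spans give smoother curves; narrower spans follow the data more closely
fit.smooth <- loess(dist ~ speed, data=cars, span=0.9)
fit.wiggly <- loess(dist ~ speed, data=cars, span=0.3)

# Redraw the scatterplot, then overlay the two fits
plot(cars$speed, cars$dist, main="Car Speed and Stopping Distance",
     xlab="Speed (mph)", ylab="Stopping Distance (ft)")
lines(cars$speed, predict(fit.smooth), col="blue")
lines(cars$speed, predict(fit.wiggly), col="darkgreen", lty=2)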

Generating Test Data

Here, for completeness, is the code I used to generate the graphs in my tests:

 
# Parameters
start.ym <- c(2000, 1)
end.ym   <- c(2012,12)
n.years  <- 13

# Set the seed for the randomisation
set.seed(5)

# Create the 2nd derivative of the sales trend
ddtrend.sales <- qnorm(runif(12*n.years, 0.1, 0.90), mean=0, sd=0.4)

# Create the 1st derivative of the sales trend from the 2nd derivative
dtrend.sales    <- rep(NA, 12*n.years)
dtrend.sales[1] <- 0
for (i in 2:(12*n.years)) dtrend.sales[i] <- dtrend.sales[i-1] + ddtrend.sales[i]

# Create the sales trend from the 1st derivative
trend.sales    <- rep(NA, 12*n.years)
trend.sales[1] <- 30
for (i in 2:(12*n.years)){
   trend.sales[i] <- trend.sales[i-1] + dtrend.sales[i]
   if (trend.sales[i] < 0) trend.sales[i] = 0
}

# Create the seasonality
seasonality <- rep(c(10, 30, 22, 32, 26, 14, 2, -15, -14, -13, -16, -2), n.years)

# Create the random noise, normally distributed
noise <- qnorm(runif(12*n.years, 0.01, 0.99), mean=0, sd=18)

# To make the noise heteroskedastic, uncomment the following line
# noise <- noise * seq(1, 10, (10-1)/(12*n.years-1))

# Create the sales
sales <- trend.sales + seasonality + noise

# Put everything into a data frame
df.sales <- data.frame(sales, trend.sales, dtrend.sales, ddtrend.sales, seasonality, noise)

# Set graphical parameters and the layout
par(mar = c(0, 4, 0, 2)) # bottom, left, top, right
layout(matrix(c(1,5,2,6,3,7,4,8), 4, 2, byrow = TRUE), widths=c(1,1), heights=c(1,1,1,1))

# Plot sales
tseries <- ts(data=df.sales$sales, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Sales (USD, 1000's)", main="", xaxt="n")

# Plot the trend
tseries <- ts(data=df.sales$trend.sales, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Actual Sales Trend (USD, 1000's)", main="", xaxt="n")

# Plot the seasonality
tseries <- ts(data=df.sales$seasonality, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Actual Sales Seasonality (USD, 1000's)", main="", xaxt="n")

# Plot the noise
tseries <- ts(data=df.sales$noise, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Actual Sales Noise (USD, 1000's)", main="", xaxt="n")

# Decompose the sales time series
undecomposed   <- ts(data=df.sales$sales, frequency = 12, start=start.ym, end=end.ym)
decomposed     <- stl(undecomposed, s.window="periodic")
seasonal  <- decomposed$time.series[,1]
trend     <- decomposed$time.series[,2]
remainder <- decomposed$time.series[,3]

# Plot sales
tseries <- ts(data=df.sales$sales, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Sales (USD, 1000's)", main="", xaxt="n")

# Plot the decomposed trend
tseries <- ts(data=trend, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Est. Sales Trend (USD, 1000's)", main="", xaxt="n")

# Plot the decomposed seasonality
tseries <- ts(data=seasonal, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Est. Sales Seasonality (USD, 1000's)", main="", xaxt="n")

# Plot the decomposed noise
tseries <- ts(data=remainder, frequency = 12, start=start.ym, end=end.ym)
plot(tseries, ylab="Est. Sales Noise (USD, 1000's)", main="", xaxt="n")

Author: Peter Rosenmai

Posted on June 14, 2014 by Danielle Mosimann

Analytics, Big Data and BI or: How I Learned To Stop Worrying And Love The Cricket

One of the challenges of working for a company like AlignAlytics is explaining exactly what it is that one does all day. Nothing scares off a new potential friend quicker than phrases such as ‘data-driven strategy and insight’, accompanied by some vague hand waving, especially if said hand waving usually sends drinks flying. Typically, after several failed attempts at explaining the concepts of customer segmentation and advanced analytics, the standard fallback response is that we spend our days doing reporting & analysis before moving the conversation on to more interesting topics, such as Justin Bieber turning up really late for gigs or the on/off relationship between those two miserable leads from the Twilight movies.

However, while analytics might appear a foreign concept to many people, the truth of the matter is that it has been part of people’s lives for a long time, even if they didn’t know about it. One particular area of modern life in which analytics is widely used is in sport, specifically in its coverage on television and via digital media. It’s also here that concepts such as big data can be most easily comprehended and explained.

One sport close to the heart of this particular author is cricket, a sport built entirely on large numbers of discrete data points. Every single time a ball is bowled, a huge number of different pieces of information are collected – how fast was the delivery? Where did it land? What shot did the batsman play? Which Australian did it dismiss? This is repeated for every ball bowled on every day of (almost) every match around the world. Before you know it, we have a genuine example of this mythical big data that everyone has been talking about.

Of course, collecting data for data’s sake can be its own reward – apropos of nothing, nothing impresses a crowd like owning the entire set of classic Doctor Who DVDs – but it’s the interpretation of all this data that is really the key. Hence the proliferation of visual methods on TV and the internet to help commentators or writers provide insight and clarity, such as this example, a pitch map for a specific bowler:

Stuart Broad Pitch Map

Image Supplied by © Hawk-Eye Innovations

Suddenly, and without really thinking about it, we have analytics. And not just that, analytics based on big data. In order to get to those analytics we’ve used specific software to turn our data into something that we can visually comprehend and interpret. And that’s Business Intelligence (BI) software explained at the same time.

Of course, cricket isn’t the only sport to use these concepts. Football (or soccer as it’s occasionally known in the colonies) is a more recent convert to the idea of big data, albeit in a much more ‘closed shop’ way. The likes of Opta and Prozone provide enormous amounts of data around every single football match in the Premier League (and beyond), with every single pass, shot and run recorded in frightening detail. This data is generally not made available to the public, instead being closely guarded behind closed doors by those football clubs that use it (and largely ignored by those that don’t).

Recently however, Manchester City made large amounts of this data available, encouraging members of the public to do their own analysis and trying to create an ‘analytics community’ in which ideas could be shared. Whilst it’s possible to argue about their motives for this – why pay for an analytics team when hardcore fans will do it all for free and then you can steal their ideas? – it’s clear evidence of the growing significance of analytics (and big data) across different areas of everyday life.

To conclude, perhaps the best way to explain what one does all day is to talk about cricket and its approach to big data, analytics and BI. And then, after several hours of explaining the intricacies, such as the difference between the flipper and the topspinner, casually point out that AlignAlytics generally applies these concepts to the marginally less exciting worlds of consumer goods and utilities. We say generally because everyone needs a hobby for their free time, such as tracking Stuart Broad’s Test career over time:

Bowling average vs Batting Average

Author: Ashley Michael

Posted on March 25, 2014 by Danielle Mosimann

Outlier Reporting and Benefits from Unit Testing in R

A recent AlignAlytics analysis project, reliant on Big Data processing and storage, required complex outlier reporting using the R statistical-programming language. This open-source software, combined with in-house statistical skills, allowed the team to quickly produce reports that are now the foundation of an on-going strategic analysis programme.

Unit testing is one part of this story and we hope Peter Rosenmai can continue to share more with us.

Getting started with unit testing in R

Unit testing is an essential means of creating robust code. The basic idea is simple: You write tests that the functions you code are required to fulfil; whenever you thereafter make changes to your code, you can run the tests to ensure that your functions all still work.

Such future-proofing is obviously useful, but unit testing brings other benefits. It forces you to break your code down into discrete, testable units. And the tests provide excellent examples of how your functions should be called. That can be really useful, especially when code commenting is shoddy or out of date.

Here’s an example of unit testing in the R statistical-programming language using the RUnit package. We have a file main.r in our current working directory. That file contains main(), our top-level function:

 
# main.r

# Load in the unit testing package
require(RUnit)

# Load in source files
source("string-utils.r")

# Function to run all unit tests (functions named test.*) in all
# R files in the current working directory
runUnitTests <- function(){
   cat("Running all unit tests (being functions that begin with 'test.')
        in all R files in the current working directory.")

   tests <- defineTestSuite("Tests", dirs=getwd(),
                            testFileRegexp = "^.*\\.[rR]$",
                            testFuncRegexp = "^test\\..+")

   test.results <- runTestSuite(tests)

   cat(paste(test.results$Tests$nTestFunc,    " test(s) run.\n",
             test.results$Tests$nDeactivated, " test(s) deactivated.\n",
             test.results$Tests$nFail,        " test(s) failed.\n",
             test.results$Tests$nErr,         " errors reported.\n",
             sep=""))

   if((test.results$Tests$nFail > 0) || (test.results$Tests$nErr > 0)){
      stop("Execution halted following unit testing. Fix the
	        above problem(s)!")
   }
}

main <- function(run.unit.tests=TRUE){
   if (run.unit.tests) runUnitTests()

   # Your code here...
}

The above code loads in from our current working directory the file string-utils.r:

 
# string-utils.r

# Load the unit testing package
require(RUnit)

# Function to trim a string
trim <- function(str){
   if (class(str) != "character"){
      stop(paste("trim passed a non string:", str))
   }

   return(gsub("^\\s+|\\s+$", "", str))
}
test.trim <- function(){
   checkTrue(trim("  abc ") == "abc")
   checkTrue(trim("a b ")   == "a b")
   checkTrue(trim(" a b")   == "a b")
   checkTrue(trim("")       == "")
   checkException(trim(3), silent=TRUE)
}
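
As a quick check while developing (our suggestion, assuming RUnit is installed), a single test function can also be run directly, without going through the whole suite:

 
# Source the file and call the test: the check functions throw an error
# on the first failed assertion
source("string-utils.r")
test.trim()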

We run our top-level function using:

rm(list=ls(all=TRUE)); source("main.r", echo=FALSE); main()

That line removes all variables from the workspace, creates the functions in the above blocks of code and calls main(). The first thing main() does is call runUnitTests() to run all functions with names that start with "test." in all R files in the current working directory. Those are our unit tests.

For example, one of those unit test functions is test.trim(), the function shown above that checks that trim() is working as it should. Note how test.trim() not only checks expected return values but makes sure that trim() throws exceptions when it should. And what does trim() do? The examples in the test code should make it clear—which is why I like to keep the unit tests together with the functions that they test.

The above is the briefest of introductions to a huge topic. I could say a lot more about, for instance, test-driven development, refactoring and code coverage. But my aim here is not that ambitious. If you’re an analyst or a statistician, chances are you haven’t previously heard of unit testing. If that’s the case, I merely wish to suggest that you give the above a try the next time you find yourself coding in R. Unit testing really is worth the effort.

Author: Peter Rosenmai

Posted on January 19, 2014 by Danielle Mosimann

Is Outsourcing The 5th Generation Programming Language?

I think it’s fair to say that I am something of a sceptic when it comes to outsourcing IT. In fact, I wrote my university dissertation on the subject and concluded that it only works for static IT functions (such as running a telephone system) and even then it is not without some significant flaws. It may therefore come as a surprise that I am a recent convert to the potential for outsourcing some of the most dynamic software development undertaken by a company.

The reason for this change of heart is a change in outsourcing itself. My dissertation examined the outsourcing proposition of the nineties, which was typically based on a handshake between a large corporate and an outsourcing giant. A director would decide that their IT department was a black hole for overheads and that the numbers offered by the outsourcing agency were considerably more favourable. They would sign a contract, sit back, and wait for their bottom line to soar. Sadly, many are still waiting.

The traditional model of outsourcing is still out there, and my attitude towards it remains unchanged. However, I have recently been introduced to a new approach, and it seems much more powerful. New websites offer the ability to find and recruit developers on an individual basis for modules of work as small or as large as the hiring company desires. They act like an international IT-skills dating agency, allowing overworked IT departments to find underworked professionals from around the world with the exact skills they require for a specific task. The sites offer facilities to ensure that the hired professional can only bill for the time spent working, and that they receive fair payment for that time. This allows both sides to work in a flexible manner that suits them. All this is a far cry from the traditional outsourcing model.

Outsourcing

You may wonder why, in the title of this blog, I compare outsourcing to a programming language in itself; for that we need a (VERY) brief walk through the history of programming languages. Originally, computers could only be programmed in 1s and 0s, but this process was soon simplified by what have since become known as 2nd generation languages, most notably Assembler. A further level of abstraction gave us 3rd generation languages (3GL) such as C, making programming faster and easier; it is in 3GL that most programming happens today. In the eighties and nineties the idea of a 4GL was popular, the intention being that for specific areas of operation the language could be so abstracted that a non-programmer could use it. There was some success with this, which naturally led on to the concept of fifth generation languages (5GL), in which a non-programmer sets the required constraints and the algorithms are generated for them.

In its truest sense, 5GL is heavily rooted in Artificial Intelligence research; however, its aspirations are achieved through modern outsourcing techniques. I write tasks into project management software in plain English, and a team of remote developers picks them off and completes them one by one. So I can write “I would like my form to be blue” and – after a short wait – the form is blue. It certainly feels like a fifth generation language to me.

Author: John Kiernander

Posted on October 24, 2013 by Danielle Mosimann

Is BI Software A Big Game?

Tableau Public, Mark Types & the classic Amiga/PC/console management title ‘Theme Park’

After seeing a demo of some of the latest BI (Business Intelligence) software, and how it uses mapping or geocoding software, I was reminded of some of the computer games I grew up playing.

I’ve recently seen demos of various BI tools that utilise mapping extensions. The results are really impressive and fantastic at highlighting interesting areas of your business related to a geographical area. However, as I sat through these demos I began thinking I’d seen this sort of thing before – probably about 15 years ago. They all remind me of computer games that have been around for ages; the genre was top-down strategy, if I remember correctly. Games such as SimCity and Theme Park jump out at me as being of a similar style. You’d have a map of your area of interest – whether building a city, running a theme park or conquering the world – and you’d deal with scarce resources, make decisions based on those restrictions and then react to the outcomes. Sound familiar? Maybe this isn’t a surprising observation. The people developing BI tools are from a generation that grew up playing these games – actually, many are probably from a generation that plays a more modern equivalent, but I’m sure the point holds. Many managers who use BI tools probably grew up with the interfaces of top-down strategy games too, so does this mean modern BI software development has been driven by some rather old computer games?

At AlignAlytics we actually run a workshop programme called ‘The Game’ in conjunction with COGNOS. In simple terms, it uses the COGNOS BI platform to give you access to a fictional company’s performance data and the markets it operates in. Utilising its dashboards, you learn about the products the company sells, how its sales are spread geographically and what resources you have available. You then create a strategy, make choices about resource allocation and see how your decisions pan out. You can then forge on with your original strategy or react to the outcomes of your original choices. As I went through the workshop I was reminded of another classic computer game series – Championship Manager. Here you play a football manager with limited money: you think about the style you want your team to play (your strategy), buy your players, set your tactics and play your match. After this you can continue to tweak tactics or stay consistent with your original plan. Again, you see similar themes and interfaces between BI tools and popular computer games.

None of these comparisons is derogatory to BI software. The two are essentially doing the same thing – trying to give you access to as much information as possible in the simplest, most presentable way. The complication for BI software is that the underlying datasets are often much more complex, so much more attention is paid to the back-end crunching of data than to the front-end interfaces. Recently, however, BI software does seem to have made a leap forward in the standard of its front-end dashboards, suggesting that companies now see polished interfaces as an important way of driving effective decision making, alongside the back-end tool kit. In the 1994 computer game ‘Theme Park’ the gamer had to build various rides on an empty plot of land, then hire staff, attract customers and run a profitable park. You’d see your workers walking to rides to fix or clean them, and customers would come and go based on the quality of the park.

Perhaps this 18-year-old game gives some clues as to where BI software is going. Could management be looking at real-time views of sales reps moving around city maps, trying to get to client sites before competitor reps!? As this happens, could analysts already be tweaking prices and bundling products into contracts to win the deal!? Would customers be seen leaving in droves because of poor customer service!? All this would make business management sound quite fun, with perhaps the main caveat that the concept of having three lives might not transfer as easily from the game world to the real world – or would it?

Author: Gus Urquhart

Posted on August 12, 2013 by Danielle Mosimann