Category: Big Data

An Elasticsearch Journey

Over the last three to five years, big data technologies have become an increasingly important part of analytics. A few years ago our company recognised that, as an advanced analytics provider, working with big data had to become one of our core skills. To deliver our best analysis we had to move beyond traditional SQL to unstructured data, and the team explored different platforms that would enable us to do this. Elasticsearch stood out initially because it is structure agnostic and can store ALL types of data, and so we embarked on a journey of learning with this application. By investing our time and skills in Elasticsearch, our company was investing in its long-term abilities, as success in the ever-changing analytics landscape requires growing alongside your tools.

Over our years of using Elasticsearch, both our company and Elasticsearch itself have vastly improved their capabilities, so this piece covers the key projects in our journey and how Elasticsearch facilitated them. Upcoming blog posts will explore each of these big data projects, and the many benefits and features of Elasticsearch that enabled them, in more detail.

Throughout this post, ES is used as shorthand for Elasticsearch.

Elasticsearch as a Data Store (2012)

Until a big data proof of concept for a travel industry client, we had mostly used SQL. However, SQL has significant cost implications when scaling up, and we needed a highly concurrent application that could write quickly to a data store without costing an arm and a leg.

CouchDB could get the data in, but it wasn’t easy to extract the data in a meaningful way. Although exporting data from ES (version ~0.10) was a challenge, as a data store it was beneficial in a number of ways:

  • Easy to write data (see the sketch below)
  • Brilliant search functionality
  • Open source
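To give a flavour of the first point, here is a minimal sketch of writing a document to ES over its HTTP API using Python’s requests library. The index, type and field names are illustrative assumptions for the example, not details from the client project:

```python
import json

import requests

ES_URL = "http://localhost:9200"  # illustrative local cluster

# Index a booking document. ES creates the index and infers a
# mapping on the fly, which is what made writing data so easy.
doc = {
    "customer_id": 42,
    "destination": "Lisbon",
    "booked_at": "2012-06-01T12:30:00",
}
resp = requests.post(ES_URL + "/bookings/booking", data=json.dumps(doc))
resp.raise_for_status()
print(resp.json())  # the response includes the auto-generated document _id
```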

Elasticsearch as a Reporting Store (2013)

In an analytics and audit project for another online travel agency, we analysed web server logs to work out conversions and success rates from their enormous amount of online data. We needed not only to write data but to import, analyse and report on 250 million records of unstructured data, once again without costing an arm and a leg.

From using ES (~0.19) as a reporting store, the team discovered that ES:

  • Allows for a high rate of ingestion
  • Keeps data compact and therefore storage costs low
  • Only enables analysis of data by means of “facets” – aggregations computed over the results of a search query (see the sketch below)
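For illustration, here is a minimal sketch of a terms facet request in the pre-1.0 syntax, sent over the HTTP API with Python’s requests library. The index and field names are assumptions for the sake of the example:

```python
import json

import requests

# Count log records per HTTP status code with a terms facet
# (the pre-1.0 predecessor of today's aggregations).
query = {
    "query": {"match_all": {}},
    "facets": {
        "status_codes": {
            "terms": {"field": "status", "size": 10}
        }
    },
    "size": 0,  # we only want the facet counts, not the hits themselves
}
resp = requests.post(
    "http://localhost:9200/weblogs/_search", data=json.dumps(query)
)
for bucket in resp.json()["facets"]["status_codes"]["terms"]:
    print(bucket["term"], bucket["count"])
```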

With limited means of analysis within ES, we moved the relevant data to Redshift in order to analyse and segment it. This was a high-cost option, accounting for 10-15% of our total project cost. When it came to reporting, however, we returned to our beloved platform, using ES and its dashboard, Kibana, to provide a high standard of reporting for the project.

Elasticsearch as an Analytical Store (2014)

We needed to find a better way to analyse the necessary data within ES, so the team developed two Python libraries for internal analysis using ES, Python and Pandas: Pylastic, a Python wrapper around ES, and Pandastic, which in turn wraps Pylastic for Pandas. We gained:

  • A unified data layer
  • Simplified querying – SQL instead of JSON for our data scientists
  • Bespoke terms to better fit our company, taking away Elasticsearch jargon
  • Easy data extraction

Pylastic Example
On the left, a sample raw Elasticsearch JSON query; on the right, the same query written in Pylastic.
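The original image is not reproduced here, but the idea ran roughly as follows. The raw JSON query below is real ES syntax of the era; the Pylastic equivalent is purely hypothetical, since the library was internal and its API is not shown in this post. It simply illustrates the kind of simplification the wrapper provided:

```python
import json

import requests

# Raw ES JSON: count bookings per destination for one customer,
# using a term query plus a terms facet (pre-1.0 syntax).
raw_query = {
    "query": {"term": {"customer_id": 42}},
    "facets": {
        "destinations": {"terms": {"field": "destination"}}
    },
}
resp = requests.post(
    "http://localhost:9200/bookings/_search", data=json.dumps(raw_query)
)

# Hypothetical Pylastic equivalent (illustrative API only, not the
# real internal library): the wrapper hides the JSON and transport.
# results = (
#     pylastic.index("bookings")
#     .filter(customer_id=42)
#     .facet("destination")
#     .run()
# )
```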

All our data writing, storage, reporting and analysis for big data projects could now be done with ES, which enabled us to deliver results to clients faster because we no longer had to move data around as much as in the past.

Elasticsearch and Geospatial Analysis

Another big data project was for a client who wanted a database for their sales force, providing knowledge of where to distribute their products in Nigeria. The project involved mapping all outlets in the country. Interviewers in the field carried handheld devices that recorded surveys, geolocations and images, and their area coverage was targeted. The team had to store and analyse this vast amount of recorded data in order to put the database together. Once again ES was chosen over SQL due to:

  • The presence of multiple data sources, including survey data, log data and images – ES can store ALL types of data
  • A requirement for a large number of fields/columns (>1000)
  • Geospatial features which supported the necessary geospatial queries – for example, we were able to specify geo_point and polygon data types in ES (see the sketch after this list)
  • Evolving data – ES had no problem with the changing survey data, such as new fields being added throughout the project
  • ES’s ability to support “live reporting” and fast real-time querying
  • Our expectation that the reporting data would be BIG, and ES was well placed to manage this volume – there are currently over 300 million records, with only ~1/5 of the country covered
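As an illustration, here is a minimal sketch of the kind of mapping and geo query involved, using the HTTP API from Python with ES 1.x-era syntax. The index, type and field names, and the coordinates, are assumptions for the example, not details from the project:

```python
import json

import requests

ES_URL = "http://localhost:9200"

# Map each outlet's location as a geo_point so that ES can run
# distance and polygon queries against it.
mapping = {
    "mappings": {
        "outlet": {
            "properties": {
                "name": {"type": "string"},
                "location": {"type": "geo_point"},
            }
        }
    }
}
requests.put(ES_URL + "/outlets", data=json.dumps(mapping))

# Find all recorded outlets within 5 km of a point in Lagos.
query = {
    "query": {
        "filtered": {
            "filter": {
                "geo_distance": {
                    "distance": "5km",
                    "location": {"lat": 6.5244, "lon": 3.3792},
                }
            }
        }
    }
}
resp = requests.post(ES_URL + "/outlets/_search", data=json.dumps(query))
print(resp.json()["hits"]["total"])
```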

Interviewer Summary Report
Interviewer Summary Report: shows the path and the recorded stores for each interviewer by day.

Elasticsearch Benchmarking

As ES had become integral to our data and analysis work, we were constantly looking for ways to improve performance and so deliver valuable insights to our clients more rapidly. The team ran an ES benchmarking exercise on Bigstep’s Full Metal Cloud infrastructure, executing ES queries over 10 million documents (approximately 4 GB of compressed data). We knew the metal cloud would perform better than traditional cloud-based dedicated servers, but the results were still incredible: consistently 100-200% better than our existing infrastructure. As the queries became more complex, such as the geo-distance calculations for our geospatial analysis, the performance gap widened even further. (A sketch of the kind of timing harness involved follows the list below.)
The main factors behind Bigstep’s superior performance:

  • Wire-speed network
  • Hand-picked components
  • All-SSD storage based on enterprise drives
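For context, here is a minimal sketch of how such a comparison might be timed from Python. The cluster endpoints are hypothetical placeholders, and the query reuses the geo-distance example from the previous section; this illustrates the approach, not the actual benchmark harness:

```python
import json
import statistics
import time

import requests

# Hypothetical endpoints for the two clusters being compared.
CLUSTERS = {
    "traditional-cloud": "http://cloud-es:9200",
    "full-metal": "http://metal-es:9200",
}

# A representative geo-distance query (ES 1.x-era syntax).
query = {
    "query": {
        "filtered": {
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "location": {"lat": 6.5244, "lon": 3.3792},
                }
            }
        }
    }
}

for name, url in CLUSTERS.items():
    timings = []
    for _ in range(20):  # repeat the query to smooth out noise
        start = time.perf_counter()
        requests.post(url + "/outlets/_search", data=json.dumps(query))
        timings.append(time.perf_counter() - start)
    print(name, "median %.1f ms" % (statistics.median(timings) * 1000))
```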

An Elasticsearch Future

Reviewing our journey with ES shows that it has delivered across a breadth of projects and empowered us to vastly improve our knowledge and skills in big data and unstructured data analysis. It has become a crucial tool for our team and will continue to be one as our capabilities grow even more advanced. Although we’re not really the type of company to make bold proclamations, we would probably call ourselves Elasticsearch champions, as our team can’t seem to recommend it enough.

Author: Danielle Mosimann

Posted on June 6, 2015 by Danielle Mosimann

Big Data Journey: A few battle scars later

We interviewed Amit and Adam from our Advanced Analytics Innovation Lab – two of our data science leaders who have been involved in our big data journey. They discuss battle scars and what they have learned, from the 4 Vs, data capture, storage and security to Hadoop, Redshift and Elasticsearch.

A Big Data Interview

Q: What do you think has been the biggest challenge? Has it been the integration and unification of some of that data? And the different variety of data we’re dealing with? Is it being able to handle it fast? Is it being able to store it?

I think the first challenge was actually getting our heads around where the best place to start was. There are quite a lot of buzzwords and quite a lot of different ideas which are all related to big data, and each project is different. It took us a long time to understand all the different components that make up that ecosystem and to decide which bits to use first. So when you go into a certain type of project, should you start with Hadoop or should you start with Elasticsearch? They do different things. It’s getting yourself up that learning curve of what each of those niches is and where you use them for different things. That was a real learning curve for us. After two years we now feel that we’re there: we’ve done a lot with different systems and feel much more comfortable choosing the right tools for the job.

Would you choose Hadoop? Would you choose Redshift? Or if you’re doing visualisation maybe you choose to base it on Elasticsearch because of the quickness? The speed of return of the queries and things like that.

“Ecosystem” is a great description for working with big data because it’s not a linear process. It’s an ecosystem you have to grow, and there isn’t one tool that solves everything. As we’ve been evaluating different tools, they’ve been on a journey too. Elasticsearch, for example, has come a long way from when we first started using it; they’ve learnt from use cases like ours, where they saw our speed issues, our storage issues and some of the ad hoc querying as well. But then, in terms of the ecosystem, we have a lot of SQL analysis tools here as well, and people who know SQL, so we have played quite a bit with Redshift too.

There are different stages in projects. We often start with getting things going really quickly and trying to understand things. And there are certain tools that are a lot quicker to get going with. Tools that are used are based on the client need and the problem we need to solve.

Q: So all of this is constantly changing – we’re constantly evaluating. But how has it been with clients? In my experience it’s been trying to get clients to understand that their data policies need to change. Changing their perception of storage and retention so they are able to defensively delete certain aspects that they don’t need. But also being able to capture things that they might not have captured in the past so as to give a rich story, what they want to read from their analysis. How has your experience been from a technical perspective?

Even beyond a technical perspective, it’s about understanding what data you need to capture and how long you need to store it for. We found that a lot of companies actually need to go through the process of working out what they need to get the best value from their data: what to keep, how best to store it, the security to put in place and the teams to work with it. On the technical side, things like security are not as hard as deciding the policies at the beginning and getting that structure in place. Going through that process of understanding what you need can be the most difficult thing.

The questions come. Why do we have to do certain things? Why do you want that much data? How are you actually going to transfer that data? So all 4 Vs come into play, not just for us to understand but for the client to understand.

Then there’s tackling how you use cloud computing when you have sensitive data. Do you mask the data? What actually is or isn’t sensitive? A company’s IT function will have its policies, but how does it evaluate whether something is secure or not? It’s an issue we’ve had to work through a number of times.

How you move these massive volumes of data between different cloud solutions, or even between regions within one cloud solution, can often be a problem if you have terabytes and terabytes of data. It can be pretty difficult when keeping security in mind and ensuring that process is secure. We came across some products and tools which can move absolute bucketloads of data in seconds, with security in place, which we wouldn’t have encountered if we weren’t in this space. That sort of thing is only possible once you get in there and go on that journey.

Q: How has the leap from traditional SQL to dealing with big data been?

We’ve been through a bit of a transition here. There has been a lot of work around different technologies which aren’t directly related to the big data space that other people have been working on, such as JavaScript visualisation, which definitely complements our work. There’s a lot of different work going on in various areas that all comes together in the same sort of field. We’ve progressed as a company in massive leaps and bounds.

Q: Are statistics capabilities easier with the big data ecosystem?

We have a number of stats guys, and they’re getting into the role of doing things not just at small scale but at large scale, using the whole data set available rather than sampling. There have definitely been advances in that area as well.

In terms of big data, there are those who say that everyone’s talking about it but nobody really knows how to do it. I like to think we’ve graduated.

In the last 2 years we’ve gone from talking conceptually about big data and all these tools to ticking a lot of different boxes in terms of use cases. The fact that we have different teams working on different projects means we get a large variety of use cases. In that sense we almost know what we’re talking about. The reason I say almost is because it’s an ongoing journey. There are certain things we haven’t touched upon yet and we can always do more!

Posted on March 18, 2015 by Danielle Mosimann

Analytics, Big Data and BI or: How I Learned To Stop Worrying And Love The Cricket

One of the challenges of working for a company like AlignAlytics is explaining exactly what it is that one does all day. Nothing scares off a new potential friend quicker than phrases such as ‘data-driven strategy and insight’, accompanied by some vague hand waving, especially if said hand waving usually sends drinks flying. Typically, after several failed attempts at explaining the concepts of customer segmentation and advanced analytics, the standard fallback response is that we spend our days doing reporting & analysis before moving the conversation on to more interesting topics, such as Justin Bieber turning up really late for gigs or the on/off relationship between those two miserable leads from the Twilight movies.

However, while analytics might appear a foreign concept to many people, the truth of the matter is that it has been part of people’s lives for a long time, even if they didn’t know about it. One particular area of modern life in which analytics is widely used is in sport, specifically in its coverage on television and via digital media. It’s also here that concepts such as big data can be most easily comprehended and explained.

One particular sport close to the heart of this particular author is cricket, a sport built entirely on large numbers of discrete data points. Every single time a ball is bowled, a huge number of different pieces of information are collected – how fast was the delivery? Where did it land? What shot did the batsman play? Which Australian did it dismiss? This is repeated for every ball bowled in every day of (almost) every match around the world. Before you know it we have a genuine example of this mythical big data that everyone has been talking about.

Of course, collecting data for data’s sake can be its own reward – apropos of nothing, nothing impresses a crowd like owning the entire set of classic Doctor Who DVDs – but it’s the interpretation of all this data that is really the key. Hence the proliferation of visual methods on TV and the internet to help commentators or writers provide insight and clarity, such as this example, a pitch map for a specific bowler:

Stuart Broad Pitch Map

Image Supplied by © Hawk-Eye Innovations

Suddenly, and without really thinking about it, we have analytics. And not just that, analytics based on big data. In order to get to those analytics we’ve used specific software to turn our data into something that we can visually comprehend and interpret. And that’s Business Intelligence (BI) software explained at the same time.

Of course, cricket isn’t the only sport to use these concepts. Football (or soccer as it’s occasionally known in the colonies) is a more recent convert to the idea of big data, albeit in a much more ‘closed shop’ way. The likes of Opta and Prozone provide enormous amounts of data around every single football match in the Premier League (and beyond), with every single pass, shot and run recorded in frightening detail. This data is generally not made available to the public, instead being closely guarded behind closed doors by those football clubs that use it (and largely ignored by those that don’t).

Recently however, Manchester City made large amounts of this data available, encouraging members of the public to do their own analysis and trying to create an ‘analytics community’ in which ideas could be shared. Whilst it’s possible to argue about their motives for this – why pay for an analytics team when hardcore fans will do it all for free and then you can steal their ideas? – it’s clear evidence of the growing significance of analytics (and big data) across different areas of everyday life.

To conclude, perhaps the best way to explain what one does all day is to talk about cricket and its approach to big data, analytics and BI. And then, after several hours of explaining the intricacies, such as the difference between the flipper and the topspinner, casually point out that AlignAlytics generally applies these concepts to the marginally less exciting worlds of consumer goods and utilities. We say generally because everyone needs a hobby for their free time, such as tracking Stuart Broad’s Test career over time:

Bowling Average vs Batting Average

Author: Ashley Michael

Posted on March 25, 2014 by Danielle Mosimann