Over the last 3-5 years big data technologies have become an increasingly important factor in analytics. A few years ago our company knew that as an advanced analytics provider it was integral for our skills to include working with big data. In order to achieve our best analysis we had to move away from traditional SQL to unstructured data and the team explored different platforms which would enable us to do this. Elasticsearch stood out initially due to it being structure agnostic and its ability to store ALL types of data and so we embarked on a journey of learning with this application. By investing our time and skills into Elasticsearch our company was investing in our long-term abilities, as success in the ever-changing analytics landscape requires growth alongside your tools.
Over the years of using Elasticsearch both our company and Elasticsearch have vastly improved their capabilities and so this piece will cover the key projects in our journey and how Elasticsearch facilitated them. Additional upcoming blog posts will individually explore these big data projects and the multitude of benefits and features of Elasticsearch that enabled them in more detail.
ES is used to represent the term Elasticsearch.
Elasticsearch as a Data Store (2012)
Until a big data proof of concept for a travel industry client we had mostly used SQL. However, SQL has significant cost implications when scaling up and we needed a highly concurrent application to quickly write to a data store without costing an arm and a leg.
CouchDB could get the data in but it wasn’t easy to extract the data in a meaningful way. Although exporting the data from ES (version ~0.10) was a challenge, as a data store it was beneficial in a number of ways:
- Easy to write data
- Brilliant search functionality
- Open source
Elasticsearch as a Reporting Store (2013)
In an analytics and audit project for another online travel agency we analysed web server logs to work out their conversions and success rate from their enormous amount of online data. We needed to, not only write data, but to import, analyse and report 250 million records of unstructured data without, once again, costing an arm and a leg.
From using ES (~0.19) as a reporting store, the team discovered that ES:
- Allows for a high rate of ingestion
- Keeps data compact and therefore storage costs low
- Only enables analysis of data by means of “facets” – an aggregated level of data based on a search query
With limited means of analysis within ES we moved the relevant data to RedShift in order to analyse and segment it. Although this was a high cost option and made up to 10%-15% of our total project cost. However, when it came to reporting we returned to our beloved platform using both ES and their dashboard, Kibana, to provide high standard reporting for the project.
Elasticsearch as an Analytical Store (2014)
We needed to find a better way to analyse the necessary data within ES and so the team developed two different Python libraries for internal analysis using ES, Python and Pandas. One, Pylastic, used ES as a wrapper for Python and the other, Pandastic, used Pylastic as a wrapper for Pandas. We gained:
- A unified data layer
- Simplified querying. SQL instead of JSON for our data scientists.
- Bespoke terms to better fit our company, taking away Elasticsearch jargon.
- Easy data extraction.
All our data writing, storing, reporting and analysis for big data projects was now possible to achieve using ES which enabled us to deliver our results to clients faster as we didn’t have to move data around as much as in the past.
Elasticsearch and GeoSpatial Analysis
Another big data project was for a client who wanted a database for their salesforce in order to provide knowledge of where to distribute their products in Nigeria. The project involved mapping all outlets in this country. Interviewers in the field had hand devices which recorded surveys, geo locations and images and their area coverage was targeted. The team had to take this vast amount of recorded data and store and analyse it in order to put the database together. Once again ES was chosen over SQL due to:
- Presence of multiple data sources, including survey data, log data and images. ES can store ALL types of data.
- A requirement for a large number of fields/columns. (>1000)
- Geospatial features which supported the necessary geospatial queries. For example, we were able to specify geopoint & polygon data types in ES.
- Evolving data. ES had no problem with the changing survey data, such as adding new fields, throughout the project.
- ES could support “live reporting” and fast real time querying.
- We expected the reporting data to be BIG and ES was well placed to manage this volume. Currently there are over 300 million records and we have only ~1/5 of the country covered.
As ES had become integral to our data and analysis work we were constantly looking into ways in which to improve performance and thus more rapidly deliver valuable insights to our clients. The team engaged in an ES benchmarking exercise with Bigstep’s Full Metal Cloud infrastructure and ran ES queries on 10 million documents (approx 4GB of compressed data). We knew the metal cloud would perform better compared to traditional cloud based dedicated servers but the metal cloud performance results were incredible. Results were consistently 100-200% better than existing infrastructure. As the queries became more complex, such as the geo distance calculations for our geospatial analysis, the positive performance difference was highlighted even further.
Main factors for Bigstep’s superior performance:
- Wire-speed network
- Hand-picked components
- All-SSD storage based on enterprise drives
An Elasticsearch Future
Reviewing our journey with ES demonstrates that ES has been able to deliver on a breadth of projects and empowered us to vastly improve our big data and unstructured data analysis knowledge and skills. It has become a crucial tool in our team and will continue to be so in the future as both our capabilities progress to be even more advanced. Although we’re not really the type of company to make bold proclamations we would probably call ourselves Elasticsearch champions, as our team can’t seem to recommend it enough.
Author: Danielle Mosimann