Category: Elasticsearch

The Changing Face of Vendor Analytics

Our most recent vendor project was an interesting change of direction compared with several vendor-related projects we had previously worked on. We were asked to build out a vendor reporting capability that went beyond simple spend analytics and also brought in data from online sources such as Twitter, Google, Bloomberg, Reuters and Facebook.

This project surfaced interesting trends, not just in vendor analytics but also in how the datasets that underpin traditional reporting areas such as sales and budgeting are likely to expand in scope to include more and more data from online sources.

Some companies are constantly meeting vendors, and they need to make sure that they are asking the right questions and signing off on the right deals in those meetings. For this they need their staff to understand more than just the historical spending with the vendor. They need to know how that company is represented in the news, what key events the vendor has been involved in, such as mergers or financial results, and what people are saying about them.

Historically, BI solutions have focused on summarizing and visualizing the internal data side of the business – sales, spending, CRM… Users would then supplement this with their own knowledge and research of customers, competitors and suppliers to build up an understanding of their environment. On this project, however, our aim was to improve how users gather information from research. To achieve this, a BI solution needs to capture a wide variety of data sources, which are then analysed, aggregated and presented back to the user in an easily understood way. By automatically combining this with spend data, you can also allow users to better understand the relationships between datasets.

The New Role of BI Solutions

In order to pull all this together, there are four key areas that need to be built out:

    • Text mining – you need a way of summarizing the large amounts of unstructured content brought in from online data sources – after reviewing several options we went with AlchemyAPI.
    • Data storage – a NoSQL store is needed to hold and analyse the large volumes of unstructured content collected from those sources – here we used Elasticsearch.
    • Data mashing – a more traditional database layer is needed to combine summary results from the unstructured data with internal vendor spend data – for this we stuck with SQL Server (a minimal sketch of this step follows the list).
    • Reporting layer – to deliver the solution we used Tableau to create a series of reports that allowed users to interact with the combined data.
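As an illustration of the data-mashing step, here is a minimal Python sketch of joining summarised sentiment results onto vendor spend data. All table and column names are hypothetical; in the actual solution this join lived in SQL Server.

    import pandas as pd

    # Hypothetical vendor spend data (in the real solution this sat in SQL Server).
    spend = pd.DataFrame({
        "vendor": ["Acme", "Globex"],
        "annual_spend": [1200000, 450000],
    })

    # Hypothetical per-vendor summary of the mined online content.
    sentiment = pd.DataFrame({
        "vendor": ["Acme", "Globex"],
        "avg_sentiment": [0.31, -0.12],
        "article_count": [42, 17],
    })

    # Mash the two together so spend and online signals sit side by side.
    combined = spend.merge(sentiment, on="vendor", how="left")
    print(combined)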

 

Our final architecture looked something like this:

Data flow architecture

 

This project has led to several useful findings:

    • Overall, the area of vendor analytics is enhanced by blending the spend data with online data sources. Events such as a vendor being acquired by another company, a successful project collaboration or a sales event need to be visible in the tool’s output.
    • The ability of AlchemyAPI to mine insights from text content is critical. This includes sentiment analysis but also tackles entity extraction – the process of relating people, places, companies and events to articles.
    • With AlchemyAPI you don’t have to store the content of every article (one of the main reasons we chose it as the best tool for text analytics). You can simply send AlchemyAPI the URL of the relevant article and it analyses the content – other solutions require you to capture the full content of an article and send it to their applications (see the sketch after this list).
    • Elasticsearch delivers what is needed from a NoSQL database with its flexibility to store and analyse large-scale unstructured data from multiple sources. Its ability to allow multiple processes to collate and analyse data simultaneously, in real time, gives it significant advantages over other data storage solutions.
    • Having built several solutions in Tableau we are aware of its traditional strengths. However, for this kind of project it is the ability to store web links in a dashboard, which users can then access, that is particularly useful. So if a spike in negative sentiment occurs for a supplier, a user can quickly navigate from a trend chart in Tableau to a summary of the article content, also stored in Tableau, and ultimately to the most useful articles online.
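To make the URL-based workflow concrete, here is a minimal Python sketch of the kind of calls involved. The endpoint names reflect AlchemyAPI’s public REST API of the time, but treat the details as illustrative: the API key and article URL are placeholders, and response handling is simplified.

    import requests

    API_KEY = "YOUR_ALCHEMYAPI_KEY"                      # placeholder
    ARTICLE = "http://example.com/news/vendor-merger"    # placeholder article URL

    # Sentiment: AlchemyAPI fetches and analyses the article itself - we only send the URL.
    resp = requests.get(
        "http://access.alchemyapi.com/calls/url/URLGetTextSentiment",
        params={"apikey": API_KEY, "url": ARTICLE, "outputMode": "json"},
    )
    print(resp.json().get("docSentiment"))  # e.g. {"type": "negative", "score": "-0.42"}

    # Entity extraction: relate people, places, companies and events to the article.
    resp = requests.get(
        "http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities",
        params={"apikey": API_KEY, "url": ARTICLE, "outputMode": "json"},
    )
    for entity in resp.json().get("entities", []):
        print(entity["type"], entity["text"])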

 

In conclusion, we found that the area of vendor analytics can be enhanced by combining traditional spend data with online content. Combining unstructured online data with spend and sales data is likely to become the norm in future BI developments as companies seek to fill the gaps that internal data cannot cover on its own.

 

Author: Angus Urquhart

Posted on March 1, 2016 by Danielle Mosimann

An Elasticsearch Journey

Over the last three to five years, big data technologies have become an increasingly important factor in analytics. A few years ago our company recognized that, as an advanced analytics provider, working with big data had to become one of our core skills. To achieve our best analysis we had to move beyond traditional SQL to unstructured data, and the team explored different platforms that would enable us to do this. Elasticsearch stood out initially because it is structure agnostic and can store ALL types of data, and so we embarked on a journey of learning with this application. By investing our time and skills into Elasticsearch our company was investing in our long-term abilities, as success in the ever-changing analytics landscape requires growth alongside your tools.

Over our years of using Elasticsearch, both our company and Elasticsearch have vastly improved their capabilities, so this piece covers the key projects in our journey and how Elasticsearch facilitated them. Upcoming blog posts will explore these big data projects individually, along with the many benefits and features of Elasticsearch that enabled them, in more detail.

Throughout this post, ES is used as shorthand for Elasticsearch.

Elasticsearch as a Data Store (2012)

Until a big data proof of concept for a travel industry client, we had mostly used SQL. SQL, however, has significant cost implications when scaling up, and we needed a highly concurrent application that could write quickly to a data store without costing an arm and a leg.

CouchDB could get the data in, but it wasn’t easy to extract the data in a meaningful way. Although exporting the data from ES (version ~0.10) was a challenge, as a data store it was beneficial in a number of ways (a minimal indexing sketch follows the list):

  • Easy to write data
  • Brilliant search functionality
  • Open source
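As a rough illustration of how simple writing data was, here is a minimal Python sketch against a local node over the plain REST API. The index and type names are made up, and the response shapes noted in the comments reflect the old 0.x-era API.

    import requests

    # Index (write) a document - no schema needs to be declared up front.
    doc = {"session": "abc123", "page": "/search", "ts": "2012-05-01T12:00:00"}
    resp = requests.post("http://localhost:9200/weblogs/event", json=doc)
    print(resp.json())  # old ES replies with the generated _id for the document

    # Search is just as direct: a Lucene query string over everything indexed so far.
    resp = requests.get(
        "http://localhost:9200/weblogs/_search",
        params={"q": "page:search"},
    )
    print(resp.json()["hits"]["total"])  # hit count (an integer in 0.x-era versions)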

Elasticsearch as a Reporting Store (2013)

In an analytics and audit project for another online travel agency, we analysed web server logs to work out conversions and success rates from their enormous amount of online data. We needed not only to write data, but to import, analyse and report on 250 million records of unstructured data – without, once again, costing an arm and a leg.

From using ES (~0.19) as a reporting store, the team discovered that ES:

  • Allows for a high rate of ingestion
  • Keeps data compact and therefore storage costs low
  • Only enables analysis of data by means of “facets” – an aggregated level of data based on a search query (a sample facet query follows the list)
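For illustration, here is a minimal Python sketch of a terms facet of that era, counting documents per value of a field on top of a search query. The index and field names are invented, and facets were later replaced by aggregations in ES.

    import requests

    # A terms "facet": count documents per value of a field, on top of a search query.
    # Index and field names are illustrative only.
    query = {
        "query": {"match_all": {}},
        "facets": {
            "status_codes": {"terms": {"field": "status"}}
        },
    }
    resp = requests.post("http://localhost:9200/weblogs/_search", json=query)
    for bucket in resp.json()["facets"]["status_codes"]["terms"]:
        print(bucket["term"], bucket["count"])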

With limited means of analysis within ES, we moved the relevant data to Redshift in order to analyse and segment it. This was a high-cost option, making up 10%-15% of our total project cost. When it came to reporting, however, we returned to our beloved platform, using both ES and its dashboard, Kibana, to provide high-standard reporting for the project.

Elasticsearch as an Analytical Store (2014)

We needed a better way to analyse the necessary data within ES, so the team developed two Python libraries for internal analysis using ES, Python and Pandas. One, Pylastic, was a Python wrapper around ES; the other, Pandastic, wrapped Pylastic for use with Pandas. We gained:

  • A unified data layer
  • Simplified querying – SQL-style syntax instead of JSON for our data scientists
  • Bespoke terms to better fit our company, taking away Elasticsearch jargon
  • Easy data extraction

Pylastic Example: on the left, a sample raw Elasticsearch JSON query; on the right, the same query written in Pylastic.
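To give a flavour of the simplification (Pylastic itself is an internal library, so the wrapper syntax below is invented purely for illustration), compare a raw JSON query with the kind of one-liner a wrapper can offer:

    import requests

    # Raw Elasticsearch JSON query: verbose, deeply nested, easy to mistype by hand.
    raw_query = {
        "query": {
            "bool": {
                "must": [
                    {"term": {"vendor": "acme"}},
                    {"range": {"spend": {"gte": 10000}}},
                ]
            }
        }
    }
    requests.post("http://localhost:9200/vendors/_search", json=raw_query)

    # Hypothetical Pylastic-style equivalent (invented syntax, for illustration only):
    # results = Table("vendors").where(vendor="acme").where(spend__gte=10000).fetch()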

All our data writing, storage, reporting and analysis for big data projects could now be done with ES, which enabled us to deliver results to clients faster, as we no longer had to move data around as much as in the past.

Elasticsearch and Geospatial Analysis

Another big data project was for a client who wanted a database for their sales force, providing knowledge of where to distribute their products in Nigeria. The project involved mapping all outlets in the country. Interviewers in the field carried handheld devices that recorded surveys, geolocations and images, and their area coverage was targeted. The team had to store and analyse this vast amount of recorded data in order to put the database together. Once again ES was chosen over SQL due to:

  • Presence of multiple data sources, including survey data, log data and images. ES can store ALL types of data.
  • A requirement for a large number of fields/columns (>1,000).
  • Geospatial features which supported the necessary geospatial queries. For example, we were able to specify geo-point and polygon data types in ES (a sample geo query follows this list).
  • Evolving data. ES had no problem with the changing survey data, such as adding new fields, throughout the project.
  • ES could support “live reporting” and fast real-time querying.
  • We expected the reporting data to be BIG and ES was well placed to manage this volume. Currently there are over 300 million records and we have only ~1/5 of the country covered.
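As an illustration of the geospatial querying involved, here is a minimal Python sketch of a geo_distance query in the 1.x-era syntax (the "filtered" wrapper was removed in later versions). The index name, field name and coordinates are placeholders.

    import requests

    # Find surveyed outlets within 5 km of a point; "location" is mapped as geo_point.
    query = {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "geo_distance": {
                        "distance": "5km",
                        "location": {"lat": 6.5244, "lon": 3.3792},  # central Lagos
                    }
                },
            }
        }
    }
    resp = requests.post("http://localhost:9200/outlets/_search", json=query)
    print(resp.json()["hits"]["total"])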

Interviewer Summary Report
Interviewer Summary Report: shows the path and the recorded stores for each interviewer by day.

Elasticsearch Benchmarking

As ES had become integral to our data and analysis work, we were constantly looking for ways to improve performance and thus deliver valuable insights to our clients more rapidly. The team engaged in an ES benchmarking exercise with Bigstep’s Full Metal Cloud infrastructure, running ES queries on 10 million documents (approx 4GB of compressed data). We knew the metal cloud would perform better than traditional cloud-based dedicated servers, but the results were incredible: consistently 100-200% better than on our existing infrastructure. As the queries became more complex, such as the geo distance calculations for our geospatial analysis, the performance difference grew even larger.
The main factors behind Bigstep’s superior performance (a crude sketch of the kind of timing harness used is shown after the list):

  • Wire-speed network
  • Hand-picked components
  • All-SSD storage based on enterprise drives
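For reference, here is a minimal Python sketch of how such query timings can be gathered. The endpoint and query are placeholders, and a real benchmark would also control for caching, warm-up and concurrency.

    import time
    import requests

    URL = "http://localhost:9200/docs/_search"          # placeholder endpoint
    QUERY = {"query": {"match": {"body": "vendor"}}}    # placeholder query

    # Run the same search repeatedly and report latency percentiles.
    timings = []
    for _ in range(100):
        start = time.perf_counter()
        requests.post(URL, json=QUERY)
        timings.append(time.perf_counter() - start)

    timings.sort()
    print("median: %.1f ms" % (timings[len(timings) // 2] * 1000))
    print("p95:    %.1f ms" % (timings[int(len(timings) * 0.95)] * 1000))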

An Elasticsearch Future

Reviewing our journey with ES shows that it has delivered across a breadth of projects and empowered us to vastly improve our big data and unstructured data analysis knowledge and skills. It has become a crucial tool for our team and will continue to be so as our capabilities become even more advanced. Although we’re not really the type of company to make bold proclamations, we would probably call ourselves Elasticsearch champions, as our team can’t seem to recommend it enough.

Author: Danielle Mosimann

Posted on June 6, 2015 by Danielle Mosimann