Big Data Journey: A few battle scars later


We interviewed Amit and Adam from our Advanced Analytics Innovation Lab, two of our data science leaders who have been involved in our Big Data Journey. They discuss battle scars and what they have learned, from the 4 Vs, data capture, storage and security to Hadoop, Redshift and Elasticsearch.

A Big Data Interview

Q: What do you think has been the biggest challenge? Has it been the integration and unification of some of that data? And the different variety of data we’re dealing with? Is it being able to handle it fast? Is it being able to store it?

I think the first challenge was actually getting our heads around where the best place to start was. There are quite a lot of buzzwords and quite a lot of different ideas which are all related to big data. Each project is different. It took us a long time to understand all the different components that make up that ecosystem and decide which bits to use first. So when you go into a certain type of project, should you start with Hadoop or should you start with Elasticsearch? They do different things. It’s getting yourself up that learning curve of what each of those niches is and where you use each tool. That was a real learning curve for us. After two years we now feel that we’re there: we’ve done a lot with different systems and feel a lot more comfortable choosing the right tools for the job.

Would you choose Hadoop? Would you choose Redshift? Or if you’re doing visualisation, maybe you choose to base it on Elasticsearch because of its speed, how quickly the queries return and things like that.

“Ecosystem” is a great description for working with big data because it’s not a linear process. It’s an ecosystem you have to grow, and there isn’t one tool that solves everything. The tools we’ve been evaluating have been on a journey too. Elasticsearch, for example, has come a long way since we first started using it. They’ve learnt from our use cases, where they saw our speed issues, our storage issues and some of our ad hoc querying as well. In terms of the ecosystem, we also have a lot of SQL analysis tools here, and people who know SQL, so we have played quite a bit with Redshift as well.

There are different stages in projects. We often start by getting things going really quickly and trying to understand things, and there are certain tools that are a lot quicker to get going with. The tools we use depend on the client’s need and the problem we need to solve.

Q: So all of this is constantly changing – we’re constantly evaluating. But how has it been with clients? In my experience it’s been trying to get clients to understand that their data policies need to change. Changing their perception of storage and retention so they are able to defensively delete certain aspects that they don’t need. But also being able to capture things that they might not have captured in the past so as to give a rich story, what they want to read from their analysis. How has your experience been from a technical perspective?

Even beyond a technical perspective, it’s about understanding what data you need to capture and how long you need to store it for. We found that a lot of companies need to go through the process of understanding what they need in order to get the best value from their data: what they need to keep, how best to store it, the security they need to put in place, and which teams will work with it. On the technical side, things like security are not as hard as deciding policies at the beginning and getting that structure in place. We found that going through that process of understanding what you need can be the most difficult thing.

The questions come. Why do we have to do certain things? Why do you want that much data? How are you actually going to transfer that data? So all 4 Vs come into play, not just for us to understand but for the client to understand.

Tackling how you use cloud computing if you have sensitive data. Do you mask the data? What actually is or what isn’t sensitive? The IT infrastructure of a company will have their policies but how do they evaluate if something is secure or not? It’s been an issue we’ve had to work through a number of times.
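As one illustration of the masking question, here is a minimal Python sketch that replaces sensitive fields with keyed hashes before data leaves the company's own infrastructure. It is not the approach described in the interview: the field names, the record layout and the helper names `mask_value` and `mask_record` are all assumptions made for the example, and which fields count as sensitive remains a policy decision.

```python
import hashlib
import hmac

# Hypothetical field names; which fields are sensitive is a policy decision.
SENSITIVE_FIELDS = {"email", "phone"}


def mask_value(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) that stands in for the raw value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict, key: bytes, sensitive=SENSITIVE_FIELDS) -> dict:
    """Copy the record, replacing sensitive fields with their masked tokens."""
    return {
        name: mask_value(str(val), key) if name in sensitive else val
        for name, val in record.items()
    }


if __name__ == "__main__":
    # Illustrative key only; a real key would be kept out of source code.
    key = b"example-key"
    row = {"customer_id": 42, "email": "jane@example.com", "spend": 120.5}
    print(mask_record(row, key))
```

Because the same input and key always give the same token, masked records can still be joined and counted, while the raw value cannot be read back from the token without the key.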

Moving these massive volumes of data between different cloud solutions, or even between regions within a cloud solution, can often be a problem if you have terabytes and terabytes of data. It can be pretty difficult when you are keeping security in mind and ensuring that the process is secure. We came across some products and tools which can move absolute bucketloads of data in seconds, with security in place, which we wouldn’t have come across if we weren’t in this space. That sort of thing’s only possible once you get in there and go on that journey.

Q: How has the leap from traditional SQL to dealing with big data been?

We’ve been through a bit of a transition here. Other people have been doing a lot of work on different technologies which aren’t directly related to the big data space. JavaScript visualisation, for example, definitely complements our work. There’s a lot of different work going on in various areas that all comes together in the same sort of field. We’ve progressed as a company in massive leaps and bounds.

Q: Are statistics capabilities easier with the big data ecosystem?

We have a number of stats guys and they’re getting into that role of doing things not just at small scale but at large scale, using the whole data set available rather than sampling. There have definitely been advances in that area as well.

In terms of Big Data, there have been those who say that everyone’s talking about it but nobody really knows how to do it. I like to think we’ve graduated.

In the last 2 years we’ve gone from talking conceptually about big data and all these tools to ticking a lot of different boxes in terms of use cases. The fact that we have different teams working on different projects means we get a large variety of use cases. In that sense we almost know what we’re talking about. The reason I say almost is because it’s an ongoing journey. There are certain things we haven’t touched upon yet and we can always do more!
