We interviewed Amit and Adam from our Advanced Analytics Innovation Lab – a couple of our data science leaders who have been involved in our big data journey. They discuss battle scars and what they have learned, from the 4 Vs, data capture, storage and security through to Hadoop, Redshift and Elasticsearch.
Q: What do you think has been the biggest challenge? Has it been the integration and unification of some of that data? And the different variety of data we’re dealing with? Is it being able to handle it fast? Is it being able to store it?
I think the first challenge was actually getting our heads around where the best place to start was. There are a lot of buzzwords and a lot of different ideas which are all related to big data, and each project is different. It took us a long time to understand all the different components that make up that ecosystem and to decide which bits to use first. So when you go into a certain type of project, should you start with Hadoop or should you start with Elasticsearch? They do different things. It’s getting yourself up that learning curve of what each of those niches is and where you use each of them for different things. That was a real learning curve for us. After two years we now feel that we’re there: we’ve done a lot with different systems and feel a lot more comfortable choosing the right tools for the job.
Would you choose Hadoop? Would you choose Redshift? Or, if you’re doing visualisation, maybe you base it on Elasticsearch because of its speed: the quick return of queries and things like that.
“Ecosystem” is a great description for working with big data because it’s not a linear process. It’s an ecosystem you have to grow, and there isn’t one tool that solves everything. As we’ve been evaluating different tools, they’ve been on a journey too. Like Elasticsearch: they’ve come a long way from when we first started using them. They’ve learnt from use cases like ours, in which they saw our speed issues, our storage issues and some of the ad hoc querying challenges as well. But then in terms of ecosystem, we have a lot of SQL analysis tools here as well, and people who know SQL. So we have played quite a bit with Redshift as well.
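To illustrate the kind of ad hoc querying and dashboard-speed aggregation being discussed, here is a minimal sketch (in Python, with purely hypothetical index and field names) of an Elasticsearch aggregation request body — the sort of query whose fast return makes Elasticsearch attractive for interactive visualisation:

```python
import json

# Hypothetical field/index names, for illustration only.
# A terms aggregation: count events per category over the last 7 days.
# Bodies like this are POSTed to /<index>/_search on an Elasticsearch cluster.
query = {
    "size": 0,  # return only the aggregation, not individual documents
    "query": {
        "range": {"timestamp": {"gte": "now-7d/d"}}
    },
    "aggs": {
        "by_category": {
            "terms": {"field": "category.keyword", "size": 10}
        }
    },
}

print(json.dumps(query, indent=2))
```

Because the cluster pre-indexes the fields, aggregations like this typically come back quickly enough to back a live dashboard, which is the trade-off being described against batch-oriented tools like Hadoop.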
There are different stages in projects. We often start with getting things going really quickly and trying to understand things, and there are certain tools that are a lot quicker to get going with. Which tools we use depends on the client’s need and the problem we need to solve.
Q: So all of this is constantly changing – we’re constantly evaluating. But how has it been with clients? In my experience it’s been trying to get clients to understand that their data policies need to change: changing their perception of storage and retention so they are able to defensibly delete data they don’t need, but also capture things they might not have captured in the past, to give the rich story they want to read from their analysis. How has your experience been from a technical perspective?
Even beyond a technical perspective, it’s about understanding what data you need to capture and how long you need to store it for. We found that a lot of companies need to go through the process of working out what they need to get the best value from their data: what they need to keep, how they best store it, the security they need to put in place, and deciding on the teams to work with it as well. On the technical side, things like security are not as hard as deciding policies at the beginning and getting that structure in place. We found that going through that process of understanding what you need can be the most difficult thing.
Then the questions come: Why do we have to do certain things? Why do you want that much data? How are you actually going to transfer that data? So all 4 Vs come into play, not just for us to understand but for the client to understand.
Then there’s tackling how you use cloud computing if you have sensitive data. Do you mask the data? What actually is or isn’t sensitive? A company’s IT infrastructure team will have their policies, but how do they evaluate whether something is secure or not? It’s been an issue we’ve had to work through a number of times.
How you move these massive volumes of data between different cloud solutions, or even between regions within a cloud solution, can often be a problem if you have terabytes and terabytes of data. It can be pretty difficult when keeping security in mind and ensuring that process is secure. We came across some products and tools which can move absolute bucketloads of data in seconds, with security in place, which we wouldn’t have come across if we weren’t in this space. That sort of thing’s only possible once you get in there and go on that journey.
Q: How has the leap from traditional SQL to dealing with big data been?
Q: Are statistics capabilities easier with the big data ecosystem?
We have a number of stats guys, and they’re getting into that role of doing things not just at small scale but at large scale, using the whole data set available rather than sampling. There have definitely been advances in that area as well.
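The shift from sampling to whole-population analysis can be sketched with a toy example (Python standard library only; the data here is synthetic, invented for illustration):

```python
import random
import statistics

random.seed(42)

# Synthetic "population" of a million values, e.g. transaction amounts.
# With big-data tooling, summaries can be computed over the full set
# instead of being estimated from a sample.
population = [random.gauss(100, 15) for _ in range(1_000_000)]

# Traditional approach: estimate the mean from a small random sample.
sample = random.sample(population, 1_000)
sample_mean = statistics.fmean(sample)

# Big-data approach: compute the mean over the entire population.
full_mean = statistics.fmean(population)

# The sample estimate carries sampling error that the full scan avoids.
print(f"sample mean: {sample_mean:.2f}")
print(f"full mean:   {full_mean:.2f}")
```

In practice the full-set computation would run on a distributed engine rather than a single Python list, but the statistical point is the same: the sampling error disappears when you can afford to scan everything.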
In terms of big data, there are those who say that everyone’s talking about it but nobody really knows how to do it. I like to think we’ve graduated.