Tuesday, May 9, 2017

The next stage of BigData

Right now, the terms BigData and Hadoop are used interchangeably, often as the buzzword of buzzwords. They mostly sound like a last call, often made by agencies to convince people to start the Hadoop journey before the train leaves the station. Don’t fall into that trap.

Hadoop was made by people who worked in the early internet industry, namely Yahoo. They crawled millions upon millions of web pages every day, but had no system to really benefit from this information. Doug Cutting created Hadoop, a Map/Reduce framework written in Java, following the blueprint Google published in 2004 (1). The main purpose was to work effectively with ultra-large data sets and, to simplify, group them by topic.
Hadoop is now 10 years old. In these 10 years, data management, wrangling and analysis have moved faster and faster. New approaches, tools and techniques emerge every day in the brain-centered areas called Something-Valley, all of them targeting the way we work and think with data.
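
To make the Map/Reduce idea concrete, here is a minimal sketch in plain Python. It is not Hadoop's API (Hadoop itself is written in Java). The sample pages, the function names and the shortcut of treating every word as a "topic" are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical crawled pages as (url, text); invented for illustration.
PAGES = [
    ("a.example/1", "hadoop cluster storage"),
    ("b.example/2", "cloud storage pricing"),
    ("c.example/3", "hadoop cloud migration"),
]


def map_phase(url, text):
    """Emit one (topic, url) pair per word; each word stands in for a topic."""
    for word in text.split():
        yield word, url


def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(topic, urls):
    """Collapse one group into a sorted list of distinct pages."""
    return topic, sorted(set(urls))


def run_job(pages):
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = (pair for url, text in pages for pair in map_phase(url, text))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    for topic, urls in sorted(run_job(PAGES).items()):
        print(topic, urls)
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data across the network; the sketch only shows the shape of the computation.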

That points to the main problem of Hadoop itself: it’s designed as a self-contained system, providing the storage and the computation layer at once. That’s why Hadoop distributions typically suggest bare-metal installations in a datacenter and push companies to create the next silo'd world, promising a happy ending after leaving the previous one (separate DWHs with no connection to each other). That comes with dramatic costs, operations effort and a workforce of highly trained engineers, on top of the high cost of connecting on-premise systems to the new silo'd DataLake approach, often mixed up with lift-and-shift operations. Here the next big problem arises, described as “data gravity”. Data simply sinks into the lake until nobody can even remember what kind of data it was and how it can be analyzed. And here the Hadoop journey mostly ends. A third issue comes up, driven by agencies convincing companies to invest in Hadoop and hardware: the talent war. In the end it simply creates the next closed world, just with a fancier name.

The world keeps turning, right now in the direction of the public cloud, but heading toward device edge computing (IoT) and DCC (DataCenter on a chip). Additionally, the kind of data is changing dramatically, from large chunks of data (PBs of stored files from archives, crawlers, logfiles) to streamed data delivered by millions upon millions of edge computing devices. Just dumping data into a lake with no vision beyond getting cheap storage doesn’t help to solve the problems companies face in their digital journey.

Along with the kind of data, the requirements for data analysis change with the way data is created and ingested. The first analysis will be done on the edge, the second during the ingestion stream and the next one(s) when the data comes to rest. The DataLake is the central core and will be the final endpoint for storing data, but the data needs to be categorized and catalogued during the stream analytics and stored with a schema and a data description. The key point in a so-called Zeta-Architecture is the independence of each tool, the “slice it down” approach. The foundation is the data-centered business around a data lake, but the choice of tools for getting data into the lake, analyzing it and visualizing it isn’t set in stone and is independent of the central core.
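
The three analysis stages can be sketched as three independent functions around a toy lake. This is only an illustration under stated assumptions: the `DataLake` class, the sensor readings, the range limit and the dataset naming are all hypothetical, and a real setup would use separate, exchangeable tools for each stage.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DataLake:
    """Toy stand-in for the lake: records at rest plus a catalog of schemas."""
    records: list = field(default_factory=list)
    catalog: dict = field(default_factory=dict)

    def store(self, dataset, record, schema):
        # Register the schema the first time a dataset is seen.
        self.catalog.setdefault(dataset, schema)
        self.records.append((dataset, record))


def edge_stage(readings, limit=100.0):
    """First analysis, on the device: drop readings outside a plausible range."""
    return [r for r in readings if abs(r["value"]) <= limit]


def stream_stage(readings, lake):
    """Second analysis, during ingestion: categorize, catalogue, then store."""
    for r in readings:
        dataset = f"sensor/{r['kind']}"
        schema = {name: type(value).__name__ for name, value in r.items()}
        lake.store(dataset, r, schema)


def rest_stage(lake, dataset):
    """Later analysis, at rest: aggregate one catalogued dataset."""
    return mean(rec["value"] for name, rec in lake.records if name == dataset)


if __name__ == "__main__":
    raw = [
        {"device": "d1", "kind": "temp", "value": 21.0},
        {"device": "d1", "kind": "temp", "value": 999.0},  # filtered at the edge
        {"device": "d2", "kind": "temp", "value": 23.0},
    ]
    lake = DataLake()
    stream_stage(edge_stage(raw), lake)
    print(lake.catalog)
    print(rest_stage(lake, "sensor/temp"))
```

Each stage only depends on the shape of the records and on the lake, so any one of them could be swapped for another tool without touching the others, which is the independence the “slice it down” approach asks for.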

That opens up the possibility to really take advantage of any kind of data, to open new revenue and sales streams and to finally see all data-driven activity not as a cost-saving project (as most agencies and vendors promise) but as a revenue-creating project. Using modern cloud technologies moves organizations into the data-centric world, focusing on business and not operations.

(1) https://research.google.com/archive/mapreduce.html