Hadoop: Where do we go from here?

By T. M. Ravi (The Hive) and Dhruba Borthakur (Facebook)

In just over 7 years Hadoop has progressed from a research project on web-scale search to a widespread foundational infrastructure and analytics platform for many Web 2.0 companies and, increasingly, enterprises. So where does Hadoop go next?

Hadoop was created to address the problem of massive storage and processing scalability at very low costs, by:

• Moving computation to the data instead of moving the data to the compute
• Distributing and scheduling the computation in a clustered scale-out environment
• Adding resiliency in software so commodity hardware becomes practical
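The three ideas above can be illustrated with a toy scheduler. This is a minimal sketch, not Hadoop's actual scheduling code: the function name `assign_tasks`, the block and node names, and the least-loaded tie-breaking rule are all assumptions made for illustration. Each task is sent to a node that already holds a replica of its block, and a failed node is simply skipped in favor of another replica.

```python
from collections import defaultdict


def assign_tasks(block_locations, live_nodes):
    """Assign each block's task to a live node that already stores a replica.

    block_locations: dict of block id -> list of nodes holding a replica
                     (hypothetical names, for illustration only).
    live_nodes: set of nodes that are currently reachable.
    Returns a dict of block id -> chosen node, or None if no replica survives.
    """
    load = defaultdict(int)
    assignment = {}
    for block in sorted(block_locations):
        # Only nodes that hold the data AND are alive can run the task locally.
        candidates = [n for n in block_locations[block] if n in live_nodes]
        if not candidates:
            assignment[block] = None  # every replica is on a failed node
            continue
        # Spread work: pick the least-loaded candidate, ties broken by name.
        chosen = min(candidates, key=lambda n: (load[n], n))
        load[chosen] += 1
        assignment[block] = chosen
    return assignment


blocks = {
    "b1": ["n1", "n2", "n3"],
    "b2": ["n1", "n2", "n4"],
    "b3": ["n2", "n3", "n4"],
}
healthy = assign_tasks(blocks, {"n1", "n2", "n3", "n4"})
after_failure = assign_tasks(blocks, {"n2", "n3", "n4"})  # n1 has failed
```

Because every block is replicated on several nodes, losing `n1` changes where tasks run but does not lose any work, which is what makes commodity hardware practical.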

Along the way, Hadoop’s ability to ingest and store unstructured data and then define a schema later, at the time the data is consumed – reversing the order of traditional data processing – has helped spawn a major rethink in data management.
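A small sketch may make "schema later" concrete. It is illustrative only and assumes raw events stored as JSON lines; the helper `read_with_schema`, the field names, and the sample data are hypothetical and not part of any Hadoop API. The raw lines are stored untouched, and each consumer supplies its own schema at read time.

```python
import json

# Raw data is ingested as-is; nothing is validated or typed at write time.
RAW_EVENTS = [
    '{"user": "alice", "action": "click", "ms": "120"}',
    '{"user": "bob", "action": "view"}',
    "not json at all",
]


def read_with_schema(raw_lines, schema):
    """Apply a schema while reading.

    schema: dict of field name -> (cast function, default value).
    Lines that are not JSON objects are skipped rather than rejected at ingest.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else default
        yield row


# Two consumers read the same stored bytes with different schemas.
latency_view = list(read_with_schema(RAW_EVENTS, {"user": (str, None), "ms": (int, 0)}))
action_view = list(read_with_schema(RAW_EVENTS, {"action": (str, "unknown")}))
```

The same stored data serves both views, and a new question only needs a new schema, not a reload of the data.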

Nevertheless, today Hadoop is not without its challenges. Some of these derive from its success, such as increasing adoption of Hadoop in production, and its use in larger (e.g. Facebook) and more diverse environments (e.g. enterprises). Others relate to the changing nature of today’s infrastructure such as new hardware and Cloud platforms.

The future of Hadoop will be defined by how it rises to these new challenges. Let’s look at them in detail.

1. Ease of Operations – Implementation of Hadoop and integration of data sources is an opportunity ripe for mainstream services organizations working with the vendors providing Hadoop distributions. However, we have a long way to go when it comes to keeping Hadoop running. Tools for debugging and troubleshooting on failure need to continue to evolve. Recovery from NameNode failure and the failover process need to become faster and more seamless. The same goes for updating and upgrading the various components of Hadoop. How does a customer understand precisely how much data was lost due to a failure? How do you back up a cluster and manage all the different copies of data?

2. Scalability of Control Path – The fundamental innovation in Hadoop was in the data path; the emphasis now needs to be on scaling the control path, which is becoming a bottleneck for large data sets. NameNode scalability and reliability have received much attention, with solutions ranging from “Failover from Hot Standby,” where NameNodes have to be continuously synchronized (Cloudera), to distributed and replicated NameNodes (MapR) and federated naming approaches (Hortonworks). Hadoop is also evolving in the scaling of the JobTracker, which farms out MapReduce tasks to specific nodes in the cluster, and in the management and federation of multiple clusters, each with its own namespace.

3. Hardware Evolution – The hardware assumptions around which Hadoop was designed will be quite different from what will be available in the next few years. How does Hadoop change with the availability of 24-32 core processors, 10 Gb network cards, 40 Gb rack switches and flash storage? How does Hadoop embrace in-memory techniques and get latency benefits similar to Berkeley Spark and Shark? Will MapReduce, which is built on optimizing computation in an unbalanced network, be less critical as networks become more balanced? In environments such as Facebook, where large amounts of data are being amassed, the efficient use of resources such as storage has become a concern and a focus for development.

4. Hadoop in the Cloud – The cloud provides a very different substrate to run large-scale data systems. It makes it easy to provision and elastically scale the size of a cluster. However, many assumptions of physical clusters are no longer valid. For example, Hadoop assumes local storage, but data in Amazon Elastic Block Store (EBS) can be located anywhere. While Hadoop faces some challenges in the cloud, there are also significant opportunities for innovation.

5. New Query Systems – Going beyond MapReduce based Pig and Hive, Cloudera Impala and Apache Drill (with contributions by MapR) are new query systems inspired by Google Dremel. Google Dremel has shown the value of a scalable, interactive ad-hoc query system that combines multi-level execution trees and columnar data layout.
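The two ideas credited to Dremel here, columnar layout and a multi-level execution tree, can be shown in miniature. The sketch below is an assumption-laden toy, not Dremel, Impala or Drill code: the function names, the two-level tree, and the sample shards are hypothetical, and it assumes every row in a shard has the same fields. Each leaf scans only the one column the query needs and returns a small partial result, which the root combines.

```python
def to_columns(rows):
    """Pivot a shard from row layout to column layout (one list per field)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def leaf_partial(shard_columns, column):
    """Leaf of the execution tree: touch a single column, return (sum, count)."""
    values = shard_columns[column]
    return sum(values), len(values)


def root_mean(partials):
    """Root of the execution tree: merge the partial results from all leaves."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Two shards of the same logical table, each stored column by column.
shards = [
    to_columns([{"country": "US", "latency": 10}, {"country": "DE", "latency": 30}]),
    to_columns([{"country": "US", "latency": 20}]),
]
mean_latency = root_mean([leaf_partial(s, "latency") for s in shards])
```

An ad-hoc query over `latency` never reads the `country` values, and only small `(sum, count)` pairs travel up the tree, which is where the interactive response times come from.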

6. New Use Case: Real-Time Pipelines – Extending Hadoop as a platform for large-scale batch analytics to a platform for real-time applications has great promise. HBase, the real-time key-value store on top of HDFS, can enable a real-time messaging bus for real-time analytics and decision making similar to the approaches taken by Continuity, Twitter Storm and Facebook Puma.

7. Hadoop ≠ MapReduce – The MapReduce programming model is not a great fit for the graph algorithms that are becoming common in social, web and location problems, among others. Google Pregel and Apache Giraph are examples of systems that enable more complex pipelines for data, with flexibility as to how data is moved around and how frequently it is moved.
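To show why graph work is awkward in MapReduce and natural in a Pregel-style model, here is a minimal single-process sketch of vertex-centric computation. It is not the Pregel or Giraph API; the function `min_label_components` and the edge list are hypothetical. Vertices exchange messages in supersteps and go quiet once their value stops changing, so the graph stays in place between iterations instead of being rewritten by a new job each round.

```python
def min_label_components(edges):
    """Label connected components by propagating the smallest vertex id.

    edges: iterable of (a, b) pairs, treated as undirected.
    Returns a dict of vertex -> smallest vertex id in its component.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    label = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own id to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex stays halted this superstep
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        inbox = outbox  # barrier: messages are delivered in the next superstep
    return label


components = min_label_components([(1, 2), (2, 3), (4, 5)])
```

The loop ends when no messages are in flight, which mirrors the "vote to halt" convention in vertex-centric systems.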

8. Apps on Hadoop – Hadoop will be a mainstream platform when packaged applications that deliver business value to users become widely available. The first generation of Hadoop applications powered search, social and other Web 2.0 companies whose core products were based on data. A variety of traditional and web enterprises are now leveraging Hadoop for custom solutions that address very specific problems within their companies. The next wave of application opportunities and use cases for Hadoop will address functional domains such as IT operational management, security, fraud detection, CRM, multi-channel marketing and advertising, or vertical domains such as healthcare, financial services, transportation and energy.

What do you think are some of the areas where Hadoop will significantly evolve?

Join us at The Hive on Wednesday, Jan 16th as we sit down with founders and executives from Cloudera, Continuity, Hortonworks and MapR for a provocative discussion that explores the future of Hadoop, how it will get adopted in the enterprise and the unique market vision that each of these vendors has.

The Hive incubates, funds and launches companies that use data for intelligent applications. It also provides a shared Big Data platform so that application entrepreneurs can hit the ground running and process huge amounts of data without worrying about the complexities of setting up a Big Data infrastructure themselves.

Special thanks to Prakash Khemani of Facebook for his valuable insights.

2 comments to Hadoop: Where do we go from here?

  • Looking at this clearly and accurately stated set of Hadoop challenges from a commercial market perspective – needs for improvements in security, info and infrastructure management, integration, real-time, memory/core and true cloud elasticity optimization (not just aaS), native analytics and model management, and the ultimate goal of apps – underscores the need for actively drawing the legacy (aka non-Hadoop) big data communities into the innovation circle. It is too much to chew in a single project context. The risk otherwise is for Hadoop to receive merely lip service from the world’s largest BI/analytics players as they re-envelop the more energized market. Counter to hype BTW, our very recent spending intentions data do not suggest that budgets for services-heavy analytics projects are common – they exist but are the exception. So focusing Hadoop on being a more holistic platform in terms of development and deployment, and thus ultimately apps and productivity, looks like a key opportunity. Either that, or let Hadoop remain a domain for the specialist, and let the market bifurcate with Hadoop focused mainly on Web 2.0 types. That said, note that Java never really took off until it emerged with a set of enterprise class runtimes and tools that enabled a WIDE set of developers.

  • Nigel

    The 8 challenges you list are good ones. I think Community vs. Commercial interests might be a 9th, given the number of companies embracing/claiming Hadoop as their own.
