Companies that need to work with large sets of data have a range of big data, open-source frameworks and solutions from which to choose. Each solution has a different set of advantages, disadvantages and ideal applications.
If you’re new to Big Data, you may have heard some of these terms. Below we provide a brief overview and answers to some of the most common questions about Big Data frameworks and solutions. Note: this blog post may be better suited to beginners and new learners, though advanced users might find the comparisons and use cases thought-provoking or a useful refresher. Look for future blog posts and deep dives on these topics.
Hadoop is an open-source, distributed computing framework. It enables users to distribute the processing of large data sets across clusters of computers using simple programming models.
Users can scale Hadoop from a single server to thousands of servers, with each connected server providing its own local computation and storage.
In many situations, when folks refer to Hadoop, they are speaking about the Hadoop Distributed File System (HDFS), a distributed file system that provides high-throughput access to application data.
Hadoop’s advantages include:
- Distributed functionality, which makes the system more robust: if one node fails, the cluster continues to run
- Efficiency: it doesn’t require applications to send huge amounts of data across the network
- Provides linear scaling in the ideal case, enabling easier design
- HDFS can store a lot of data
Some organizations, constrained by budget or time, choose to implement Hadoop because it can process and store large data sets very quickly.
MapReduce is a core component of Hadoop (along with HDFS and Yet Another Resource Negotiator, or YARN).
Google introduced MapReduce as a programming model to facilitate its search processes. The framework serves two basic functions: it parcels out work to various nodes within the cluster (the map step), and it organizes and reduces the results from each node into a cohesive answer to a query (the reduce step).
One of the limitations of Hadoop MapReduce is that the framework can only batch process one job at a time.
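To make the map and reduce steps concrete, here is a minimal plain-Python sketch of the classic word-count example. It only simulates the model on a single machine; a real MapReduce job distributes the map, shuffle, and reduce phases across cluster nodes, and all function names here are illustrative, not part of the Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big ideas", "big data tools"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 3, 'data': 2, 'ideas': 1, 'tools': 1}
```

Each input split could be handled by a different node; only the shuffle step requires moving intermediate pairs between machines.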
Apache Spark Overview
Apache Spark™ is a next-generation, open-source processing engine that combines batch, streaming, and interactive analytics on all the data in one platform via in-memory capabilities. Spark facilitates ease of use, providing the ability to quickly write applications with built-in operators and APIs, along with faster performance and implementation.
It also facilitates robust analytics, with out-of-the-box algorithms and functions for complex analytics, machine learning and interactive queries. Spark’s main components include:
- Spark Core,
- Spark SQL (with DataFrames),
- Spark Streaming,
- Spark MLlib (including ML Pipelines), and
- GraphX (including GraphFrames).
Spark also supports a variety of languages, such as Java, Python and Scala, and has over 1,000 developers contributing to the solution.
This framework can run on top of existing Hadoop clusters. It can process structured data in Hive and stream data from a variety of sources (HDFS, Flume, Kafka, etc.).
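A big part of Spark’s appeal is its chainable, high-level API of built-in operators. Since a Spark installation can’t be assumed here, the sketch below mimics that style in plain Python with a hypothetical `MiniRDD` class whose method names mirror (but are not) Spark’s RDD API; note that real Spark transformations are lazy and distributed, while this toy version evaluates eagerly on one machine.

```python
from functools import reduce

class MiniRDD:
    # Hypothetical stand-in for Spark's RDD: chainable transformations
    # (evaluated eagerly here for simplicity; Spark's are lazy and distributed)
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Transformation: apply fn to every element
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, fn):
        # Transformation: keep elements where fn is true
        return MiniRDD(x for x in self.data if fn(x))

    def reduce(self, fn):
        # Action: collapse the data set to a single value
        return reduce(fn, self.data)

rdd = MiniRDD(range(1, 11))
total = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

The one-line pipeline at the end is the point: expressing filter-transform-aggregate logic this compactly is what makes Spark applications quick to write compared with hand-coded MapReduce jobs.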
Spark’s advantages include:
- Integrated advanced analytics
- Use of data parallel processing
- More efficient than MapReduce
- Continuous micro-batch processing based on its own streaming API
- Significantly faster than Hadoop MapReduce for certain use cases
One complaint users have about Spark is that it can be a memory hog if jobs are not tuned well. New users, unfamiliar with Spark’s nuances, often encounter out-of-memory errors. Spark also lacks its own storage layer, so most Spark users install it on Hadoop to take advantage of Hadoop’s HDFS.
Apache Flink Overview
Flink provides a true data streaming platform that uses a high-performance dataflow architecture. It is also a strong tool for batch processing, since Flink handles batch as a special case of streaming. Flink processes streaming data in real time by pipelining data elements as soon as they arrive.
With its strong compatibility options, Flink enables developers to run their existing MapReduce, Storm, and other jobs directly on Flink’s execution engine.
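The element-at-a-time pipelining that distinguishes Flink from micro-batch systems can be sketched with Python generators: each operator pulls elements through one at a time, so an event flows through the whole pipeline the moment it arrives, with no waiting for a batch to fill. The operator names below are illustrative, not Flink API calls.

```python
def source(events):
    # Source: yields each element the moment it is available
    for event in events:
        yield event

def parse(stream):
    # Operator: transforms each element as it flows through the pipeline
    for event in stream:
        yield event.upper()

def sink(stream, results):
    # Sink: acts on each element immediately, with no batching
    for event in stream:
        results.append(f"ALERT:{event}")

processed = []
sink(parse(source(["login", "payment", "logout"])), processed)
print(processed)  # ['ALERT:LOGIN', 'ALERT:PAYMENT', 'ALERT:LOGOUT']
```

Because the stages are chained generators, "login" is parsed and alerted on before "payment" is even read, which is the essence of pipelined, per-event stream processing.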
Flink’s advantages include:
- A true stream processing framework
- The use of algorithms in both streaming and batch modes
- An aggressive optimization engine
- Speedier processing
- The ability to run existing MapReduce jobs directly
Flink has shown much promise and has even won over some of the original Spark Streaming advocates in the open-source community. Despite its promise, Flink has a long way to go to match the production adoption and popularity of Hadoop and Spark.
Apache Storm Overview
Storm is a framework for streaming data in real time. It uses a task-parallel, distributed computing system. Storm’s topology is designed as a directed acyclic graph (DAG) with spouts, bolts and streams used to process data. In a topology, a spout receives data from external sources and creates streams for bolts, which carry out the actual processing. When users integrate Storm with YARN, they have a truly powerful system for real-time analytics, machine learning and continuous monitoring.
This framework uses its own minion worker processes and Apache ZooKeeper to manage its processes. Unlike Hadoop MapReduce jobs, which eventually finish, Storm topologies are designed to keep processing forever until they are killed.
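The spout-and-bolt topology described above can be sketched in plain Python: a spout emits a stream of tuples, and bolts wired downstream each do one piece of the processing. The class names are hypothetical stand-ins, not the Storm API (real topologies are declared with a `TopologyBuilder` and run distributed, and the stream never ends).

```python
class SentenceSpout:
    # Spout: receives data from an external source and emits a stream of tuples
    def __init__(self, records):
        self.records = records

    def emit(self):
        for record in self.records:
            yield record

class SplitBolt:
    # Bolt: performs processing on each tuple it receives, emitting new tuples
    def process(self, stream):
        for sentence in stream:
            for word in sentence.split():
                yield word

class CountBolt:
    # Downstream bolt: keeps running state across the (conceptually unbounded) stream
    def __init__(self):
        self.counts = {}

    def process(self, stream):
        for word in stream:
            self.counts[word] = self.counts.get(word, 0) + 1

# Wire the DAG: spout -> split bolt -> count bolt
spout = SentenceSpout(["storm streams data", "storm bolts process data"])
counter = CountBolt()
counter.process(SplitBolt().process(spout.emit()))
print(counter.counts)  # {'storm': 2, 'streams': 1, 'data': 2, 'bolts': 1, 'process': 1}
```

Note that `CountBolt` never "finishes" in a real topology; its running totals simply keep updating as events arrive, which matches Storm’s run-until-killed model.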
Storm’s advantages include:
- A set of general primitives for doing real-time computation
- A highly scalable and very fast framework
- The ability to use Storm with any programming language
- Fault tolerance and robust error handling
Hadoop vs. Spark
Comparing these two solutions depends on what one means when referring to “Hadoop.” Some use Hadoop to refer to the entire Big Data ecosystem of tools and technologies, and others use it to mean the Hadoop Distributed File System (HDFS). When referring to the ecosystem, Spark is a subset, or one of the solutions available, in that ecosystem. Often, in comparing Hadoop vs. Spark, people really mean to compare Spark vs. MapReduce (the processing engine for Hadoop).
When Hadoop is used to refer specifically to HDFS, then Hadoop/HDFS and Spark are two fundamentally different systems used for different purposes. Essentially, HDFS is a storage solution, and Spark is a fast, general engine for large-scale data processing.
Companies use Hadoop/HDFS to store data in a reliable and secure manner. Companies use Spark to make sense of that data.
You can think of it this way: if Hadoop were ancient Egyptian hieroglyphics, Spark would be the Rosetta stone to understanding the data in Hadoop.
Spark offers a fast technology for large-scale data processing. The framework provides high-level APIs in Java, Scala and Python. And it provides a rich set of libraries, including stream processing and machine learning.
What is right for you depends on your use case. For companies needing a simple way to store large amounts of structured data, for example a database of medical records, Hadoop/HDFS may work fine.
On the other hand, if a healthcare facility wanted to conduct patient outcome modeling across several scenarios, Spark along with Spark’s machine learning capabilities could be an excellent processing solution that could also be integrated with Hadoop/HDFS for storing that data.
Spark’s enhanced capabilities come with some operational complexity. Companies may need skilled data engineers to develop a Spark-based system.
Spark vs. MapReduce
MapReduce may be a legacy system for many companies that started their Big Data journey when Hadoop first came out. The framework has been the workhorse for large data projects. One of the significant challenges for systems developed with this framework is its high-latency, batch-mode response. Because MapReduce has evolved piecemeal over the last 10 years, some developers complain that it’s difficult to maintain due to inherent inefficiencies in its design and code.
Spark, on the other hand, is more nimble and better suited for Big Data analytics. It can run on a variety of file systems and databases. Some users report that processes run on Spark up to 100 times faster than the same processes on MapReduce.
Spark is also more developer-friendly with an API that is easier for most developers to use when compared to MapReduce.
Apache Flink vs. Spark
When comparing Flink vs. Spark, most people focus on the streaming aspects of each solution. Spark Streaming processes data in small, fast batches: it works on the portion of incoming data collected during a set period of time (referred to as “micro-batching”). While this approach works well for many situations, it may not be the best in all use cases.
If users truly need real-time data streaming analysis, a batch-processing system is not ideal.
For example, if users need streams to run real-time auctions, prevent credit card fraud, or deliver real-time patient alerts, a micro-batch process may not be sufficient. In these cases, Apache Flink may be the better framework, because it provides a purer stream-processing capability with lower latency.
Apache Spark vs. Storm
While Spark and Storm both provide fault tolerance and scalability, each framework uses a different processing model. Spark uses micro-batches to process events, while Storm processes events one by one (when not considering Storm’s Trident solution). This difference in processing model means that Spark has latency measured in seconds, while Storm provides latency measured in milliseconds. Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream, which represents a continuous sequence of RDDs.
Spark’s approach lets you write streaming jobs the same way you write batch jobs, allowing you to reuse most of the code and business logic. Storm focuses on stream processing (some refer to it as complex event processing). This framework uses a fault-tolerant approach to complete computations or to pipeline multiple computations on an event as it flows into the system.
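The DStream idea, and why Spark lets you reuse batch logic for streaming, can be sketched in plain Python: chop the incoming stream into small batches and apply the very same batch function to each one. The function and variable names are illustrative, not Spark APIs, and real micro-batches are bounded by a time interval rather than a fixed element count.

```python
def batch_word_count(batch):
    # Ordinary batch logic: count occurrences in one finite data set
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

def micro_batches(stream, batch_size):
    # Chop the stream into small batches, DStream-style
    # (real Spark micro-batches are bounded by a time interval, not a count)
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

events = ["click", "view", "click", "buy", "view", "click"]
# Reuse the same batch function, unchanged, on every micro-batch of the stream
results = [batch_word_count(b) for b in micro_batches(events, 2)]
print(results)  # [{'click': 1, 'view': 1}, {'click': 1, 'buy': 1}, {'view': 1, 'click': 1}]
```

Because each micro-batch is just an ordinary finite data set, `batch_word_count` needed no changes to run in "streaming" mode; the cost of this convenience is the batch interval of added latency discussed above.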