What is the best big data processing framework?

20778886222_e6a0f46ef3_o

When we look to some of the data processing frameworks, we see questions about if Apache Flink is better than Apache Spark, or about when Spark is a better choice than Hadoop, etc. I would say that the choice of one of these depends on one’s business requirements.

Spark and Flink offer better performance and huge gains over traditional Apache Hadoop MapReduce. Apache Flink, although it can be used in batch processing scenarios for ETL workloads, is more focused on providing stream processing solutions when we need sub-second latency. Flink is a great tool for IoT cases, where vital signals drawn from sensors need to be processed and event flow processing results need to be delivered in real time. On the other hand, Apache Spark is a good solution when response time fits well on seconds or minutes scale. Using Spark, we can also store and perform further analysis over the RDDs generated during workload processing. If response time is not so critical and batch processing jobs are executed out of peek time (overnight), Hadoop may be an alternative. Hadoop has achieved maturity over a decade, and has gained a solid market share due to a large number of tools that integrate the ecosystem, such as Pig (scripting), Hive (DW), HBase (column store), Mahout (machine learning), Giraph (graph), Oozie (workflow), etc. In addition, there are many Hadoop distributions available for production environments, such as Cloudera, Hortonworks, MapR and others.

With business looking forward to provide even faster results, we see a paradigm shift in the big data processing landscape. A few years ago, companies adopted Hadoop, but with the arising of Apache Storm and stream processing, there was a very strong appeal to get real-time responses as well. However, Storm’s stream processing could not guarantee the delivery of a consistent view of the data, although very close to the actual results. This concept is what we call Lambda architecture: Storm was included in the data flow, but its results could later be adjusted with the correct results obtained by batch processing on Hadoop.

With the arrival of stream processors such as Flink, Samza, Apex and Gearpump, the results of stream processing began to offer a consistent view of the data. As a way to simplify the complex workflow of the Lambda model, the market has been adopting a more concise and simpler approach for fast data, the Kappa architecture. Kappa can be adopted for both batch processing and stream processing purposes. Inside of Kappa architecture, there’s a distributed commit log component, and it works as a layer of integration between data producers and data consumers, such as pub-sub system. For this integration layer, we can adopt Apache Kafka or a cloud solution like Amazon Kinesis. In this layer, all the input events generated by the data producers are stored in a durable data storage, and what determines the latency of data delivery is how long data consumers consume the data from the these distributed queues. If you need to process data and deliver results in real time, you may choose Flink for stream processing. If consumer is a batch mode ETL system, Spark may be a great solution too. The advantage of a Kappa architecture is that we have a reliable and source of truth, and solutions like Apache Kafka become the central piece of the Kappa architecture.

Text: Luis Cláudio R. da Silveira
Revision: Pedro Carneiro Jr.


Image credits:

  • “The Commons”, Flickr.com / Internet Archive Book Images – https://www.flickr.com/photos/internetarchivebookimages/20778886222/sizes/l/ (Image from page 293 of “Plant propagation : greenhouse and nursery practice ” (1916)Title: Plant propagation : greenhouse and nursery practice, Identifier: cu31924073971149, Year: 1916 (1910s), Authors: Kains, M. G. (Maurice Grenville), 1868-1946, Subjects: Plant propagation, Publisher: New York : Orange Judd Company, Contributing Library: Cornell University Library, Digitizing Sponsor: MSN)