Spark versus Flink

Posted on

Spark is an open source Apache project that provides a framework for multi stage in-memory analytics. Spark is based on the Hadoop platform and can interface with Cassandra OpenStack Swift, Amazon S3, Kudu and HDFS. Spark comes with a suite of analytic and machine learning algorithm allowing you to perform a wide variety of analytics on you distribute Hadoop platform. This allows you to generate data insights, data enrichment and data aggregations for storage on Hadoop and to be used on other more main stream analytics as part of your traditional infrastructure. Spark is primarily aimed at batch type analytics but it does come with a capabilities for streaming data. When data needs to be analysed it is loaded into memory and the results are then written back to Hadoop.


Flink is another open source Apache project that provides a platform for analyzing and processing data that is in a distributed stream and/or batch data processing. Similarly to Spark, Flink comes with a set of APIs that allows for each integration in with Java, Scala and Python. The machine learning algorithms have been specifically tuned to work with streaming data specifically but can also work in batch oriented data. As Flink is focused on being able to process streaming data, it run on Yarn, works with HDFS, can be easily integrated with Kafka and can connect to various other data storage systems.


Although both Spark and Flink can process streaming data, when you examine the underlying architecture of these tools you will find that Flink is more specifically focused for streaming data and can process this data in a more efficient manner.

There has been some suggestions in recent weeks and months that Spark is now long the tool of choice for analytics on Hadoop. Instead everyone should be using Flink or something else. Perhaps it is too early to say this. You need to consider the number of companies that have invested significant amount of time and resources building and releasing products on top of Spark. These two products provide similar-ish functionality but each product are designed to process this data in a different manner. So it really depends on what kind of data you need to process, if it is bulk or streaming will determine which of these products you should use. In some environments it may be suitable to use both.

Will these tool replace the more traditional advanced analytics tools in organisations? the simple answer is No they won’t replace them. Instead they will complement each other and if you have a Hadoop environment you will will probably end up using Spark to process the data on Hadoop. All other advanced analytics that are part of your more traditional environments you will use the traditional advanced analytics tools from the more main stream vendors.