Update December 18, 2017: Nearly 2 years after this initial post, we discussed the Yahoo streaming benchmark in another blog post where we cover some of the issues we see with modern benchmarking methods.
Until very recently, I’ve been working at Twitter and focusing primarily on stream processing systems. While researching the current state-of-the-art in stateful streaming systems I came across Apache Flink™. I’ve known for some time that having proper, fault-tolerant, managed state and exactly-once processing semantics with regard to that state was going to be a game changer for stream processing so when I came across Apache Flink™ I was understandably excited.
Shortly after my first introduction to Flink I saw that the Flink Forward 2015 conference was about to be held in Berlin and I knew that I had to be there to learn more. I spent a few days in Berlin attending talks and having long discussions with the Flink committers and others in the Flink community. By the end of the conference it was apparent to me that Flink was far and away the most advanced stream processing system available in the open source world. I returned home determined to see what I could accomplish with Flink back at Twitter.
I had an application in mind that I knew I could make more efficient by a huge factor if I could use the stateful processing guarantees available in Flink so I set out to build a prototype to do exactly that. The end result of this was a new prototype system which computed a more accurate result than the previous one and also used less than 1% of the resources of the previous system. The better accuracy came from the fact that Flink provides exactly-once processing guarantees whereas the existing system only provided at-most-once. The efficiency improvements came from several places but the largest was the elimination of a large key-value store cluster needed for the existing system. This prototype system earned my team first prize in the infrastructure category at Twitter’s December 2015 Hack Week competition!
During the course of my work with Flink I also developed a good sense of Flink’s performance capabilities so I was very interested when I read the Yahoo! benchmark that was recently published comparing Storm, Flink, and Spark. The benchmark measures the latency of the frameworks under relatively low throughput scenarios, and establishes that both Flink and Storm can achieve sub-second latencies in these scenarios, while Spark Streaming has much higher latency. However, I didn’t think the throughput numbers given for Flink lined up with what I knew was possible based on my own experience so I decided to dig into this as well. I re-ran the Yahoo! benchmarks myself along with a couple of variants that used the features in Flink to compute the windowed aggregates directly in Flink, with full exactly-once semantics, and came up with much different throughput numbers while still maintaining sub-second latencies. The extended benchmarks are available on GitHub.
In the rest of this post I will go into detail about my own benchmarking of Flink and Storm and also describe the new architecture, enabled by Flink, that turned out to be such a huge win for my prototype system at Twitter.
During the process of working with Flink and building this application on top of it I’ve come to realize just how advanced a system Apache Flink™ actually is. As a result of this whole process and working closely with the team behind Apache Flink™ I’m very happy to report that I’ve decided to join data Artisans to continue this work full time!
For background, in the Yahoo! benchmark the task is to consume ad impressions from Kafka, look up which ad campaign the ad corresponds to (from Redis) and compute the number of ad views in each 10 second window grouped by campaign. The final result of the 10 second windows are written to Redis for storage as well as early updates on those windows every second. This is the same benchmark we are discussing in this section of the post.
All experiments referenced in this post were run on a cluster with the following setup. This is close to the setup used in the Yahoo! experiments with one exception. In the Yahoo! experiments the compute nodes running the Flink and Storm workers were interconnected with a 1 GigE interconnect. In the setup we tested the compute nodes were interconnected with a 10 GigE interconnect. The connection between the Kafka cluster and the compute nodes however was still just 1 GigE. Here are the exact hardware specs:
We used the benchmark programs from the Yahoo! streaming benchmark to measure the maximum throughput of each system while maintaining the best possible fault tolerance.
Below is a diagram giving an overview of the system used in the benchmark. On the left you see the original Flink job as reported in the Yahoo! benchmark. There is custom code to compute and cache the windows locally along with a separate user thread to flush results to Redis periodically. This is a direct port of the Storm job to Flink so it doesn’t take advantage of Flink’s window API for computing windows. On the right you see a similar diagram except in this case the job uses the Flink window API which fault-tolerantly manages the windows and emits them downstream based on user-specified triggers. This version is both simpler and provides better fault-tolerance guarantees: exactly-once.
While the above results are very interesting by themselves, it’s even more interesting when moving to applications that use large windows and many distinct keys. The starting job during Hack Week was a pipeline with over a million events per second, but windows of one hour, each window containing hundreds of millions of distinct keys per window.
Where the Yahoo! streaming benchmark job writes 100 windows per second to the key-value store (100 distinct ad campaigns), the above mentioned stream needs to update millions of entries in the key-value store per second. Scaling that type of streaming application quickly becomes an exercise in scaling out the database. In addition, unless you have state managed by the streaming system with exactly once guarantees you can’t actually do accurate counting in the face of failures. Like the Yahoo! benchmark system, the system in use at Twitter also suffered from these same issues.
The solution we came up with during Hack Week to circumvent the key-value store is shown in the sketch below. It is somewhat similar to the streaming job with Flink’s windows above, but instead of periodically writing current windows into a database (to make them accessible), we directly exposed the in-flight windows to be queried. That way, only final windows need to be written to the database (once per hour). This ends up reducing the load on the database dramatically while still making the state queryable immediately as it’s computed.
To query the window contents, we wrote a custom Flink operator that computes the windows and in addition runs an Akka actor system for the queries. Flink computes the windows and checkpoints them as part of its fault tolerance mechanism, while Akka acts like an RPC system that answers the state queries (“get state for key k, time t”).
Streaming application querying state in the stream processor
To show the value of this approach in the context of the Yahoo! benchmark we created another variant in which we have 1,000,000 campaigns and store and update the windows directly in the key-value store. When you do this the key-value store very quickly becomes the bottleneck in the streaming application. In our benchmark this bottleneck occurred at around 280,000 events/sec. Beyond that there was no way to increase throughput further without either scaling up the key-value store dramatically or getting rid of it altogether.
We chose to get rid of it altogether. This is the power of having fault-tolerant local state. The stream processor itself becomes the key-value store and updating state becomes fast and cheap in-memory processing rather than communication across the network to a remote store.
Using this model we were able to completely eliminate the key-value store bottleneck and achieve a throughput of 15,000,000 events/sec all while making the data directly queryable instantly as it’s processed. This is the new architecture that the new generation of stream processing technology such as Flink enables.
While we were mainly looking into throughput and fault tolerance in these experiments, we also evaluated latency. Because the programs are very different from each other, the latencies should not be compared to each other directly. They merely stand to show that all approaches discussed here achieve acceptably low (sub-second) latency.
While Storm paved much of the way for open source streaming, Flink and Storm represent, really, two different generations of stream processing technology. Flink can be used similarly to Storm (as in the Yahoo! benchmark), but comes with features that support new approaches to building streaming applications. Embracing these new approaches can lead to huge wins in many dimensions:
Special thanks to Steve Cosenza (@scosenza), Dan Richelson (@nooga) and Jason Carey (@jmcarey) for all their help with the Twitter Hack Week project.