I recently did a podcast with Maximilian Michels, an expert in Data Intensive applications. We discussed about Distributed Stream processing right from the very basic.
We discussed the following:
- What is stream processing?
- In simple terms, when you are consuming events from a source, transforming them, doing some aggregation (probably maintaining some internal/external state) and then writing it to a sink. And more importantly all of this in realtime, at high scale.
- Examples: Anomaly detection, analytics, alerting etc.
- How does it compare with Batch Processing and also event processing?
- Batch processing may not be directly compared with stream processing, as there is some batching involved at some level in stream processing as well.
- Batch processing use cases are where you have loads of data to process and you want to leverage the batching mechanism to find a result but may not be realtime. The jobs can be take hours to days depending on the dataset size, but latency isn’t a huge problem here. You want those events to be processed eventually.
- Event processing is slightly different and mainly used in Event Driven architecture where you want to make some decisions based on events. For example: Send an email to a Customer on Order confirmation. So once an order is placed and confirmed, you can produce an event “order_confirmed”, which can be consumed and processed to send an email. This is basic event processing.
- What tools are available to perform stream processing?
- Apache Flink, Kafka Streams, Spark Streaming, Beam are one of the most popular streaming solutions.
- Apache Flink vs Apache Kafka and some factors to decide between the two. https://docs.confluent.io/platform/cu… and https://flink.apache.org/
- At the API level they are very similar and provide similar constructs to run your streaming pipeline, but architecturally they are very different.
- Mainly when it comes to the deployment model, Flink is a cluster deployment whereas Kafka Streams API is an embedded library which can be run along with your Java application.
- Flink cluster needs to be setup before you could run production workloads. Kafka streams expects you to have a Kafka cluster, which you might have anyway which working with Flink.
- Flink is not tied to Kafka (pubsub), it can work with other sources as well.
- Kafka streams is obviously tied to Kafka.
- Kafka streams have less learning curve and best to get started with very easily. If you want to try out stream processing pretty quickly and setup a small stream processing pipeline, Kafka streams is the way to go.
- Flink needs a bit more thought, as it can be operationally challenging if you don’t have prior experience. However Flink scales really well ex. 1000nodes and millions of events per sec, which is really high scale. Kafka might struggle a bit based on the number of partitions on the topic and having 1000 partitions might be very tricky on Kafka (may be not recommended?)
- Use cases where we need Stream processing
- High throughput, low latency workloads
- Data can be parallelised such that distributed nodes can work on them in parallel.
- Ideally when data can be partitioned on a key to reduce network IO for shuffling. Data locality can make a lot of difference when it comes to optimising for network vs memory.
- Why Exactly once processing is a Hard problem and how Kafka and Flink approach that problem?Is it similar to exactly once delivery?
- Exactly once processing is a Hard problem because in a distributed systems failures are inevitable and to get that guarantee at each step the progress must be saved to NOT repeat it in any case. However this is very costly and may not be practical.
- In cases when failures happen, retries are done and with retries exactly once is really hard without native support at the system level.
- Apache Kafka and Flink both has Exactly once processing semantics but it is different from Exactly once delivery. Exactly once processing means that the side effect of the processing will be as if it was processed only once. So your state will be correct even if events are processed multiple times.
- For example Kafka has an idempotent producer, transactional producers and consuming levels like read_committed to guarantee EOP. It works but comes with a cost, latency, IO etc.
- Do we really need exactly once?
- It really needs a lot of coordination though, so “It is best to avoid exactly once wherever possible and work with at least once guarantees”
- Why data locality matters?
- If you have your data locallly you don’t need to get it from a remote location which involves network IO, network is unreliable and spiky so getting a deterministic latency is nearly impossible. So for low latency scenarios, it is best to have the data locally to reduce “shuffling” (how it is called in distributed stream processing world)
- Should we make network calls from the stream processor OR have a join instead? How do decide?
- For example. you are processing a stream of customer data and you want to enrich the events with customers credit score information. So you need the credit score information from somewhere. There is a service that provides this information so you decide to make a service call for each event.
- This is a general pattern and in many cases unavoidable, however ideally network calls if possible should be avoided while processing streams as the remote service may soon become a bottleneck.
- Imagine a cluster of 100 stream processing nodes, hitting a service to get some data. Now you may not be able to increase the number of stream processors beyond a certain limit. Off course the data can be cached etc, but it comes with a risk of stale data.
- Another approach which is recommended in the Stream processing world is to maintain a state of data locally and then performing “joins” on streams as compared to remote calls.
- This state can be maintained consuming change events on the remote entity and maintaining a local state of the entity. For example, in our credit score example, we can consuming “customer_credit_score_event” and keep an updated state locally. In Kafka you could use a K-Table which can be easily streamed on and joined with another stream.
And so many other important considerations based on real production experience in operating stream processing pipelines at a very high scale.
Please watch the episode for full conversation which has so many other insights. Also like, share and subscribe to the channel.