Embracing the world of streaming analytics with Celebrus and KafkaPublished: Tuesday, 31 October 2017 by Ant Phillips, Senior Developer
With the new release of Celebrus, we've introduced Apache Kafka integration across the product. For those of you who aren't yet familiar with Kafka, it is a distributed streaming platform – but what exactly does that mean?
Well in general terms, Kafka enables applications to share data using a publish and subscribe design pattern similar to message queues and enterprise messaging systems. What's different about Kafka is that it does this in a very elegant way which provides extremely high performance and low latency. Both of these characteristics make it very well suited to building real-time streaming data pipelines that react to streams of events.
The integration points for Kafka in Celebrus are threefold:
- Data sources
- Data streams
- File transfer
For the remainder of this blog, I'll be talking about the first two: data sources and data streams. The particular use case I'm interested in solving is as follows:
- Visitors arrive on an e-commerce site and occasionally (hopefully, frequently!) complete purchases
- The system tracks the total spend for each individual
- The total spend uses a sliding window of three months and is updated in real-time
- After each purchase, the visitor's profile is updated in Celebrus with their total spend
- Once a day, the total spend for all customers is recalculated and the profiles are updated accordingly
This is an extremely powerful use case and demonstrates all sorts of capabilities available with Celebrus. So here's how you do it:
1. Kafka topic for purchase events
The first step is to create a Kafka topic where purchase events are sent. Nothing too difficult there – standard Kafka configuration with kafka-create-topic.sh.
2. Kafka data stream
The next step is to send purchase events to this topic in real-time. This is really simple to set up and requires a Kafka data stream (new in our latest release, Update 18. This data stream sends a JSON message to the Kafka topic each time a purchase event occurs (as long as you have configured purchase events in Celebrus). The purchase events include the SKU code, value, quantity and, critically, the visitor identification (typically their e-mail address, account number or username).
Here's a simple example of a purchase event:
3. Kafka streams topology
The next step is to set up a Kafka streams topology. There are many data stream processing platforms available (like Flink, Spark, Storm and Samza) and they all have strengths and weaknesses. Kafka Streams is a really excellent choice when you want a lightweight and flexible data processing engine which has no reliance on a distributed processing framework like YARN or Mesos.
The topology reads each purchase event from the source topic. A custom processor extracts the purchase information including the visitor’s identity. The identity is used to look up that individual's purchase history in a Kafka state store and the incoming purchase details are then added to this data. The aggregate sum of purchases in the last three months can then be calculated and forwarded. The forwarded message also has a different key, namely the visitor's identity. In Kafka terminology, this output is a KTable rather than a KStream.
This highlights a very elegant, but important technical point, around the duality of streams and tables. That's a story for another day though and, without being side-tracked by that discussion, the main point to understand is that each message forwarded by this topology contains an update – something like this: ("email@example.com" -> "456.99") where the key is the visitor's identity, and the value is an aggregated sum of product purchases in the last three months. These messages are written to a second Kafka topic which we will use next.
4. Kafka data source
The final leg of this journey is to set up a Kafka data source. This reads the aggregate events from the second topic. The data source receives the Kafka key (visitor identity) and value (total spending). This data source simply creates a profile event to update the visitor's profile with the new total spending. Job done!
If you've followed along so far, you can see that the general principle is for Celebrus to collect and deliver e-commerce data in real-time to Kafka. A Kafka topology processes this data and writes results to one or more Kafka topics. Celebrus then streams this data back in to operationalize the results as part of the visitor profiles. Once data is part of a profile, then it can be used for many different purposes, including decisioning, analytics, personalization, triggering third party applications and much more besides.
The final cherry on the cake is to enhance the solution with a scheduled refresh of product purchases. This effectively resets customers as purchases leave the current three month sliding window. One very appealing way to do this is with a Kafka punctuate. This facility in Kafka streams enables the full dataset for all visitors to be recalculated and subsequently updated in Celebrus.
If you want to more about how this works, contact us and ask a question!
By the way, more information about Kafka streams can be found at https://kafka.apache.org/documentation/streams/.