Heartbeats: How Synthetic Traffic Keeps Us Running
Let me take you on a journey of how we came to use heartbeats in our application design. It’s a happy story of love and no broken hearts along the way.
What are heartbeats?
What my teams have called heartbeats are a form of synthetic traffic generated by the application itself. The deployed application periodically generates heartbeats at a defined schedule.
Heartbeats provide guaranteed, regular traffic. In the cases where I’ve used them, they have been low volume, in contrast to application traffic, which can vary massively from zero to huge throughput depending on the cluster.
[Image: ChatGPT-generated illustration of the intro above]
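To make the idea concrete, here is a minimal sketch of what a heartbeat emitter could look like. It is not our actual implementation; the emit_heartbeat function, the per-cluster loop, and the 60-second interval are all illustrative assumptions.

```python
import threading
import time


def emit_heartbeat(cluster: str) -> None:
    # Illustrative stand-in: a real service would insert a row, publish a
    # message, or emit a metric tagged with the cluster name.
    print(f"heartbeat for {cluster} at {int(time.time())}")


def start_heartbeats(clusters: list[str], interval_seconds: int = 60) -> threading.Thread:
    # Emit a heartbeat for every cluster on a fixed schedule, in the background.
    def loop() -> None:
        while True:
            for cluster in clusters:
                emit_heartbeat(cluster)
            time.sleep(interval_seconds)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```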
Sounds simple, why do I need heartbeats?
Thanks for asking. Let’s go over some of the use cases that led to us introducing heartbeats.
Use case 1 — Escape
Escape is the name of the service we deploy at Zendesk to support transactional publishing to Kafka alongside a write to MySQL. If an application team wants to update a database record AND publish to Kafka as a single transaction, Escape is the way to do it. The application team writes the details of the message(s) that need to be published to Kafka to an additional table, and Escape does the rest. This unburdens developers from having to solve the complex problem of guaranteeing transactional consistency across two data stores.
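To illustrate the application-side half of that pattern (commonly known as a transactional outbox), here is a rough sketch of the two writes happening in one MySQL transaction. The table names, columns, and pymysql usage are assumptions for illustration, not Escape’s actual schema or API.

```python
import json

import pymysql


def update_and_enqueue(conn: pymysql.connections.Connection,
                       ticket_id: int, new_status: str) -> None:
    with conn.cursor() as cur:
        # 1. The application's own database write.
        cur.execute(
            "UPDATE tickets SET status = %s WHERE id = %s",
            (new_status, ticket_id),
        )
        # 2. The message that should reach Kafka, recorded in the same transaction.
        cur.execute(
            "INSERT INTO kafka_outbox (topic, payload) VALUES (%s, %s)",
            ("ticket_events", json.dumps({"id": ticket_id, "status": new_status})),
        )
    # Either both writes commit or neither does; the Escape service later
    # publishes the outbox rows to Kafka.
    conn.commit()
```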
My team is responsible for deploying and managing Escape. As part of that, we want to alert on things like:
High latency between the data being inserted and published to Kafka
The pipeline halting / messages not flowing through the pipeline
Monitoring for the pipeline halting is an interesting case. The application itself can’t be responsible for emitting a metric or triggering an alert that the pipeline is down; the root cause might be that the application isn’t even running!
Given the application can only emit that it is alive, we can alert when those metrics stop being emitted.
We use Datadog for monitoring, and can use a query like the following to trigger an alert:
sum(last_5m):sum:escape.success{} by {cluster}.as_count() < 1.0
This query will alert if over the last 5 minutes there have been no events handled by Escape successfully. Note that the by {cluster} means that if any database cluster is not seeing events handled, the alert will tell us which cluster is experiencing issues.
So far, so good. What is the limitation of this approach? What if the customers on that cluster aren’t performing any updates? What if it is a staging database cluster that doesn’t have updates over the weekend when the devs are off skiing and surfing? That would trigger false alarms when everything is healthy.
To work around this, we generate heartbeats. For each cluster we are processing, we periodically insert a request to publish to Kafka. We send the messages to a heartbeats Kafka topic that nothing consumes, but doing so gives us full end-to-end confidence in the pipeline. Now we have eliminated those false alarms.
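As a sketch of what that heartbeat generator could look like, the loop below periodically inserts a publish request for each cluster, targeting a dedicated heartbeats topic. The topic name, table name, and one-minute cadence are illustrative assumptions, not the real configuration.

```python
import json
import time

import pymysql


def insert_heartbeat(conn: pymysql.connections.Connection, cluster: str) -> None:
    # Ask the pipeline to publish to a topic that nothing consumes; the publish
    # exercises the whole path and keeps the per-cluster success metric flowing.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO kafka_outbox (topic, payload) VALUES (%s, %s)",
            ("escape.heartbeats",
             json.dumps({"cluster": cluster, "sent_at": int(time.time())})),
        )
    conn.commit()


def heartbeat_loop(connections_by_cluster: dict[str, pymysql.connections.Connection],
                   interval_seconds: int = 60) -> None:
    # Guarantee at least one event per cluster per interval, so the
    # "no events in the last 5 minutes" monitor never false-alarms on quiet clusters.
    while True:
        for cluster, conn in connections_by_cluster.items():
            insert_heartbeat(conn, cluster)
        time.sleep(interval_seconds)
```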
Not only have we solved our initial monitoring concern, we have also built an always-on smoke test for our functionality. If we deploy a version of our code that breaks the flow of data, we will receive an alert for it pretty quickly.
One limitation of this approach is that if there is an issue with the ingestion of metrics by Datadog, an alert will trigger even though everything is healthy. Contrast this with a latency monitor: if metrics aren’t being ingested or there is a backlog of metrics to process, the latency monitor will not fire.
Also note that for a completely new database cluster, the monitor will not fire until there has been at least one successful metric emitted (the grouping by cluster needs to first be aware of all of the clusters). In practice we haven’t found this to be an issue.
Use case 2 — Account Moves
Behind the scenes at Zendesk, the data for a given customer account lives in one of our regions across the globe. We don’t want an account to stay forever in the data center it was originally created in, so we have robust account move tooling that allows us to move an account to a new region with near-zero downtime.
The physical shifting of account data during a move generally has two phases: Bulk and Delta.
Bulk takes a snapshot of everything (e.g. mysqldump). Delta then consumes a change stream from the datastore to handle any updates that might have occurred to the account since Bulk started.
While reading the change stream, our account move processes need to determine when they have caught up.
When there is data to process from the change stream, we can calculate exactly how long ago the event we just read occurred. But what if there is no data to read from the change stream? The absence of data provides an indication that the process might have caught up, but it doesn’t guarantee there isn’t an infrastructure issue preventing the flow of messages.
To gain confidence, we periodically insert messages into an independent table (for MySQL) or collection (for MongoDB), so that we are guaranteed a steady stream of updates we can use to calculate change stream lag.
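Here is a simplified sketch of how those heartbeat rows make change stream lag measurable even when an account has no organic writes, shown for the MySQL case. The table name, column, and the writer’s cadence are assumptions for illustration.

```python
from datetime import datetime, timezone

import pymysql


def write_heartbeat(conn: pymysql.connections.Connection) -> None:
    # Writer side: periodically record the current time in a dedicated table,
    # so the change stream always carries at least one recent event.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO move_heartbeats (emitted_at) VALUES (%s)",
            (datetime.now(timezone.utc),),
        )
    conn.commit()


def change_stream_lag_seconds(event_timestamp: datetime) -> float:
    # Reader side: lag is how long ago the most recently processed event
    # (organic write or heartbeat) actually occurred.
    return (datetime.now(timezone.utc) - event_timestamp).total_seconds()
```

Because a heartbeat arrives at least once per interval, a small lag value means the reader really has caught up, rather than merely being starved of events.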
Thanks to heartbeats, we can confidently say that the change stream has been processed up to a particular point in time.
Summary
This is a small write-up of our use of heartbeats / synthetic events at Zendesk. They provide amazing observability insights for very little overhead. In some instances, we have even managed to obtain continuous testing and monitoring through the use of heartbeats.
Heartbeats are a simple concept, so I’m sure many others are using similar patterns. And if you aren’t, hopefully this has given you some food for thought.
Thanks for reading!