3 tips for maximizing the efficiency of your pub/sub system
My colleague Rob mentioned in a previous blog post that having a powerful pub/sub system can help make internal communications scalable, reliable, and fault-tolerant. In this blog post, I’ll cover the powerful pub/sub system strategies that have helped engineers at AppLovin reliably deal with the network, scalability, and monitoring issues that inevitably come up with any sophisticated infrastructure.
Here are my suggestions for maximizing the efficiency of your pub/sub system.
Don’t underestimate the importance of capacity planning.
Systems need to run autonomously with room to deal with bursty load without interference up to a point. Monitoring solves the first problem of determining when a system is getting more load than before, but if your systems don’t have room to grow, then every little hiccup in traffic will end up with a call to developers or DevOps and time spent adding servers and capacity.
At AppLovin, our pub/sub system has extra capacity to grow, and our DevOps engineers constantly improve networking capacity and other potential bottlenecks so that no one gets a PagerDuty alert in the middle of the night. We are well over-provisioned on the network side, which helps when we add new partners and our traffic increases proportionally. We also made it super easy to set up another broker within a cluster with automatic region determination, automatic topic determination, and even automatic broker host detection. All of this allows our DevOps engineers to easily add more capacity to a specific cluster without developer input. This process is also applicable to most of our other internal services, so capacity planning can be simple and uniform across the board.
Make sure your pub/sub system is fault-tolerant.
Network partitions, servers going down suddenly, and internet outages are all common problems that need to be dealt with in a reliable manner. Even within a data center, there is the possibility of network partitions and communication loss between racks of servers. When you are having issues such as network partitions or internet outages, having a pub/sub system that can handle these issues without (mostly) needing intervention can ensure one less headache.
Our pub/sub system operates on a k-safety replication of messages to ensure that messages don’t get lost in the case of internal or external issues. We use an odd k-safety number in which a single broker failure does not stop normal operations but also prevents network partition issues in which the cluster becomes two separate but valid clusters. We have chosen to have a k-safety of 3, and usually our cluster sizes have been 4 servers. This way, single server failures can be tolerated without creating issues but cluster partitioning is prevented. When servers suddenly die and k-safety of broker hosts is not lost, the other brokers in that cluster can pick up the slack, accept new messages and deliver the replicated messages without issues. Since most of our clusters can run on the bare minimum number of brokers, this allows expected and unexpected maintenance on brokers to not affect traffic. If the k-safety of available brokers is violated, then the cluster stops accepting messages, yet allows already sent messages to be consumed.
Fault tolerance also means having the ability to keep a backlog of messages around when subscribers suddenly die. We run with decent backlog availability, so persisting messages during a subscriber failure is feasible. A fault tolerant pub/sub system can make dealing with many different issues easier.
NSM: Never Stop Monitoring.
One advantage to having a centralized broker system is that it offers the ability to monitor the status of topics conveniently. We have two types of monitoring at AppLovin: active and passive. Active monitoring alerts engineers quickly when thresholds are hit and things aren’t going right, and passive monitoring helps us debug during the issue and postmortem after the fact. Zabbix is one of the active monitoring systems we use to make sure we are notified quickly. As all messages must pass through at least one broker cluster, sending metrics to Grafana, a data visualization system we use, allows us to easily monitor the changes in a topic passively. Grafana is helpful in postmortem examination of a problem once it’s been mitigated.
When an subscriber goes down due to server failures, database issues, or whatever else, our monitoring systems can detect an uptick in messages sitting on a broker and alert us, sometimes even before the subscriber can send out alerts. When brokers themselves are having issues, monitoring those systems can provide insight into issues.
One of our brokers in a cluster was recently inundated with requests from different subscribers and publishers, increasing network activity nearly 2x compared to the other servers in the same cluster. The connection increase saturated its network card and make connections to it very slow. The active alerts were going off for different issues: slow publisher uploads, slow subscriber downloads, slow servers, and slow cross-datacenter routing. The passive monitoring system revealed the cause (see figure above): an equivalent rise and drop in network activity for the affected and unaffected servers respectively. A DevOps engineer, looking at the network graphs, realized it was a network saturation issue. As soon as the erring broker was shut down, the rest of the cluster returned back to normal. Due to available capacity and replication of messages that are part of our system, the cluster was able to deal with the increased load with no loss of data. We determined the system was hitting upper limits and a pending plan to upgrade the network was fast tracked and implemented immediately.
Taking advantage of a pub/sub system that can handle issues gracefully, scale smoothly, and be monitored easily, has helped AppLovin engineers create an architecture that allows us to manage and leverage large amounts of data smartly. These three strategies can go a long way toward making any pub/sub system efficient and stable.