Embracing Intelligent Monitoring At AppLovin
Early in AppLovin’s history, we experienced a mysterious revenue degradation that seriously affected our business. Our engineers spent the next few weeks poring over logs, analyzing the logic of the core components, conducting multiple code reviews, running experiments, and hypothesizing about potential causes of the degradation. It was painful because our monitoring was inadequate and didn’t help us find and fix the problem.
After a few weeks, we isolated and resolved several minor bugs that had joined forces to produce this costly performance issue. Painful as it was at the time, we learned a valuable lesson that to this day infuses our product and our work: Intelligent monitoring goes a long way with respect to protection and optimization. Indeed, perfecting AppLovin’s monitoring infrastructure became of paramount importance, and that means availability monitoring, application monitoring, flexible visualization, and a learning system for alerting.
Here are a few things that we learned about monitoring that others might find useful:
Start with availability, hardware and network monitoring
Availability monitoring is now the first pillar of the AppLovin monitoring infrastructure. All mission-critical services that need to be up are monitored by an array of monitoring services: Pingdom, Site24x7, and several ISP/vendor-provided services that operate at the hardware and network level. Our hardware is monitored with Icinga, and we use Smokeping for mesh network monitoring.
All of our services have HTTP endpoints that can be easily integrated with monitoring probes. Since all services extend from a common base, every new service has availability monitoring by default.
We found that standard plugins were not enough to cover all of the cases, so we developed about a hundred custom plugins that monitor and alert on different aspects of the hardware.
For instance, failed RAM modules cause servers to fail. The failures, of course, happen at the worst time possible, when no one is watching. We came up with a probe that checks dmesg and mcelog for errors, which allows us to detect RAM failures earlier and remove machines that are about to fail before they actually do.
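To illustrate the idea of such a probe, here is a minimal sketch in Python. It is not the plugin described above: the error patterns, the `critical_at` threshold, and the function names are assumptions made for the example, and real kernel messages vary by hardware and kernel version. The only convention taken as given is the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical patterns chosen for illustration only; tune them to the
# messages your own hardware and kernel actually produce.
ERROR_PATTERNS = [
    re.compile(r"machine check", re.IGNORECASE),
    re.compile(r"hardware error", re.IGNORECASE),
    re.compile(r"EDAC .*\b(CE|UE)\b"),
]


def find_memory_errors(log_text):
    """Return the log lines that match any of the error patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in ERROR_PATTERNS)
    ]


def check_memory(log_text, critical_at=5):
    """Map the number of matching lines to a Nagios-style status and message."""
    hits = find_memory_errors(log_text)
    if not hits:
        return OK, "OK - no memory errors found"
    status = CRITICAL if len(hits) >= critical_at else WARNING
    label = "CRITICAL" if status == CRITICAL else "WARNING"
    return status, f"{label} - {len(hits)} memory error line(s), latest: {hits[-1]}"
```

In a real probe, `log_text` could come from something like `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, and the script would print the message and exit with the returned status so the monitoring system can act on it.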
And of course in the spirit of community, we’ve open-sourced several Aerospike Nagios plugins that our ops team created.
Monitor application and third party services
At AppLovin, application monitoring is the second pillar of our monitoring framework. The basic HTTP-based availability endpoint returns a JSON object that contains basic stats about the service. Different services have different stats: messaging services report queue sizes, services that cache data indicate how stale the cache is, data processing services return the number of operations per second, and so on. This allows smarter versions of monitoring and alerting plugins to be attached to each service. AppLovin engineers build most new features so that they can report stats to the monitoring system.
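As a rough sketch of how a stats-aware check might work, the Python below builds a JSON status payload and compares its stats against upper limits. The field names (`service`, `status`, `stats`, `queue_size`, `cache_age_seconds`) and both function names are hypothetical; the actual payload format is not described here.

```python
import json


def build_status(service, stats):
    """Serialize a service's basic stats the way a status endpoint might return them."""
    return json.dumps({"service": service, "status": "up", "stats": stats})


def evaluate(payload, limits):
    """Compare reported stats against per-stat upper limits.

    Returns the names of stats that are missing or above their limit.
    """
    stats = json.loads(payload).get("stats", {})
    problems = []
    for name, limit in limits.items():
        value = stats.get(name)
        if value is None or value > limit:
            problems.append(name)
    return problems
```

For example, a messaging service reporting `{"queue_size": 1200, "cache_age_seconds": 30}` against limits of 1000 and 60 would have only `queue_size` flagged. Treating a missing stat as a problem is a design choice for this sketch: a service that stops reporting a number is often as worrying as one reporting a bad number.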
But we wanted to make sure that our monitoring approach was comprehensive, and that meant monitoring our partners, not just ourselves, to help keep us all secure. That way when there’s a problem and we’ve determined that the issue isn’t on our end, we can alert our partners and help them zero in on what’s awry on their end.
But at this point, much as they were optimized, the hardware monitoring and the application monitoring were distinct worlds and didn’t integrate well. We needed a way for everyone to be able to use all the monitoring easily. That was our next challenge.
Spare no expense on metrics visualization
If our first pillar in our updated intelligent monitoring strategy relied on ramped up hardware and network watch systems, and our second pillar was extensive application monitoring, our third pillar depended on visual rendering of our metrics. Back in 2013, when the degradation incident occurred, we had a fairly limited number of metric data points to report. However, as our network grew, the monitoring probes reported more and more data, and we began to look for an optimal display solution. We started with Munin, which worked well enough for ~1,000 data points per minute, but the tool became less reliable as volume increased.
So the second iteration of the display solution was a custom-built graphing system that used MySQL for storage and Highcharts for display. This solution lasted a bit longer, but unfortunately we wound up spending more time supporting the metric display framework than we did adding new metrics. So we shifted gears and replaced the custom-built system with Graphite and Grafana for our graphing needs. This is the combination we rely on today for visualizing metrics. Some services write directly to Graphite, while for others, data points are collected through plugins. Enhanced visualization has helped gather the monitoring information collected from all levels of our system in one spot.
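For readers unfamiliar with writing directly to Graphite, here is a minimal Python sketch of its plaintext protocol, in which each data point is a line of the form `<metric path> <value> <timestamp>`. The host name and the metric path in the usage note are placeholders, not our actual setup; port 2003 is Carbon's default plaintext port.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one newline-terminated line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="graphite.example.com", port=2003):
    """Send pre-formatted metric lines to a Carbon plaintext listener over TCP.

    The default host is a placeholder; point it at your own Carbon instance.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)
```

A service could call `send_metrics([format_metric("ads.requests", 42)])` once per reporting interval. Batching several lines into one connection keeps overhead low when a server reports hundreds of data points per minute.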
One thing that’s really advanced data rendering is the advent of large 4K HD monitors. At AppLovin, we have nine 65” 4K monitors around our office (including one in the billiard room) that show Grafana graphs with various monitored stats in a dashboard format. Typically all it takes is a glance at a cluster of displays to determine if the system is operating as expected.
One of the key advantages to centralized graphs is that they correlate application metrics with server metrics. For instance, one can easily notice that CPU load has gone down significantly because of a reduction in request volume. That, in turn, correlates with a new frequency cap feature deployment. Spotting patterns like that makes isolating issues a lot easier.
Having responsive graphing of any metric you can think of means that the engineers and business team can get insight into the system instantly. Grafana offers zooming and easy addition of different data sources for comparison, which is key to truly useful visualization.
Implement feedback loops relevant to the business team
The last essential component of AppLovin’s monitoring system is a set of alerts that go off when deviations in business-critical stats (revenue, impressions, ad requests, win rate, etc.) are detected. We are very careful to make sure that only relevant deviations trigger alerts, because too many false positives dull people’s attention and slow their reaction.
Now we are taking our monitoring and alerting system one step further by adding a feedback loop that will allow the system to learn from its correct alerts and misses. Since all of the data are available in one place, we are using multiple inputs to detect discrepancies. It is an interesting task, as we have to account for daily, weekly and monthly trends that might impact alerting tolerance. We’ve outgrown simple thresholds on metrics and now use something that’s more intelligent.
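To show why periodic trends matter for alerting tolerance, here is a small Python sketch that compares a new sample against past values from the same position in the cycle, rather than against a fixed threshold. This is an illustration only and not the system described above: the median baseline, the 30% tolerance, and the function names are all assumptions.

```python
from statistics import median


def seasonal_baseline(history, period, index):
    """Collect past values observed at the same position in each period.

    history: evenly spaced samples, oldest first.
    period:  samples per cycle (e.g. 7 for daily samples with a weekly cycle).
    index:   position of the new sample, counted from the start of history.
    """
    return [history[i] for i in range(index % period, len(history), period)]


def is_anomalous(history, period, new_value, tolerance=0.3):
    """Flag new_value if it deviates from the seasonal median by more than tolerance."""
    peers = seasonal_baseline(history, period, len(history))
    if not peers:
        return False  # not enough data to judge
    expected = median(peers)
    if expected == 0:
        return new_value != 0
    return abs(new_value - expected) / abs(expected) > tolerance
```

With daily samples and a weekly cycle, a metric that routinely dips on weekends would not trigger an alert for a normal weekend value, but the same value on a weekday would. A feedback loop could then adjust `tolerance` per metric based on which alerts turned out to be correct.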
Bonus: everyone is more responsive to customers
Another bonus of ramping up our monitoring is that it has helped unite the business and engineering sides. The members of the engineering team have learned more about the company’s business affairs, while the members of the business team have learned about the technology that backs AppLovin’s operation. This sharing of knowledge has only added to our culture, where we value transparency, and people truly care about each other and feel responsible for the success or failure of the company.
The business team is seeing benefits from the monitoring system as well. Combining application-level alerts with business goals and metrics, and creating automated email reports and alerts, enables the team to see what’s happening and make predictions across a wider spectrum of our customers, and that means faster and better results for them.
As of February 2015 we are processing about 500,000 metric data points per minute, coming from ~1,000 servers across nine data centers. A server has an average of 550 monitoring probes. Almost all of the monitoring systems work at minute-level granularity, and we are moving towards per-second granularity. We even monitor weather conditions around our data centers.
Building a scalable and versatile monitoring system has accomplished a number of goals for the company: issues can be isolated a lot faster through metric correlation; our anxiety is reduced when we’re deploying new code; our alerting systems have become a lot more intelligent; and through the work on the monitoring system, business and engineering teams have grown a lot closer. Best of all, we’re confident that in addition to protecting our revenues, our enhanced, intelligent monitoring system has protected the revenues of our customers.