Statsd HTTP Proxy with batching

At $JOB, we’re using some hosted systems that don’t allow for Prometheus-style exporting of metrics, essentially requiring push-style metric collection. Because there aren’t many well-maintained push-style instrumentation clients that track counters inside a system (at least, not in Node.js), we needed to reach for a classic: statsd.

Statsd is an effective tool, but it relies on UDP for client submission of data, which complicates things for us again. Because we can’t guarantee that the two systems will be on the same network, and because we would like to encrypt traffic between systems (where possible), standard statsd won’t work. Fortunately, someone wrote an effective http -> statsd proxy in Golang. This tool gives external services a secure (TLS) connection to write to, and then emits aggregated stats through the standard statsd protocol on the other end. Now we can ensure UDP traffic routes only within a k8s cluster (or other local network), assuaging some concerns about encrypting traffic. So we deployed the proxy, configured the application to emit statsd-style metrics to an HTTP endpoint, and everything worked. End of blog post.
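For anyone who hasn’t touched statsd before: each metric is a tiny plaintext line (`name:value|type`) fired over UDP. A minimal sketch of what a plain statsd client does, in Node (the metric names are made up for illustration):

```typescript
import * as dgram from "dgram";

// statsd wire format: <metric.name>:<value>|<type>
// Types: c = counter, g = gauge, ms = timing.
function formatMetric(name: string, value: number, type: "c" | "g" | "ms"): string {
  return `${name}:${value}|${type}`;
}

// Fire-and-forget UDP send -- no handshake, no encryption, no delivery
// guarantee. This is exactly why cross-network statsd was a non-starter.
function sendMetric(socket: dgram.Socket, host: string, port: number, line: string): void {
  const buf = Buffer.from(line);
  socket.send(buf, 0, buf.length, port, host);
}
```

The proxy keeps this UDP leg, but confines it to the local network; external writers talk to the proxy over TLS instead.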

Well…

So while sending UDP packets from the statsd proxy to an actual statsd listener (we used telegraf, ‘cause lightweight golang services are nearly always a good choice) is well-trodden ground, we started to see some performance problems writing to the proxy itself. Obviously we maintained a persistent connection to avoid re-negotiating TLS, but each individual metric was still written as its own request.
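The problematic write pattern looked roughly like this. The `/metric` path below is a placeholder, not the proxy’s actual API; the point is the one-request-per-line shape, even with the TLS connection reused:

```typescript
import * as https from "https";

// Reuse one TLS connection across requests so the handshake is
// negotiated once, not per metric.
const agent = new https.Agent({ keepAlive: true });

// Request options for a single-metric write. "/metric" is a
// hypothetical endpoint used for illustration only.
function metricRequestOptions(host: string): https.RequestOptions {
  return {
    host,
    path: "/metric",
    method: "POST",
    agent,
    headers: { "Content-Type": "text/plain" },
  };
}

// Each metric line still becomes its own HTTPS request body --
// fine at low volume, painful at high volume.
function writeMetric(host: string, line: string): void {
  https.request(metricRequestOptions(host)).end(line);
}
```

Keep-alive saves the handshake, but the per-request overhead (headers, round trips, I/O on both ends) still scales with the number of metrics.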

Anyone who’s worked with large volumes of data before would immediately hear an alarm bell ringing: BATCHING!!! If we can find an easy way to write groups of metrics to the statsd proxy, we’ll drastically reduce our overall network traffic (because of course we’re compressing our writes…) and total application I/O time on both sides of the pipe. So after some work, our fork of the statsd-http-proxy allows for a new /batch endpoint. Full documentation is here.
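A minimal sketch of the client side of batching. The payload shape here (newline-separated statsd lines, gzip-compressed) is my assumption, not necessarily the fork’s actual /batch contract — check its documentation for the real format:

```typescript
import { gzipSync } from "zlib";

// Buffers statsd lines and drains them as one compressed payload,
// so N metrics cost one request instead of N.
class MetricBatcher {
  private buffer: string[] = [];

  constructor(private maxBatch = 500) {}

  add(line: string): void {
    this.buffer.push(line);
  }

  // Drain up to maxBatch lines into a single gzipped body:
  // one statsd line per row. Returns null when there's nothing to send.
  flush(): Buffer | null {
    if (this.buffer.length === 0) return null;
    const payload = this.buffer.splice(0, this.maxBatch).join("\n");
    // POST this body to /batch (with Content-Encoding: gzip, in this sketch).
    return gzipSync(Buffer.from(payload));
  }
}
```

In practice you’d call `flush()` on a timer or once the buffer hits `maxBatch`, whichever comes first, so metrics never sit in memory too long.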

Switching to these batched writes of stats meant we no longer drop metrics from our push-only applications, which has made for much better monitoring of these systems. It’s kinda neat!