Having just had a rather long meeting about part of the monitoring infrastructure at
$JOB, I started thinking about push vs. pull monitoring. And Doctor Dolittle (https://en.wikipedia.org/wiki/List_of_Doctor_Dolittle_characters#The_Pushmi-pullyu).
Let’s run it down real quick:
Push-based monitoring requires the initial source of the data to collect and then push that data to whatever aggregation/alerting platform is used. Essentially, we “push” data from one place to another.
- NSCA checks in Nagios
- Active Checks in Zabbix
- Can’t easily verify state of collection tool -> A lack of data might mean the agent has failed, or it might mean the host/device has been decommissioned.
- Much more difficult to control what data is put in your monitoring system -> because agents produce the data independently, it is (potentially) easy for any downstream system to produce data and ship it.
- Simpler initial deployment in existing infrastructure (open at least one port on one host, rather than at least one port on every host).
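To make the push model concrete, here is a minimal sketch of an agent shipping one sample to a collector. The JSON-over-UDP wire format and the names `build_payload` and `push_sample` are assumptions for illustration only; real tools each use their own protocols.

```python
import json
import socket
from typing import Tuple


def build_payload(host: str, metric: str, value: float, timestamp: float) -> bytes:
    """Serialize one sample as a JSON datagram (an illustrative wire format)."""
    return json.dumps(
        {"host": host, "metric": metric, "value": value, "ts": timestamp}
    ).encode("utf-8")


def push_sample(collector: Tuple[str, int], payload: bytes) -> None:
    """Fire-and-forget: the agent initiates the connection and sends the data."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, collector)
```

Note that only the collector needs a listening port here. The monitored host just makes an outbound connection, which is where the simpler deployment story comes from.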
Pull-based monitoring has a central system reach out to other hosts and execute commands or collect data which are then “pulled” back to the central server. Essentially, one (or multiple) hosts “pull” data from locations.
- NRPE checks in Nagios
- Passive checks in Zabbix
- Requires more open network -> The collection agent needs to be able to reach every host it collects from and scrape data
- Registration/Discovery needs to be centrally managed -> The collection agent must know about the hosts before collection begins.
- Ensures you only collect the things you have explicitly defined to be collected.
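The pull model can be sketched the same way. This is a rough illustration, not how any particular tool does it: the `TARGETS` list, `scrape_all`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real network request so the example runs on its own.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical inventory: pull requires the collector to know every target up front.
TARGETS = ["web01:9100", "web02:9100", "db01:9100"]


def scrape_all(
    targets: List[str],
    fetch: Callable[[str], float],
) -> Dict[str, Optional[float]]:
    """Poll each registered target; record None when a scrape fails."""
    results: Dict[str, Optional[float]] = {}
    for target in targets:
        try:
            results[target] = fetch(target)
        except OSError:
            # The target is registered but unreachable, so the gap is a real signal.
            results[target] = None
    return results


def fake_fetch(target: str) -> float:
    """Stand-in for an HTTP or agent request; db01 simulates an outage."""
    if target.startswith("db01"):
        raise ConnectionRefusedError(target)
    return 0.5
```

Calling `scrape_all(TARGETS, fake_fetch)` returns a value for the two web hosts and `None` for `db01`. Because the collector only ever asks targets on its list, nothing outside that list can put data into the system.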
Of course, both systems have advantages and disadvantages, and your actual workloads/architecture will determine which works best for you. For example, Prometheus is effective in Kubernetes because as each container is created/destroyed, the orchestration engine is immediately aware, and Prometheus can learn about the change from it. You’ll also see Prometheus authors argue that pull is the “slightly better” system.
For me, there isn’t a clear winner, except in large existing systems (e.g. multi-DC deployments of large systems without current monitoring). This is purely because the deployment story around push systems is much easier:
"Please install this agent on your hosts with this config and then you'll magically see metrics/alerts appear over here."
I’ve had far more success deploying/advocating for/maintaining push-based systems in environments with large amounts of existing infrastructure. It’s easy for disparate teams to reason about, and deployment is easy even in systems without modern config management. It also presents itself as a “smaller attack surface”, which security teams tend to appreciate.
Undoubtedly this discussion will continue, largely because neither system has an absolute advantage over the other.
In both cases, we can create aggregation points: servers that either are the “collectors” for a region/datacenter/pod/etc. or are the submission points for the same. From that aggregation point we can (if needed) switch from push to pull or vice versa. The idea is that both models are flexible and both can be effective in disseminating data.
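As a rough sketch of switching models at an aggregation point, the class below accepts pushed samples from local agents and lets a central collector pull the latest values. The `AggregationPoint` class and its method names are hypothetical, and the network listener that would sit in front of it is left out.

```python
import threading
from typing import Dict, Tuple


class AggregationPoint:
    """Accepts pushed samples and exposes the latest values for a puller."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Dict[Tuple[str, str], float] = {}

    def submit(self, host: str, metric: str, value: float) -> None:
        """Push side: local agents call this (e.g. behind a network listener)."""
        with self._lock:
            self._latest[(host, metric)] = value

    def snapshot(self) -> Dict[Tuple[str, str], float]:
        """Pull side: a central collector reads a copy of the latest values."""
        with self._lock:
            return dict(self._latest)
```

Only the latest value per host and metric is kept in this sketch. A real aggregation point would also have to decide how long to keep a value once the host stops pushing.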
Good things about Pull
Because we have to explicitly add hosts, we can validate incoming data more effectively. This allows us to perform “missing data” checks more successfully, too. If a host we expect to answer requests stops answering them, we can alert on that accurately. With a push architecture, we can’t say for certain whether missing data from a host is expected or not. Perhaps the host was shut down briefly? Perhaps there is other maintenance occurring that is causing lag in stat production? Without the stronger knowledge that pull can give you, alerting on a lack of data is fraught.
Good things about Push
Performance. Push-based architectures have distinct performance advantages in resource-constrained situations. For example, try running Prometheus on some Raspberry Pis while the Pis are heavily loaded. There will be random and frequent data drops. Those same hosts running Telegraf will consistently ship data. This is largely because there are fewer steps for a push-based system to produce and send data.
Challenges with Pull
Pull-based monitoring essentially requires service discovery or registration of some sort. In the case of systems like Kubernetes (the environment Prometheus is most closely associated with) this is simple. Without a system like this, ensuring hosts are registered/configured in your monitoring system does require some extra work, and hosts are frequently forgotten or misconfigured. Basically, without solid service discovery, pull-based monitoring becomes more challenging.
Pull-based monitoring can also run into scaling problems if you have one central instance pulling from all systems. While this particular issue is easily solved by creating hub-and-spoke systems (or other designs where smaller chunks of data are stored on different machines), it is an issue to be concerned with.
Getting pull data from all hosts also requires opening ports on every host that the central host will need access to. In many cases, this isn’t terribly problematic, but with hosts directly connected to the internet, this can create a significantly larger attack surface.
Challenges with Push
The major problem with many push-based systems is knowing what an absence of data means.
Is data missing because a host was decommissioned or because the collection tool has failed?
We lost data on this range of hosts for the last 6 hours. Was there a network error? Or were they intentionally disabled?
While pull-based systems can also have missing data, it’s much easier to know that data missed during a pull is actually missing due to problems and not intentional changes.
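One partial mitigation on the push side is a staleness check against an inventory. The sketch below assumes you maintain a `decommissioned` set somewhere outside the monitoring system, which is exactly the knowledge a pure push setup does not have by default. The function name and data shapes are illustrative.

```python
from typing import Dict, List, Set


def classify_silent_hosts(
    last_seen: Dict[str, float],
    now: float,
    max_age: float,
    decommissioned: Set[str],
) -> Dict[str, List[str]]:
    """Return hosts whose data is stale, split by whether the silence is expected."""
    report: Dict[str, List[str]] = {"expected": [], "unexpected": []}
    for host, seen in sorted(last_seen.items()):
        if now - seen <= max_age:
            continue  # Data is recent enough; nothing to report.
        key = "expected" if host in decommissioned else "unexpected"
        report[key].append(host)
    return report
```

Only the hosts under `unexpected` are worth alerting on. Without the inventory, every silent host lands in that bucket, which is the ambiguity described above.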
Personally, I prefer push-based collection as I find it easier to deploy and manage. While addressing missing data is a bit more difficult, the simpler deployment, smaller network changes needed, and reduced load that push allows for make it a great choice for creating a solid monitoring pipeline, particularly in existing networks.