Why Zabbix is Bad

$JOB was running Zabbix when I got there. And a lot of people at $JOB really liked Zabbix. While I personally wouldn’t choose it, it does have some good design choices. But blog entries aren’t about positive things! So…

(Quick note: this list was originally written before Zabbix 5.0 was released.)

Dumb Zabbix Stuff

  • Removing an item/template that is spread across many host objects doesn’t always work
    • As an install grows, the SQL command used to remove an item/template doesn’t change, so the same delete has to touch more and more rows.
    • Queries in Zabbix appear to run under a default timeout.
    • Because of this, an expensive delete frequently times out and puts Zabbix in a bad state.
  • Maintenance objects don’t expire properly.
    • A host object attached to a maintenance object can’t be deleted. This is frustrating as maintenance objects pile up.
    • This winds up requiring manual cleanup of maintenance objects over time, which means additional external code.
  • Item API can’t create formulas the UI can.
    • The API claims formulas have a limit of 255 characters, but the UI accepts a formula with 500 characters successfully.
  • Can’t get queue data through the API
    • The queue data reports which items haven’t checked in for some time.
    • Getting a report of these would help with automating management of dead/inactive hosts.
  • API marks type of values but returns everything as strings.
    • Makes validation of returns complicated, as you may expect an integer, but you will always get strings back.
  • API does not allow for pagination of results
    • Larger instances (100,000+ items/triggers) will simply return a 5xx error rather than the items/triggers requested.
  • Trends (aggregation) use averages for storage
    • Causes smoothing and loss of accuracy in data over time.
    • Should use sum instead.
  • Items can’t be tagged
    • Triggers, templates, etc. can all be tagged, but items can’t. Frustrating because some reporting would benefit from item tagging.
  • Can’t schedule ack/silence for a single trigger
    • In many systems (Nagios, etc.) you can set downtime for individual items on host objects to avoid spamming people without disabling an entire host object.
    • https://support.zabbix.com/browse/ZBXNEXT-721 (open ticket for 8 years.)
    • Support’s solution is to edit triggers directly so they don’t alert during the time range (which is a terrible solution).
  • Can’t re-trigger active items/triggers for immediate runs
    • This is no different from other tools.
    • Passive items in many systems can be easily re-run; active items require work on the host in question.
  • No flapping detection exists
    • this is largely because of how trigger expressions work and can’t really be held against Zabbix
    • by increasing time ranges of data compared or other “flattening”, we can largely prevent flapping
  • UI Problems…
    • Back button can re-run POST commands, potentially damaging the DB
    • Data can randomly cache in UI elements (e.g. alias in the user search is populated unexpectedly), causing confusion
  • Inability to properly migrate/manage certain portions of the system
    • Global regexps can’t be managed except in the UI or at the database level.
  • Can’t test/validate actions (https://support.zabbix.com/browse/ZBXNEXT-97)
    • no way to be sure that a created action would perform what was expected without creating artificial failures
    • this isn’t far off from other systems, but given the complexity of Zabbix actions it is a bit frustrating
  • Log lines can be randomly truncated
    • leads to loss of clarity in debug situations
    • e.g. 0" of type "string" is not suitable for value type "Numeric (unsigned)" <- no timestamp, etc.
  • General Zabbix logging is poor
    • No level indication, so it isn’t clear when an event is informational or an actual error.
    • Debug logging essentially spams queries into the log and drowns out any useful information.
  • Elasticsearch (ES) integration seems to silently fail
    • Possibly caused by stuck subprocesses?
    • We were never able to suss this out and moved to a different backend.
    • May be resolved in Zabbix 5.x
  • Zabbix may still suffer from the 32-bit timestamp limit problem
    • This is potentially resolved in 5.x
  • Zabbix internal server stats can only be collected by a direct connection to the server
    • That may also require a pull (passive) connection.
    • Essentially, the Zabbix agent running on a Zabbix server host can’t be configured as active.
  • A single host object can’t have two separate items with the same key attribute.
    • This is simply a strange restriction. Each individual item key on a host must be unique.
    • Basically, if we have two templates that happen to create the same item, one of them will fail, and we won’t be able to attach both templates.
  • Zabbix proxies use a relational database for local caching.
    • This isn’t like… illegal, but it is bad practice.
    • MySQL doesn’t release data back to the OS in any consistent way, which can cause disk space problems even when the actual cache is small.
    • Our deployment landed on using MySQL for proxies because of faulty initial performance testing. We’ve switched to using sqlite3.
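
On the API returning everything as strings: one way to cope is a small coercion layer on the client side. This is a minimal sketch, assuming a hand-maintained schema. `ITEM_SCHEMA` and `coerce_record` are hypothetical names, and the field list is only an illustration, not the full item object.

```python
from typing import Any, Callable, Dict

# Hypothetical, hand-maintained mapping of field name -> converter.
# Anything not listed is left as the string the API returned.
ITEM_SCHEMA: Dict[str, Callable[[str], Any]] = {
    "itemid": int,
    "hostid": int,
    "status": int,
    "value_type": int,
}


def coerce_record(record: Dict[str, str],
                  schema: Dict[str, Callable[[str], Any]]) -> Dict[str, Any]:
    """Return a copy of `record` with known fields converted to real types."""
    out: Dict[str, Any] = {}
    for key, value in record.items():
        convert = schema.get(key)
        if convert is None:
            out[key] = value  # unknown field: keep the raw string
            continue
        try:
            out[key] = convert(value)
        except (TypeError, ValueError) as exc:
            # Fail loudly so a surprising value doesn't slip through as a string.
            raise ValueError(f"field {key!r}: cannot convert {value!r}") from exc
    return out


# Example record shaped like an API result (values are made up).
raw = {"itemid": "23296", "hostid": "10084", "status": "0", "name": "CPU load"}
item = coerce_record(raw, ITEM_SCHEMA)
```

Keeping the schema separate from the conversion loop means validation rules live in one place, at the cost of having to maintain that mapping by hand.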
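
The "additional external code" for maintenance cleanup can stay fairly small. This is a rough sketch, assuming the records come from the API's `maintenance.get` method (with `maintenanceid` and `active_till` returned as strings) and that deletion goes through `maintenance.delete`. The helper names are hypothetical, passing the token in an `auth` field is an assumption about the API version in use, and nothing here actually talks to a server.

```python
import json
import time
from typing import Dict, List, Optional


def expired_maintenance_ids(maintenances: List[Dict[str, str]],
                            now: Optional[int] = None) -> List[str]:
    """Pick IDs of maintenance objects whose active_till is in the past."""
    if now is None:
        now = int(time.time())
    return [m["maintenanceid"] for m in maintenances
            if int(m["active_till"]) < now]


def delete_request(ids: List[str], auth_token: str, request_id: int = 1) -> str:
    """Build (but do not send) a JSON-RPC body for maintenance.delete."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "maintenance.delete",
        "params": ids,
        "auth": auth_token,  # assumption: token is passed in the request body
        "id": request_id,
    })


# Made-up records shaped like maintenance.get output.
sample = [
    {"maintenanceid": "3", "active_till": "1500000000"},
    {"maintenanceid": "7", "active_till": "4102444800"},  # far in the future
]
stale = expired_maintenance_ids(sample, now=1600000000)
body = delete_request(stale, auth_token="PLACEHOLDER")
```

Run from cron, something like this would keep expired maintenance objects from piling up and blocking host deletes.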
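
With no real pagination, the usual workaround for large `item.get` calls is to split the request yourself. This is a sketch, assuming you already have the full list of host IDs and query one batch of hosts at a time through the `hostids` parameter. `chunked` and `item_get_requests` are hypothetical helpers, and the batch size is a guess to tune per install.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def item_get_requests(host_ids: Sequence[str],
                      batch_size: int = 100) -> List[Dict[str, Any]]:
    """Build one item.get JSON-RPC request per batch of hosts.

    Authentication is left out on purpose; add it however your
    Zabbix version expects it.
    """
    return [
        {
            "jsonrpc": "2.0",
            "method": "item.get",
            "params": {"hostids": batch, "output": ["itemid", "key_"]},
            "id": n,
        }
        for n, batch in enumerate(chunked(host_ids, batch_size), start=1)
    ]


# 250 made-up host IDs split into batches of 100.
requests_to_send = item_get_requests([str(i) for i in range(10001, 10251)],
                                     batch_size=100)
```

Each request stays small enough to finish, and a failed batch can be retried on its own instead of re-running the whole query.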
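
To make the trends complaint concrete, here is a made-up hour of per-minute samples with one spike, collapsed into a single number. The values are invented for illustration. This shows what averaging does to a spike, not how Zabbix implements trends internally.

```python
from statistics import mean

# 59 quiet minutes and one spike, e.g. requests per minute (made-up numbers).
samples = [10] * 59 + [600]

hourly_avg = mean(samples)  # what an average-based bucket keeps
hourly_sum = sum(samples)   # what a sum-based bucket would keep

print(f"avg={hourly_avg:.2f} sum={hourly_sum} peak={max(samples)}")
# prints: avg=19.83 sum=1190 peak=600
```

The average says "about 20" for an hour that included a 600, while the sum at least preserves the total volume.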