Learning About OpenTSDB

At $JOB, we needed a highly scalable metric storage system, and we had a nice little Hadoop cluster I’d set up that could handle some IOPS. So we decided to test OpenTSDB and Druid.

Druid was certainly appealing:

  • They claimed OLAP powers
  • Ambari had a built-in install
  • There was a Grafana plugin

The problem with Druid was simple: we couldn’t successfully query data once we got it in. Neither the Grafana plugin nor the CLI tools could produce readable data, even after weeks of effort.

Given our problems with Druid (and the fact that OpenTSDB did not exhibit these problems), choosing our TSDB was easy.

So…

OpenTSDB.

OpenTSDB is a time-series database (TSDB) that uses Apache HBase for storage. Apache HBase is a distributed NoSQL datastore built on Hadoop and based on Google’s Bigtable paper.
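
For context, the data model is simple: a metric name, a timestamp, a numeric value, and a set of key=value tags. A write over the classic telnet-style API looks like this (the metric, value, and tags are only illustrative):

  put sys.cpu.user 1356998400 42.5 host=web01 cpu=0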

OpenTSDB is actually one of the older TSDBs out there, which means a lot of outdated documentation is still floating around. You’ll find recommendations on how to use/tune OpenTSDB that are six-plus years old, along with companion suggestions for tuning HBase from the same era. Given that age (and the significant changes HBase saw with the 2.0 release), a number of these recommendations are no longer accurate. So I’m going to document the setup we landed on, which has given us fairly solid performance with OpenTSDB.

OpenTSDB config:

tsd.network.async_io = true
tsd.network.reuse_address = true
tsd.network.tcp_no_delay = true
tsd.http.query.allow_delete = true
tsd.http.request.cors_domains = *
tsd.http.request.enable_chunked = true
tsd.http.request.max_chunk = 20097152
tsd.query.filter.expansion_limit = 16184
tsd.query.skip_unresolved_tagvs = true
tsd.core.auto_create_metrics = true
tsd.core.meta.enable_realtime_ts = false
tsd.core.meta.enable_realtime_uid = false
tsd.core.meta.enable_tsuid_incrementing = false
tsd.core.meta.enable_tsuid_tracking = false
tsd.core.tree.enable_processing = false
tsd.core.uid.random_metrics = false
tsd.storage.enable_appends = false
tsd.storage.enable_compaction = false
tsd.storage.fix_duplicates = true
tsd.storage.flush_interval = 10000
tsd.storage.hbase.prefetch_meta = true
tsd.storage.hbase.scanner.maxNumRows = 512
tsd.storage.max_tags = 12
tsd.storage.repair_appends = true
tsd.storage.salt.width = 1
tsd.storage.uid.width.metric = 4
tsd.storage.uid.width.tagk = 4
tsd.storage.uid.width.tagv = 4
tsd.storage.use_otsdb_timestamp = false
tsd.uid.lru.enable = true

Excluding a few basic configs (‘cause nobody needs to see that I picked the default port, or what my ZK cluster looks like), we’ve had good luck with these settings on OpenTSDB 2.4. Some important notes:

  1. The network settings are all defaults, but please keep async I/O, address reuse, and TCP no-delay enabled. Your servers will thank you.
  2. Disable tree processing, meta table creation (tsd.core.meta.*), and compaction in the main TSD. All of these can be handled by separate processes, letting the primary TSDs just run ingest and queries. Much faster. (A sketch of the out-of-band approach follows this list.)
  3. Ensure you enable salting (tsd.storage.salt.width). It really improves distribution of metrics in the data table.
  4. Increasing the max requested number of rows from HBase (tsd.storage.hbase.scanner.maxNumRows) means larger chunks come back from HBase at a time. This primarily helps with larger queries.
  5. Appends cause a noticeable CPU spike. It’s probably best to avoid them (even if they do improve storage usage). Supposedly, OpenTSDB 3.0 will have a new append strategy that avoids this CPU overhead. Something to revisit then…
  6. Date Tiered Compaction in HBase sounds nice, but it is broken in many versions of HBase (including 2.0.2, which is what Hortonworks currently ships). Don’t enable it or your data table will get sad very quickly. (Essentially, when regions split, the cleanup fails rather catastrophically… The problem has since been patched upstream; I can’t find the relevant HBase JIRA ticket, but 2.0.2 definitely exhibits it.)
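
To make note 2 a little more concrete: with the real-time meta settings off, the meta tables can still be kept in sync out of band on whatever schedule suits you. A rough sketch of doing that from cron with OpenTSDB’s own CLI (the path, user, and schedule are placeholders, not our exact setup; it reads the default opentsdb.conf):

  # /etc/cron.d/opentsdb-metasync (sketch)
  # Rebuild missing UIDMeta/TSMeta entries nightly instead of making the
  # ingest TSDs maintain them in real time.
  0 2 * * *  opentsdb  /usr/share/opentsdb/bin/tsdb uid metasync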

This guide has some solid recommendations (including the aforementioned Date Tiered Compaction), but a few thoughts with HBase 2.0 in mind:

  1. HBase cache settings are no longer only cluster-wide; many of them can be set on individual tables. I’ll cover that more below.
  2. The memstore suggestions are a little… aggressive. I’ll cover that more below.

HBase config

Largely leaving default configs (if you’re using Ambari/Cloudera Manager) is safe, but some suggestions:

  1. Use a smaller-than-default memstore flush size (Ambari’s default of 128 MB is a bit high). We’re using 64 MB, and the smaller flushes do help. The OpenTSDB devs recommend 16 MB, which Ambari won’t allow you to set.
  2. Ensure hbase.bucketcache.size is at least 4 GB (4096 MB). This allows for better caching of data as we write.
  3. The data table should have caching enabled on write: alter 'tsdb', {NAME=>'t', CACHE_BLOOMS_ON_WRITE=>true, CACHE_INDEX_ON_WRITE=>true, CACHE_BLOCKS_ON_WRITE=>true} (or something close; that’s a good start, and a fuller sketch follows this list).
  4. Absolutely use LZO compression (if you can enable it); it competes fine with Snappy and gives a nice reduction in storage.
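
To put rough numbers on the first two suggestions, here’s a sketch in hbase-site.xml terms. The offheap IO engine is an assumption; size the RegionServer offheap allocation to match, and treat the values as starting points rather than our exact production settings.

  hbase.hregion.memstore.flush.size = 67108864    (64 MB, in bytes)
  hbase.bucketcache.ioengine = offheap
  hbase.bucketcache.size = 4096                   (MB)

And in the hbase shell, cache-on-write plus LZO on the data column family:

  alter 'tsdb', {NAME => 't',
    CACHE_BLOOMS_ON_WRITE => true,
    CACHE_INDEX_ON_WRITE => true,
    CACHE_BLOCKS_ON_WRITE => true,
    COMPRESSION => 'LZO'}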

MOST IMPORTANT

Pre-splitting the data table

All of these tweaks help, but the most important thing you can do to support a properly functioning OpenTSDB cluster is to pre-split your HBase regions. The OpenTSDB docs recommend this, but don’t go into much detail on how to get a successful pre-split. What I found most effective was to set up the table (with salting enabled) and start ingesting some data. Then, as the regions begin to grow (a shell sketch of the loop follows these steps):

  1. Execute a full table split (echo "split 'tsdb'" | sudo -u hbase hbase shell).
  2. Wait for some more data to populate and for the split regions to properly close.
  3. Repeat.
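
In shell terms, the loop is roughly this (assuming the data table is named 'tsdb', the shell runs as the hbase user, and the wait is however long it takes for splits to finish and fresh data to land):

  # Ask HBase to split every region of the data table, then give
  # ingest time to fill the children before splitting again.
  while true; do
    echo "split 'tsdb'" | sudo -u hbase hbase shell
    sleep 1800
  done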

If you are confident you can do a proper HBase pre-split (calculating split points, etc.) against OpenTSDB’s hashed row keys, more power to you. Given how much we intend this database to grow, I don’t know where the right split points will end up, so using at least some actual data made my splits fairly accurate. We aren’t seeing much hot-spotting. Queries are significantly faster. General cluster load is slightly lower. A pre-split table also has better success compacting regions, which brings us to…
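
For the record, a computed pre-split isn’t terribly complicated once salting is on: with tsd.storage.salt.width = 1 and the default of 20 salt buckets, the first row-key byte is just the bucket number, so splitting on those byte boundaries is a reasonable starting point. A sketch (not the create statement we actually ran):

  create 'tsdb',
    {NAME => 't', VERSIONS => 1, BLOOMFILTER => 'ROW'},
    SPLITS => (1..19).map { |i| i.chr }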

HBase compaction

Because of the significant changes in HBase 2.0, ensuring region compaction can complete is vital. With hbck somewhat neutered by the upgrade (and documentation around the new hbck 2.0 being rather hand-wavy at present), the compaction queue is your best friend in ensuring the health of your tables.

Compaction:

  1. cleans up closed regions
  2. helps finish region splits
  3. ensures data locality, etc.

If your compaction queue is backing up, you’re going to have a bad time.
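
One thing worth watching continuously is the queue length itself. Each RegionServer exposes it over the JMX servlet, so a spot-check looks something like this (the hostname and the default info port of 16030 are assumptions about your layout):

  curl -s 'http://regionserver01:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Server' \
    | grep -i compactionQueueLength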

The strangest part is, when there are regions marked “other” in HBase’s Master UI, interacting with a table slows down significantly. alter statements time out, loading the table detail page in the UI takes significantly longer, and general table performance drops. The only way I’ve found in HBase 2.0 to truly remove “other” regions (which appear to be regions marked CLOSED that haven’t been actually closed yet) is letting the compaction queue run. Even disabling and re-enabling the table will not remove these regions. It can take a noticeable amount of time to clean up regions like this. Even nominally healthy clusters (nicely pre-split, current compaction queues staying around 90%+ complete, etc.) can take 30+ minutes to remove 1 CLOSED region properly.

The problem here is that there isn’t much we can do to hurry the compaction queues along. We can run the catalog janitor (which cleans up the parents of split regions in hbase:meta) and the cleaner chore (which cleans up archived HFiles and old WALs) on our cluster, and that can address some issues, but generally we just have to be patient. Master failover (like in previous versions) can also help, but there is no obvious “silver bullet” for compaction problems.
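
Both nudges (and the compaction check itself) are plain hbase shell commands; the table name 'tsdb' here is ours:

  # Is the data table currently compacting? (NONE, MINOR, MAJOR, ...)
  echo "compaction_state 'tsdb'" | sudo -u hbase hbase shell

  # Ask the master to clean up split parents in hbase:meta
  echo "catalogjanitor_run" | sudo -u hbase hbase shell

  # Kick the cleaner chore
  echo "cleaner_chore_run" | sudo -u hbase hbase shell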

It is possible this is an artifact of a bad config on my part that I haven’t discovered. Or perhaps it has been addressed in >=2.1, but as it stands, ensuring compaction success seems to largely rely on heavy planning for how your table will be used. Pre-splitting is the best tool to ensure this.

Updates

Since I originally wrote this, we have seen a few improvements:

  1. We’ve installed https://github.com/G-Research/opentsdb-tsuid-ratelimiter into our TSDs, which gave us a functioning meta table. You can read more about tsuid-ratelimiter here: https://www.gresearch.co.uk/article/opentsdb-meta-cache-trade-offs-for-performance/. Because of how this plugin works, you should try to pre-split your meta table as well. It doesn’t need to be nearly as split as the raw data table, but it definitely helps handle the initial write influx.
  2. We’ve enabled rollup tables to help offset larger queries: http://opentsdb.net/docs/build/html/user_guide/rollups.html. To handle the amount of ingest we’re doing (~750kpps), we had to write our own rollup tool, which we’re currently working on open-sourcing. (A sketch of the rollup config follows this list.)
  • Rollups are critical as your OpenTSDB installation grows. Having separate tables at reasonable intervals allows Graphite-like TTLs (different levels expire at different times) and the ability to query coarser-grained data for longer time ranges, which really helps tools like Grafana.
  3. We’ve had to patch some functionality into the Grafana OpenTSDB plugin. We haven’t had time to “upgrade the backend” as the Grafana project requested for these changes to be properly merged (https://github.com/grafana/grafana/pull/15271#pullrequestreview-200578962).
  4. We’re also quite excited for OpenTSDB 3.0, which promises a caching layer (huge win for frequent queries), Elasticsearch-backed meta information (and generally improved metadata creation/management), and overall code/performance improvements.
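
For anyone setting rollups up fresh: in 2.4 they’re driven by two TSD settings and a JSON file describing the rollup tables and intervals. A minimal sketch, with table names and intervals that are examples rather than our production layout:

  tsd.rollups.enable = true
  tsd.rollups.config = /etc/opentsdb/rollup_config.json

  # rollup_config.json (sketch)
  {
    "aggregationIds": { "sum": 0, "count": 1, "min": 2, "max": 3 },
    "intervals": [
      { "table": "tsdb", "preAggregationTable": "tsdb-preagg",
        "interval": "1m", "rowSpan": "1h", "defaultInterval": true },
      { "table": "tsdb-rollup-1h", "preAggregationTable": "tsdb-preagg-1h",
        "interval": "1h", "rowSpan": "1d" }
    ]
  }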