Hive's metadata store is brittle and it should feel bad

We recently had an outage at $JOB. Outages happen, and I’m not upset that we had one; I am upset with how Hive handled it.

Hive’s Metastore

Hive is composed of two primary services: the hiveserver2 process and the metastore process. The hiveserver2 process handles the actual interactions with HDFS and with users, while the metastore process keeps track of what data exists for which tables in Hive.

The metastore process requires a backing DB for tracking all of this meta information. At $JOB, we currently use a MySQL instance for that. Specifically, we use a hot/cold failover pair of MySQL instances behind a floating IP. This should mean that if an instance fails, the IP fails over to the other instance and we have at most a momentary interruption of service.

What happened

During the outage, we lost our active MySQL instance. That part is fine: the failover occurred as expected. We had to restart some streaming services writing to Hive (those services are relatively brittle anyway, so this was no surprise), but otherwise everything kept working.

Things got weird when the failed node rejoined. It isn’t clear whether the rejoin triggered another failover, but somehow Hive wrote the wrong data into the DB, and suddenly no reads or writes could happen.

What happened (more details)

Hive’s metastore database primarily tracks partition locations and compaction work for Hive’s tables. Importantly, it also records every change that occurs in the DB so that any federated or backup Hive clusters can receive the data. This list of changes lives in the NOTIFICATION_LOG table, and each entry is keyed by a column called NL_ID. While NL_ID should probably be an auto-increment column, it is instead assigned by tracking a counter in the SEQUENCE_TABLE. Somehow, that counter got decremented, which meant that any attempt to write to the NOTIFICATION_LOG table was met with the dreaded:

2021-04-08T02:27:16,348 ERROR [pool-6-thread-145]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:invokeInternal(201)) - MetaException(message:Unable to execute direct SQL java.sql.SQLIntegrityConstraintViolationException: Duplicate entry '48191281' for key 'PRIMARY'

This failure even occurred when attempting to describe a table within Hive’s CLI!
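
To make the failure mode concrete: every change event the metastore records goes through a read-a-counter, then write-a-row dance. The sketch below is my plain-SQL approximation of that dance, not the literal statements Hive runs (the real work happens inside the metastore, via the direct SQL path named in the error above), and the NOTIFICATION_LOG columns are abbreviated.

    -- 1. Read the next ID to hand out. The sequence row name below is the one
    --    used for notification events; verify it against your own SEQUENCE_TABLE.
    SELECT NEXT_VAL
      FROM SEQUENCE_TABLE
     WHERE SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MNotificationLog'
       FOR UPDATE;

    -- 2. Advance the counter for the next writer.
    UPDATE SEQUENCE_TABLE
       SET NEXT_VAL = NEXT_VAL + 1
     WHERE SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MNotificationLog';

    -- 3. Record the change event under the ID read in step 1 (columns abbreviated,
    --    values invented for illustration).
    INSERT INTO NOTIFICATION_LOG (NL_ID, EVENT_ID, EVENT_TIME, EVENT_TYPE, MESSAGE)
    VALUES (48191281, 48191281, UNIX_TIMESTAMP(), 'ADD_PARTITION', '...');

If step 1 ever hands out a value at or below the largest NL_ID already in the table, step 3 dies with the duplicate-key error above, and it keeps dying on every subsequent write because nothing ever pushes the counter back past the existing rows. That is exactly the state we were in.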

Fixing the problem was (fortunately) as simple as correctly incrementing the value in SEQUENCE_TABLE to be higher than the largest value for NL_ID in NOTIFICATION_LOG.
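
For anyone else who lands on that error, the repair amounted to something like the two statements below. Treat this as a sketch rather than a recipe: double-check the sequence row name and the current values against your own metastore before running anything.

    -- See how far the notification log actually got...
    SELECT MAX(NL_ID) FROM NOTIFICATION_LOG;

    -- ...and move the counter just past it so new events stop colliding with
    -- existing rows.
    UPDATE SEQUENCE_TABLE
       SET NEXT_VAL = (SELECT MAX(NL_ID) + 1 FROM NOTIFICATION_LOG)
     WHERE SEQUENCE_NAME = 'org.apache.hadoop.hive.metastore.model.MNotificationLog';

Once the counter is ahead of every existing NL_ID, inserts stop colliding and the metastore goes back to behaving itself.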

Why am I frustrated?

Generally speaking, Hadoop ecosystem tools are very robust. HDFS can handle a lot of failures before any noticeable problems occur. While HBase does have its problems, recovery is effective. But Hive tooling is…strange. Some problems:

  1. Hive 3.1 still executes MR-style jobs for compaction rather than submitting proper YARN jobs.
  2. Hive’s heavy use of the metadata DB means our MySQL instance (which we didn’t want to install in the first place) needs to run on relatively powerful hardware on a busy cluster. Some of our performance problems come down to undersized SQL instances struggling with the number of queries thrown at them.
  3. What kind of crazy race condition has to occur for a program to hand out a stale ID for a table’s primary key? The fact that this can happen to Hive is another argument for migrating away…