Repairing Overlapping Regions in HBase >= 2.0

When working with faulty hardware, an HBase cluster can crash in weird ways and leave the hbase:meta region reporting two different regions with the same start or end key. In HBase < 2.0, the hbck tool could fix these overlapping regions, but its repair options are disabled in HBase >= 2.0, so how do we fix the problem?

Bringing it up means I’ve experienced it, right? Anyway…

About a month ago, we started experiencing all kinds of stability and performance problems in our HBase cluster. Writes were randomly failing, splits weren’t working correctly, and things were generally in a very bad state. At first, I assumed the performance problems were due to data locality issues, but it turns out I had overlapping regions. I finally noticed this when running a read-only hbase hbck on the cluster (because, while the tool can no longer alter things, it is still a useful diagnostic tool).

https://github.com/nabhosal/HBaseRecoveryTools/blob/master/doc/multi-region-start-with-sameKey.md

The page linked above shows that we could write a small program to discover and repair overlaps. The problem is, I’m not much of a Java programmer, and testing on my production cluster didn’t feel like a good idea. So how can we discover and repair overlaps?

Discovery was, for us, fairly easy:

[Image: overlapping regions]

We can see above that the first two regions both have an empty start key. Since a table must have exactly one region with an empty start key, we have clearly found an overlap. And looking further down the list, the end key of 6481bfb215b0823fbafae0d16d930686 is the start key of a6495ee2ce61ffa8991db0957cc41376, so that may be the end of our overlap.
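The check described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not something I ran against the cluster: the `Region` type, the `find_overlaps` helper, and the region names and keys in `sample` are all made up for the example. It assumes keys compare as raw bytes and that an empty end key means "end of table".

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    name: str     # hypothetical region name, for illustration only
    start: bytes  # start key; b"" means "beginning of table"
    end: bytes    # end key; b"" means "end of table"


def find_overlaps(regions: List[Region]) -> List[Tuple[str, str]]:
    """Return pairs of region names whose key ranges intersect."""
    ordered = sorted(regions, key=lambda r: r.start)
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # b.start >= a.start because of the sort, so the two ranges
            # intersect exactly when b starts before a ends.
            if a.end == b"" or b.start < a.end:
                overlaps.append((a.name, b.name))
            else:
                break  # every later region starts even further right
    return overlaps


# Made-up layout: r1 and r2 both have an empty start key, and r2's
# end key lines up with r4's start key.
sample = [
    Region("r1", b"", b"m"),
    Region("r2", b"", b"t"),
    Region("r3", b"m", b"t"),
    Region("r4", b"t", b""),
]

print(find_overlaps(sample))
```

In this made-up layout, r2 is reported against both r1 and r3, which is the same shape as our problem: one region sitting on top of a run of otherwise well-behaved neighbours.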

So now to start merging regions. It seems an easy first step would be to merge the regions between 648... and a64.... It should be fairly quick and, since they all fit between our two overlap boundaries, shouldn’t cause any additional problems.

First, let’s try to merge 7df045b368434aeddf0e48b956bcc5aa with 19b0457aa365a273c2afac63c5542669. We can see that 7df... has the same end key as 19b...’s start key, so they should successfully merge. Unfortunately, 7df... has a “merge qualifier” set and can’t currently merge with other regions.

Alright, let’s try the other end of the overlap: we can try merging 6af... and a64.... The start/end keys line up, and given how small a64... is, merging the two should complete very quickly:

hbase(main):020:0> merge_region '6af21facff7811e9aa1fbf6ffd9b0e11','a6495ee2ce61ffa8991db0957cc41376'
Took 0.2428 seconds

A successful merge!

[Image: merged regions]

We can see that a single region now spans from the start key of the first region to the end key of the second.

With this change, the only region causing problems is 648.... And its end key doesn’t line up with any other existing start keys. Given that, it is essentially a broken region. Let’s just unassign it.

hbase(main):022:0> unassign '6481bfb215b0823fbafae0d16d930686'
Took 1.2911 seconds

And now an hbck run shows no more overlap. Huzzah!

(In hindsight, it likely would have been simpler to just close 648... in the first place. That would probably have solved the problem in one step, as we can see a consistent start/end key connection between all the other regions. But I suppose that’s what hindsight is for?)
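For what it’s worth, the "end key doesn’t line up with any start key" test that singled out 648... after the merge can also be sketched in Python. Again, this is a rough illustration under my own assumptions: `dangling_regions` is a hypothetical helper, regions are plain (name, start key, end key) tuples, and the names and keys are invented.

```python
from typing import List, Tuple

RegionTuple = Tuple[str, bytes, bytes]  # (name, start key, end key)


def dangling_regions(regions: List[RegionTuple]) -> List[str]:
    """Names of regions whose end key is not the start key of any region.

    An empty end key means "end of table", so it never counts as dangling.
    """
    starts = {start for _, start, _ in regions}
    return [name for name, _, end in regions
            if end != b"" and end not in starts]


# Made-up layout before the merge: every end key matches some start key.
before_merge = [
    ("r1", b"", b"m"),
    ("r2", b"", b"t"),   # the overlapping region
    ("r3", b"m", b"t"),
    ("r4", b"t", b""),
]

# After merging r3 and r4, nothing starts at b"t" any more.
after_merge = [
    ("r1", b"", b"m"),
    ("r2", b"", b"t"),
    ("r3+r4", b"m", b""),
]

print(dangling_regions(before_merge))
print(dangling_regions(after_merge))
```

Note that this check only flags the broken region once the merge has removed the start key it used to line up with. Before the merge, nothing is dangling, which is why the overlap check is the one that finds the problem in the first place.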