When working with faulty hardware, an HBase cluster can crash in weird ways, leaving the hbase:meta table reporting two different regions with the same start or end key. In HBase < 2.0, the hbck tool could fix these overlapping regions, but since that tool's repair options are disabled in HBase >= 2.0, how do we fix the problem?
Bringing it up means I’ve experienced it, right? Anyway…
About a month ago, we started experiencing all kinds of stability and performance problems in our HBase cluster. Writes were randomly failing, splits weren’t working correctly, and things were generally in a very bad state. At first, I assumed the performance problems were due to data locality issues, but it turned out I had overlapping regions. I finally noticed this when running a read-only hbase hbck on the cluster (because, while the tool can no longer alter things, it is still a useful diagnostic tool).
In the link above, we can see that we could write a small program to discover and repair overlaps. The problem is, I’m not much of a Java programmer, and testing on my production cluster didn’t feel like a good idea. So how can we discover and repair overlaps?
Discovery was, for us, fairly easy:
We can see above that the first two regions both have an empty start key. Since a table can have only one region with an empty start key, we have clearly found an overlap. And looking further down the list, the end key of 6481bfb215b0823fbafae0d16d930686 is the start key of a6495ee2ce61ffa8991db0957cc41376, so that may be the end of our overlap.
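The check described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool that talks to HBase: the region names and keys below are hypothetical stand-ins (not the ones from our cluster), and the function names are my own. It also assumes the keys are plain ASCII strings, so Python's string ordering matches HBase's byte ordering.

```python
from collections import defaultdict

# Hypothetical region list: (encoded_name, start_key, end_key).
# An empty string stands for the open start or open end of the table.
regions = [
    ("r1", "", "d"),
    ("r2", "", "b"),
    ("r3", "b", "d"),
    ("r4", "d", "m"),
    ("r5", "m", ""),
]


def duplicate_start_keys(regions):
    """Group region names by start key; keep keys claimed more than once."""
    by_start = defaultdict(list)
    for name, start, _end in regions:
        by_start[start].append(name)
    return {start: names for start, names in by_start.items() if len(names) > 1}


def overlapping_pairs(regions):
    """Return pairs of regions whose key ranges intersect."""
    def ends_after(end, key):
        # An empty end key means "end of table", which is after every key.
        return end == "" or end > key

    # Sorted by start key, so b never starts before a.
    ordered = sorted(regions, key=lambda r: r[1])
    pairs = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            # a and b overlap when a is still "open" at b's start key.
            if ends_after(end_a, start_b):
                pairs.append((name_a, name_b))
    return pairs


print(duplicate_start_keys(regions))
print(overlapping_pairs(regions))
```

In this made-up table, r1 and r2 both claim the empty start key, and r1 also covers the range that r3 serves, which is the same shape of problem as the one in our listing.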
So now to start merging regions. It seems an easy first step would be to merge the regions between a64.... It should be fairly quick and, since they all fit between our two overlap boundaries, shouldn’t cause any additional problems.
First, let’s try to merge 7df... and 19b0457aa365a273c2afac63c5542669. We can see that 7df...’s end key is the same as 19b...’s start key, so they should successfully merge. Unfortunately, 7df... has a “merge qualifier” set and can’t currently merge with other regions.
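The adjacency test used here (one region's end key equals the other's start key) is easy to express in code. The sketch below is a hypothetical helper, not part of HBase, and the tuples in it are made-up examples. It only looks at keys, so it cannot tell you about things like a merge qualifier, which you would still have to find out from HBase itself.

```python
def are_adjacent(first, second):
    """True if `first` ends exactly where `second` starts.

    Each region is a (name, start_key, end_key) tuple; an empty end key
    means "end of table", so nothing can follow such a region.
    """
    _name_a, _start_a, end_a = first
    _name_b, start_b, _end_b = second
    return end_a != "" and end_a == start_b


def merged_range(first, second):
    """Key range that merging two adjacent regions would cover."""
    if not are_adjacent(first, second):
        raise ValueError("regions are not adjacent")
    return (first[1], second[2])


# Hypothetical regions, named after nothing in particular.
left = ("left", "b", "d")
right = ("right", "d", "m")

print(are_adjacent(left, right))   # order matters: left must come first
print(merged_range(left, right))
```

The merged range runs from the first region's start key to the second region's end key, which is what we check for after a merge.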
Alright, let’s try the other end of the overlap: we can try merging 6af21facff7811e9aa1fbf6ffd9b0e11 and a64.... The start/end keys line up, and given how small a64... is, merging the two should complete very quickly:
hbase(main):020:0> merge_region '6af21facff7811e9aa1fbf6ffd9b0e11','a6495ee2ce61ffa8991db0957cc41376'
Took 0.2428 seconds
A successful merge!
We can see that the start key of the first region and the end key of the second region are now in one region.
With this change, the only region causing problems is 648.... Its end key no longer lines up with any existing start key, so it is essentially a broken region. Let’s just unassign it.
hbase(main):022:0> unassign '6481bfb215b0823fbafae0d16d930686'
Took 1.2911 seconds
And now an hbck run shows no more overlap. Huzzah!
(In hindsight, it likely would have been simpler to just close 648... in the first place. This likely would have solved the problem in one step, as we can see a consistent start/end key connection between all the other regions. But I suppose that’s what hindsight is for?)
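That hindsight check can also be sketched in code: for each region, ask whether the remaining regions form one unbroken chain from the empty start key to the empty end key. Again, this is a rough illustration with hypothetical names and keys, assuming ASCII keys; it reasons about key ranges only and does nothing to a live cluster.

```python
def is_clean_chain(regions):
    """True if the regions tile the key space with no gap or overlap.

    Each region is a (name, start_key, end_key) tuple. The chain must start
    at the empty start key, end at the empty end key, and every region must
    end exactly where the next one starts.
    """
    if not regions:
        return False
    ordered = sorted(regions, key=lambda r: r[1])
    if ordered[0][1] != "" or ordered[-1][2] != "":
        return False
    # Only the last region may have an empty (open) end key.
    return all(a[2] != "" and a[2] == b[1] for a, b in zip(ordered, ordered[1:]))


def removal_candidates(regions):
    """Names of regions whose removal alone leaves a clean chain."""
    return [
        region[0]
        for region in regions
        if is_clean_chain([other for other in regions if other is not region])
    ]


# Hypothetical table: "stray" overlaps r1 and ends at a key no region starts at.
regions = [
    ("stray", "", "c"),
    ("r1", "", "b"),
    ("r2", "b", "d"),
    ("r3", "d", ""),
]

print(is_clean_chain(regions))
print(removal_candidates(regions))
```

If exactly one region comes back as a candidate, that is a strong hint that it is the odd one out, which is what we eventually concluded about 648... by eye.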