When I started at $JOB, they wanted a distributed system for data storage and someone with some experience maintaining one. And they particularly wanted Hadoop. So okay. Hadoop it is. Let’s do some Hadoop, right?
To start, Hadoop is complicated to deploy/manage. It’s not a single program, but an inter-connected suite of programs (hdfs, namenode, journalnode, zkfailover controller, yarn nodemanager, yarn resource manager, mapreduce history server, yarn timeline server, hive metastore, hiveserver2, zookeeper…) that takes some amount of knowledge and planning to deploy well. This barrier to entry is why Ambari, Cloudera Manager, MapR, BigTop, EMR, and other tools exist. Due to licensing concerns at the time, I selected Ambari to manage our cluster.
Ambari works fairly well. It makes producing consistent config (one of the harder parts of managing a heterogeneous Hadoop cluster) relatively straight-forward. And it helped us set up a reasonable cluster right out of the gate.
That’s the end of the story, right? We had a stable cluster. It slowly grew as people needed it. And today we have a large, performant Hadoop cluster serving our company’s BI and distributed storage needs.
Naturally, there’s more to tell…
The primary use for our Hadoop cluster was as storage for our OpenTSDB instance. Along with using OpenTSDB, we were using Hive to store/query some high volume logging data that our ElasticSearch cluster couldn’t keep up with indexing and snapshots of some databases to give cleaner access for BI use cases. All of these needs are met by our Hadoop cluster fairly well. So again, what’s the point of this article?
The amount of customization, unique code, and frustrating APIs that Hadoop requires/has is TOO DAMN HIGH! An example or two would help…
Example 1: Log ingestion
Ingesting log data into Hive requires a lot of work. And currently, a lot of custom code. Because Hive requires a schema for each table, we need to normalize/tweak data as it is ingested. We were initially writing data to HDFS and then using batch jobs to write to Hive, but ran into problems with speed and inconsistency (the batch jobs could not be successfully replayed). This lead us to streaming data directly into Hive.
Streaming worked great! Unfortunately, to get it truly functional required custom code that the team is now struggling to support. It also created some reduced functionality: we could no longer fully parse the log messages and instead relied on regexes/etc. to parse statements.
So our ingest pattern for logs into Hive needed work. This is solvable, right?
Example 2: HBase and OpenTSDB
As much as I appreciate HBase’s power/functionality, we could never truly land on a install that felt easy to maintain. Because regions only live on one node at a time, any failover seems to require draining the regionserver rather than simply stopping the service and letting another copy of the data handle requests (a la Kafka/Elasticsearch/etc.).
As for OpenTSDB, there are hard limits on the number of series that can be created in an OpenTSDB instance. While this limit is large (~4.5 billion in our case), with more rapid deployments, we have a good chance of coming close to or reaching this limit, which would require a complete re-deployment of our system.
Also caching. OpenTSDB has no built-in concept of a cache. While HBase can do some amount of caching (bloom filters/cache on write/etc.), proper query-based caching doesn’t exist. This means that all queries into OpenTSDB are essentially cold, which means performance frequently suffers. There is a caching solution for OpenTSDB (https://github.com/turn/splicer), but the project is abandoned, which leads me to:
OpenTSDB itself seems to be slowly dying. Releases aren’t happening, bugs go un-fixed, MRs unaccepted. The only people you see metion OpenTSDB when talking architecture say “we used to run OpenTSDB…”
So you probably need to dump OpenTSDB? That still doesn’t mean hadoop goes away, right?
Example 3: Hortonworks and Cloudera merger
Ambari is an open source product, but the underlying repositories it uses were maintained by Hortonworks. Hortonworks recently merged with Cloudera. In the process, Ambari’s licensing and upgrade path became very confusing. We’re now in a position where it isn’t clear we can legally use newer versions of Ambari without purchasing a license. Which means if we want to properly maintain our Hadoop cluster, we need to remove Ambari from the equation.
Cloudera has moved all distribution binaries into a private repo that requires a paid ($10k+ per server?!) Cloudera account to access. This makes it impossible for us to continue using any Cloudera/Ambari binaries without an onerous support cost. This change was announced last fall, but we (unfortunately) didn’t see the blog post about it. This change has brought discussions around what to do with Hadoop front-and-center. And the prognosis isn’t good.
The Point (Part 2)
So we have:
- complex custom code (Hive streaming/DB snapshots)
- brittle/dying tools (HBase/OpenTSDB)
- large additional maintenance work immenent (Ambari removal/manual cluster management).
These three problems combined suggest we should start evaluating alternatives and migration paths.
So are there other choices?
In the last 5 years (or so) talk has grown about “data lakes”. The concept isn’t really new, but the execution (and general conceptual definitions) has become more codified. There’s two primary setups you’ll see mentioned in articles about data lakes:
- HDFS storage with Spark for proper querying and hand-waving around the ad-hoc query UI and the ingest tooling.
- S3 storage with some hand-waving around the proper query API, the ad-hoc query UI, and the ingest tooling.
Both patterns recommend columnar storage (Parquet comes up more often, but ORC does show up occasionally) and you’ll often see Presto/Trino, Apache Arrow, and other tools like it mentioned.
But do any of these systems work? And in particular, can we build an environment using these tools that would simplify our current environment? The unfortunate thing is, I can’t honestly answer that right now. We’ve done some initial testing using MinIO as a data source, but have struggled to get data into our “lake” in a consistent format.
So can we actually address this problem? Can we find a good way to store high-volume data effeciently that still allows for reasonable query and aggregation functionalities? I think so. I think spending more time with tools that can successfully “stream” data into Parquet/ORC files on an S3-compatible storage system will give us a better solution. I think the work that Trino and the Apache Arrow folks are doing is making serious progress towards a new system for our “data lake”.
Rivers into Lakes…
Currently, the most challenging portion of the pattern is ingest. While there are plenty of tools that claim to be great at ingesting large volumes of data (Apache Flink, Apache Storm, Spark, etc.) Nearly all of them are essentially programming frameworks in Java. For those of us who aren’t so great with Java, ingest options are far more limited. Both Kafka Connect and Apache Nifi look promising. The challenge is both use an internal data framework that requires schemas (Avro for Kafka Connect and some type of internal schema management in Nifi).
But that’s only for streaming ingest. For batch ingest, the situation is much brighter! Apache Arrow and other libraries make producing good, searchable data in a data “lake” far simpler, even for the less programming inclinded population (like me!). Taking an Airflow job that currently collects and writes a batch of data to Hive and making it instead write a Parquet file on MinIO will require some dev work, but not so much that it is prohibative.
Nothing’s set in stone. We may or may not pursue this. Perhaps we just have all our “lake-like” data written to Snowflake instead? Maybe we do all our BI work in a single MySQL instance? (It certainly can be done that way…) Where ever we land, it likely won’t be with more Hadoop in our lives, and my on-call rotations will be infinitely quieter for it.