Containerized BigData in Mesos and Kubernetes

Containers are probably the hottest technology that companies are beginning to adopt right now. Almost everyone who learns about containers and the immutable-server concept behind them comes to love them because of the power they give, in both development and deployment.

As containers become more popular each day, more technology is being developed to help unleash all the power they can offer. On the deployment side, one of the most useful technologies built around containers is the orchestrator.

A container orchestrator is a piece of software that schedules and organizes containers across any number of physical or virtual nodes. The potential is enormous, as it greatly simplifies both the deployment of the infrastructure (each node only needs Docker installed) and its operation (as all servers are exactly alike). The orchestrator takes into account events such as nodes failing, being added to the cluster or being removed from it, and moves containers from one node to another to keep them available at all times.
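To make the rescheduling idea concrete, here is a minimal Python sketch of what an orchestrator does when a node fails. It is a toy model, not how Mesos or Kubernetes are actually implemented: the function names, the node and container names, and the "least loaded node first" placement rule are all assumptions made for illustration.

```python
def schedule(containers, nodes):
    """Place each container on the node currently running the fewest containers."""
    placement = {}
    load = {node: 0 for node in nodes}
    for name in containers:
        # Ties are broken alphabetically so the result is deterministic.
        target = min(sorted(load), key=load.get)
        placement[name] = target
        load[target] += 1
    return placement


def handle_node_failure(placement, failed_node, healthy_nodes):
    """Return a new placement where containers from the failed node are moved."""
    if not healthy_nodes:
        raise RuntimeError("no healthy nodes left to reschedule onto")
    # Count what is already running on the surviving nodes.
    load = {node: 0 for node in healthy_nodes}
    for node in placement.values():
        if node in load:
            load[node] += 1
    new_placement = dict(placement)
    for name, node in sorted(placement.items()):
        if node == failed_node:
            target = min(sorted(load), key=load.get)
            new_placement[name] = target
            load[target] += 1
    return new_placement


# Hypothetical cluster: two nodes, four containers.
placement = schedule(["web-1", "web-2", "kafka-1", "kafka-2"], ["node-a", "node-b"])
recovered = handle_node_failure(placement, "node-a", ["node-b"])
```

A real orchestrator also weighs CPU, memory, affinity rules and health checks, but the core loop is the same: detect the change in the cluster and recompute where each container should live.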

This kind of technology not only solves many infrastructure management problems, but also many of the problems of deploying software onto that infrastructure: the software, its dependencies and its runtime are always deployed together, which minimizes the errors commonly associated with deployments in non-immutable environments.

As many of the advantages of containers and orchestrators lie in the distributed computing domain, they look very interesting for the BigData world, which relies on enormous amounts of distributed computing power.

However, BigData tools have had some trouble thriving in the container world right away, for one reason: they predate the container fever, so they provide their own native clustering behaviour.

As BigData tools such as Hadoop have their own clustering systems (such as YARN), making them delegate their clustering functions to an external tool such as a container orchestrator is not easy. Besides that, most of the major BigData tools (Hadoop, Kafka, Spark and the like) are stateful applications, which complicates even more their deployment on container orchestrators, which excel with stateless, non-natively-distributed applications.

However, the advantages containerization offers BigData tools are too great to ignore. Even if the only purpose is to securely share hardware between different applications, it is worth the effort of containerizing this kind of application and running it on a container orchestrator.

As such, there are many efforts under way to adapt these tools to run on the two biggest container orchestrators: Mesos and Kubernetes.

Apache Mesos is much more than a container orchestrator: it is actually a distributed computing library and execution runtime that abstracts away the complexities inherent in distributed programming. It is very well known for managing Twitter’s infrastructure.

Mesos entered the BigData world because Spark supports it: Spark is what Mesos calls a framework, an application built for Mesos that can use all of its advantages without the need for containerization.

Since then, and taking into account that Mesos is currently second in the container orchestrators’ race for market leadership, it has focused on supporting BigData Java/Scala software from the bottom up. Mesosphere, the company commercializing the open source software, is devoting a lot of effort to converting existing, well-known open source tools such as Hadoop, Kafka or Cassandra into Mesos frameworks so that they run natively on a Mesos cluster.

Given the clustering and statefulness requirements, Mesosphere has a point: building native Mesos frameworks seems more straightforward than trying to squeeze the BigData tools into Docker containers.

Of course, Mesos can also run containerized workloads with the help of frameworks devoted to container orchestration, such as Marathon, so the same infrastructure can be shared by the BigData tools, other containerized applications and our own in-house applications or services.
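As a rough illustration of what running a containerized workload through Marathon looks like, the Python sketch below builds a minimal app definition of the kind Marathon accepts as JSON. The helper name, the app id, the image and the resource figures are hypothetical, and a real definition would normally also carry networking and health check settings that are left out here.

```python
import json


def build_marathon_app(app_id, image, instances, cpus, mem_mb):
    """Build a minimal Marathon app definition for a Docker container."""
    return {
        "id": app_id,            # path-like application identifier
        "instances": instances,  # how many copies Marathon keeps running
        "cpus": cpus,            # CPU shares requested per instance
        "mem": mem_mb,           # memory in MB requested per instance
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


# Hypothetical in-house service sharing the cluster with the BigData tools.
app = build_marathon_app("/inhouse/web", "example/web:1.0", 2, 0.5, 256)
app_json = json.dumps(app, indent=2)
```

The resulting JSON would then be submitted to Marathon's REST API, and Marathon would ask Mesos for the resources needed to keep two instances running.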

Kubernetes, on the other hand, has only recently begun to turn its attention to the BigData world. Kubernetes is currently the container orchestration leader and focuses only on running containerized workloads. It does not have the native application execution capabilities of Mesos: it is just a container orchestrator.

Currently, most of the effort to support BigData tools on Kubernetes is being made by the community, using an object called StatefulSet that first appeared in Kubernetes version 1.5. This object is designed to deploy stateful apps, normally legacy apps, so that they can benefit from some of the features Kubernetes offers to applications.

By deploying BigData tools in StatefulSets, they can benefit from HA (if the node running a container fails, the scheduler will quickly move the container to another node) and hardware sharing, which are no small features.
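For readers who have not seen one, the Python sketch below builds a minimal StatefulSet manifest and serializes it to JSON. It is only an outline under several assumptions: the helper name, the image, the replica count and the storage size are hypothetical, and it uses the `apps/v1` API group found in later Kubernetes releases rather than the beta API that shipped with version 1.5.

```python
import json


def build_statefulset(name, image, replicas, storage):
    """Build a minimal StatefulSet manifest with one persistent volume per replica."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",  # assumption: newer clusters; 1.5 used a beta API group
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service giving each pod a stable DNS name
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                        }
                    ]
                },
            },
            # Each replica gets its own claim, so its data follows it across nodes.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


# Hypothetical three-broker Kafka deployment.
manifest = build_statefulset("kafka", "example/kafka:latest", 3, "10Gi")
manifest_json = json.dumps(manifest, indent=2)
```

The stable pod names and the per-replica volume claims are what make this object suitable for stateful tools: when a pod is rescheduled to another node, it keeps its identity and reattaches to the same storage. The headless Service named in `serviceName` has to be created separately.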

In conclusion, current BigData tools can benefit from being deployed on container orchestrators, but we would benefit greatly from the appearance of new, natively containerized BigData tools, a very interesting market that I am sure we will see in the months and years to come.

Posted by Juan Larriba, DevOps Engineer at everis.