OpenStack is a cloud operating system used to provision VMs, while OpenStack Sahara is an add-on component that lets administrators deploy Spark and Hadoop on top of those VMs. In other words, you can use OpenStack Sahara as a central spot from which to build out a distributed Hadoop and Spark architecture for big data analytics.
Sahara itself has plug-ins for different vendor distributions of Hadoop and Spark:
- Vanilla: Apache Hadoop;
- Ambari: Hortonworks Hadoop;
- Spark: Apache Spark with Cloudera HDFS, i.e., Apache Spark on top of Cloudera Hadoop;
- MapR: MapR plug-in and MapR File System, i.e., a quasi-front end for both Hadoop and Spark; and
- Cloudera: Cloudera Hadoop distribution.
Technically, you don't need Hadoop to run Spark. But Spark is designed to process data spread across a distributed architecture and has no storage mechanism of its own, so Hadoop's HDFS is a natural fit for holding that data.
In the architecture, Sahara runs on OpenStack controller nodes, and the Hadoop cluster runs on OpenStack Compute nodes.
Of course, there are other ways to deploy Hadoop: you can run it in Docker containers or install it manually on virtual or physical machines. Tools such as Ansible or Puppet make this easier. There are also various vendor options and vendor-assisted tools, such as those from Cloudera and MapR. Plus, you can run Hadoop in the cloud with different cloud vendors.
Using OpenStack Sahara provides a central point from which to deploy and launch Hadoop and assign a Hadoop role to each VM. And, because Sahara is an open source product that is not tied to any vendor, you get support from OpenStack contributors like Red Hat, Canonical (Ubuntu), SUSE, HP, Workday, SAP, Intel and others.
You can install OpenStack on a single machine to test it before making any commitments. There are several ways to do this. You can use Packstack, the installer from the RDO project, for RHEL or CentOS. Or you can use DevStack for Fedora, Ubuntu and CentOS. You can also use Mirantis Fuel on Ubuntu.
The first step is to upload VM images to OpenStack Glance. You can use the Horizon dashboard or the Glance command line to do that. For the VM, you need an image that has cloud-init available. Cloud-init facilitates deployment to the cloud by generating secure shell keys, setting the default locale and setting a hostname.
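To give a feel for what cloud-init handles at first boot, the Python sketch below assembles a small `#cloud-config` user-data document. The `hostname`, `locale` and `ssh_genkeytypes` keys are standard cloud-config options, but the host name, locale value and the `build_user_data` helper are placeholders invented for this illustration, so check the cloud-init documentation for the image you actually use.

```python
def build_user_data(hostname: str, locale: str = "en_US.UTF-8") -> str:
    """Return a minimal #cloud-config document as text.

    The values passed in are illustrative placeholders, not settings
    taken from any particular OpenStack or Sahara deployment.
    """
    lines = [
        "#cloud-config",                    # header cloud-init looks for
        f"hostname: {hostname}",            # set the instance's host name
        f"locale: {locale}",                # set the default locale
        "ssh_genkeytypes: [rsa, ed25519]",  # host key types to generate
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_user_data("hadoop-worker-1"))
```

Text like this would be supplied as user data when an instance boots; an image without cloud-init would simply ignore it.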
As with other OpenStack components, you can work with Sahara through the command-line interface or the Horizon dashboard. The dashboard is generally easier. Either way, you first have to install Sahara. Though it's a long process, OpenStack provides thorough instructions for doing so on its website.
The basic steps to deploy Hadoop start with deciding which Hadoop role you want each VM to serve; only then do you configure and deploy the VMs. A role could be any of the following:
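- Name node: Stores the metadata of the Hadoop Distributed File System (HDFS), such as which blocks make up each file, and, in Hadoop 1.x, typically runs JobTracker;
- Data node: Stores the HDFS data blocks and usually also runs jobs against them;
- Secondary name node: Periodically checkpoints the name node's metadata so it can be recovered if the name node fails; it is not a hot standby;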
- Oozie: A workflow scheduler;
- Resource manager: Uses Apache YARN or Mesos to allocate resources -- memory and CPU;
- Node manager: Coordinates the role of each server -- node -- in the Hadoop system; and
- Job history server: Keeps a record of the execution of MapReduce and other jobs.
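One way to reason about the roles above is to sort them into master and worker groups before you build any VMs. The Python sketch below is only an illustration: the lowercase process names are assumed to match the spelling used by Sahara's Vanilla plug-in and should be checked against the plug-in you deploy, while the group layout and the `validate` helper are hypothetical.

```python
# Assumed process names; verify them against the plug-in in use.
KNOWN_PROCESSES = {
    "namenode", "datanode", "secondarynamenode", "oozie",
    "resourcemanager", "nodemanager", "historyserver",
}

# Hypothetical layout: one master group and one worker group.
NODE_GROUPS = {
    "master": ["namenode", "secondarynamenode", "resourcemanager",
               "oozie", "historyserver"],
    "worker": ["datanode", "nodemanager"],
}


def validate(groups):
    """Return a list of problems found in a role layout (empty if none)."""
    problems = []
    assigned = [proc for procs in groups.values() for proc in procs]
    for proc in assigned:
        if proc not in KNOWN_PROCESSES:
            problems.append(f"unknown process: {proc}")
    # HDFS needs at least a name node and a data node somewhere.
    for required in ("namenode", "datanode"):
        if required not in assigned:
            problems.append(f"missing required process: {required}")
    if assigned.count("namenode") > 1:
        problems.append("namenode assigned to more than one group")
    return problems


if __name__ == "__main__":
    print(validate(NODE_GROUPS))
```

Catching a missing or misspelled role at this stage is cheaper than discovering it after the VMs are running.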
To continue deploying Hadoop with OpenStack Sahara, upload a VM image such as Ubuntu with cloud-init. Next, register the image with Sahara and add tags that match the plug-in you are using, like Vanilla. Once you add Sahara to Horizon, this option becomes available in the dashboard. Then, create Node Group templates. Like VM templates, Node Group templates describe nodes with the same RAM and CPU characteristics, for example, m1.medium. Finally, combine Node Group templates into a Cluster Template.
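The relationship between Node Group templates and a Cluster Template can be sketched as plain Python data. This is a conceptual model, not Sahara's API: the class names, field names, template names and node counts are all hypothetical, and the process names are assumed rather than taken from a specific plug-in.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroupTemplate:
    name: str
    flavor: str           # e.g. "m1.medium": same RAM and CPU for every node
    node_processes: list  # Hadoop roles that run on each node in the group


@dataclass
class ClusterTemplate:
    name: str
    plugin_name: str      # should match the tag on the registered image
    node_groups: list = field(default_factory=list)  # (template, count) pairs

    def add_group(self, template, count):
        """Attach a node group template with the number of nodes to create."""
        if count < 1:
            raise ValueError("count must be at least 1")
        self.node_groups.append((template, count))

    def total_instances(self):
        """Number of VMs the cluster would need when launched."""
        return sum(count for _, count in self.node_groups)


# Hypothetical example: one master node and three workers.
master = NodeGroupTemplate("master", "m1.medium", ["namenode", "resourcemanager"])
worker = NodeGroupTemplate("worker", "m1.medium", ["datanode", "nodemanager"])

cluster = ClusterTemplate("demo-cluster", "vanilla")
cluster.add_group(master, 1)
cluster.add_group(worker, 3)
print(cluster.total_instances())  # prints 4
```

Keeping the node groups separate from the cluster definition is what lets the same worker template be reused with a different count in another cluster.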
Once you've completed these steps, create the instance in Horizon and set up master and worker Hadoop nodes. Then, launch the cluster instances. From there, you can create a Hadoop job. It could be Spark, Pig, Java, MapReduce or another type. Then, launch the job on the cluster. Lastly, output the results to Cinder or other storage.
Spark, Hadoop and Sahara
OpenStack Sahara is not an enhancement to Hadoop or Spark. Instead, you can think of it as a graphical or command-line tool that makes it easier to build out a distributed Hadoop or Spark system. Not only does it help with installing those systems; it also keeps track of which server is serving which function. So, you can go to one screen and see the entire layout.
The big difficulty with installing Spark or Hadoop without OpenStack Sahara is that you would need to set up the VMs manually. Sahara allows you to skip that step and then install Hadoop and Spark on top of those VMs. It also lets you assign a role to each server, so you know which ones are storing data, which ones are gathering data and which ones are coordinating all of that activity. Once you have all this nailed down, you can repeat the process when you need to scale, because you can save your configurations as templates, just as OpenStack saves different VM configurations as templates.
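The idea of reusing a saved layout when you scale can be shown with a few lines of Python. This is a conceptual sketch only: the dictionary shape, the group names and the `scaled` helper are hypothetical and do not represent Sahara's own data format or scaling call.

```python
import copy


def scaled(template, group, count):
    """Return a copy of a cluster layout with one group's node count changed.

    The original template is left untouched so it can be reused.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    new_template = copy.deepcopy(template)
    for node_group in new_template["node_groups"]:
        if node_group["name"] == group:
            node_group["count"] = count
            return new_template
    raise KeyError(f"no such node group: {group}")


# Hypothetical saved layout: one master, three workers.
base = {
    "name": "demo-cluster",
    "node_groups": [
        {"name": "master", "count": 1},
        {"name": "worker", "count": 3},
    ],
}

bigger = scaled(base, "worker", 10)
print(bigger["node_groups"][1]["count"])  # prints 10
print(base["node_groups"][1]["count"])    # prints 3
```

Because the roles and machine sizes live in the template, scaling becomes a matter of changing a count rather than rebuilding each server by hand.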
OpenStack Sahara also helps you keep track of other complex details and incorporate them into decision making. For example, a guiding principle behind Hadoop is that, by default, it makes three copies of each piece of data for redundancy. As such, it wouldn't make sense to put all three copies on the same machine, power source or rack. Sahara helps because it is aware of the data center rack configuration and will let you either keep that data together, for performance, or keep it separated, for redundancy.
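To make the redundancy trade-off concrete, here is a small Python sketch of a rack-aware placement rule modeled on the HDFS default for three replicas: the first copy on the writing node, the second on a node in a different rack and the third on another node in that second rack. It is a simplified illustration, not Sahara's or Hadoop's actual code, and the topology dictionary and `place_replicas` function are hypothetical.

```python
def place_replicas(topology, writer):
    """Pick three nodes to hold the replicas of one block.

    topology maps a rack name to the list of node names in that rack;
    writer is the node that is writing the block.
    """
    local_rack = None
    for rack, nodes in topology.items():
        if writer in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError(f"unknown node: {writer}")

    # Put the second and third copies together on one remote rack, so the
    # block survives the loss of either rack but crosses racks only once.
    for rack, nodes in topology.items():
        if rack != local_rack and len(nodes) >= 2:
            return [writer, nodes[0], nodes[1]]
    raise ValueError("need a second rack with at least two nodes")


# Hypothetical two-rack data center.
topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
}
print(place_replicas(topology, "n1"))  # prints ['n1', 'n3', 'n4']
```

A real scheduler also weighs free space and load, but the sketch shows why rack information has to be available before data is written.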
Taken together, Sahara makes installing Hadoop and Spark easier for those who are already using OpenStack. Of course, you can use Puppet, Ansible or Docker, but none of those are cloud operating systems.