Configuring Hadoop Cluster using Ansible

Jan 9, 2021

Ansible offers a simple architecture that doesn’t require special software to be installed on nodes. It also provides a robust set of features and built-in modules which facilitate writing automation scripts.

Ansible is a software tool that provides simple but powerful automation for cross-platform computer support. It is primarily intended for IT professionals, who use it for application deployment, updates on workstations and servers, cloud provisioning, configuration management, intra-service orchestration, and nearly anything a systems administrator does on a weekly or daily basis. Ansible doesn’t depend on agent software and needs no additional custom security infrastructure (it connects to managed nodes over SSH by default), so it’s easy to deploy.

Because Ansible is all about automation, it requires instructions to accomplish each job. With everything written down in simple script form, it’s easy to do version control. The practical result of this is a major contribution to the “infrastructure as code” movement in IT: the idea that the maintenance of server and client infrastructure can and should be treated the same as software development, with repositories of self-documenting, proven, and executable solutions capable of running an organization regardless of staff changes.

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Let’s Get Started

Task Description📄

🔰 11.1 Configure Hadoop and start cluster services using Ansible Playbook

We need one master node (the NameNode) and at least one slave node (a DataNode) to perform the task.

Before we start, we have to configure ansible.cfg and the inventory file.
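
As a rough illustration of that setup, the sketch below writes a minimal ansible.cfg and an INI-style inventory from the shell. The group names namenode and datanode, the IP addresses (taken from the reserved documentation range), the root user, and the file locations are all assumptions for illustration, not values from the original setup. It also assumes SSH key-based login to the nodes is already in place.

```shell
# Hypothetical layout: both files live in the current project directory,
# where Ansible looks for ansible.cfg when commands are run from here.
cat > ansible.cfg <<'EOF'
[defaults]
inventory = ./inventory
host_key_checking = False
remote_user = root
EOF

# One group per Hadoop role; the addresses are placeholders.
cat > inventory <<'EOF'
[namenode]
192.0.2.10

[datanode]
192.0.2.11
EOF
```

Keeping the two roles in separate inventory groups lets each playbook target only the hosts it is meant for.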

We check connectivity between the hosts:

ansible all -m ping

Now we create the Ansible playbooks for the NameNode and the DataNode in YAML format.

GitHub URL for the playbook:

arth/Hadoop using Ansible at main · gursimarh/arth (github.com)
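
The linked repository holds the actual playbooks. To give a rough idea of the shape a NameNode playbook can take, here is a minimal sketch, written to a file from the shell. It is not the author's playbook: the file name namenode.yml, the /nn directory, port 9001, and the /etc/hadoop config path are assumptions, and it presumes a Hadoop 1.x install with Java already present and the hadoop commands on the remote PATH.

```shell
# Write a hypothetical NameNode playbook; the quoted 'EOF' keeps the shell
# from touching the {{ ... }} Jinja2 expressions.
cat > namenode.yml <<'EOF'
---
- name: Configure and start the HDFS NameNode (sketch)
  hosts: namenode
  vars:
    nn_dir: /nn        # assumed metadata directory
    nn_port: 9001      # assumed NameNode port
  tasks:
    - name: Create the NameNode metadata directory
      ansible.builtin.file:
        path: "{{ nn_dir }}"
        state: directory

    - name: Write hdfs-site.xml
      ansible.builtin.copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>{{ nn_dir }}</value>
            </property>
          </configuration>

    - name: Write core-site.xml
      ansible.builtin.copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <?xml version="1.0"?>
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:{{ nn_port }}</value>
            </property>
          </configuration>

    - name: Format the NameNode (skipped once the directory is formatted)
      ansible.builtin.shell: echo Y | hadoop namenode -format
      args:
        creates: "{{ nn_dir }}/current"

    - name: Start the NameNode daemon
      ansible.builtin.command: hadoop-daemon.sh start namenode
EOF
```

A DataNode playbook would follow the same pattern under these assumptions: target the datanode group, set dfs.data.dir instead of dfs.name.dir, point fs.default.name at the NameNode's address, skip the format step, and start the daemon with hadoop-daemon.sh start datanode. Note that the start task in this sketch is not idempotent; running it again while the daemon is already up is likely to fail.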

We can also check the syntax of a playbook before running it:

ansible-playbook --syntax-check <playbook_name>

Now let's run the playbooks:
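
The run itself is one ansible-playbook call per playbook. As a minimal sketch, the helper below syntax-checks and then runs each playbook it is given, stopping at the first failure. The function name run_playbooks and the ANSIBLE_PLAYBOOK override variable are hypothetical conveniences, not part of Ansible.

```shell
# Syntax-check and then run each playbook, in the order given.
# ANSIBLE_PLAYBOOK may be overridden (for example with "echo") to preview
# the commands; by default the real ansible-playbook binary is used.
run_playbooks() {
    for playbook in "$@"; do
        "${ANSIBLE_PLAYBOOK:-ansible-playbook}" --syntax-check "$playbook" || return 1
        "${ANSIBLE_PLAYBOOK:-ansible-playbook}" "$playbook" || return 1
    done
}
```

With playbooks named, say, namenode.yml and datanode.yml (assumed names), the call would be run_playbooks namenode.yml datanode.yml. Running the NameNode playbook first means the DataNode has a master to register with when it starts.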

The Hadoop cluster has been configured successfully.

We can check whether the services are running on both the NameNode and the DataNode:

jps
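
jps prints one line per running JVM in the form "<pid> <ClassName>", so the check can also be scripted. The helper below is a small sketch of that idea; the name has_daemon is hypothetical.

```shell
# Succeeds if the named Hadoop daemon appears in jps-style output on stdin.
# The match is on the whole class name, so "NameNode" does not match
# "SecondaryNameNode".
has_daemon() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}
```

On the master this could be used as jps | has_daemon NameNode && echo "NameNode is running", and on a slave with DataNode in place of NameNode.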

Now, to check the report of the Hadoop cluster (on Hadoop 2 and later, hdfs dfsadmin -report is the equivalent command):

hadoop dfsadmin -report

This is how we can set up a Hadoop cluster using Ansible.

Written by Gursimar Singh
