Does Hadoop really use the concept of parallelism while uploading files to the cluster?

Gursimar Singh
3 min read · Jan 11, 2021

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Such clusters run Hadoop’s open source distributed processing software on low-cost commodity computers.

ADVANTAGES OF APACHE HADOOP:

1. Varied Data Sources

Hadoop accepts a variety of data. Data can come from a range of sources, such as email conversations and social media, and can be structured or unstructured. Hadoop can derive value from diverse data and can accept it as text files, XML files, images, CSV files, etc.

2. Cost-effective

Hadoop is an economical solution, as it uses a cluster of commodity hardware to store data. Commodity hardware consists of inexpensive machines, so the cost of adding nodes to the framework is not high.

3. Flexible

Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data.

4. Fast

Hadoop’s unique storage method is based on a distributed file system that essentially ‘maps’ data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing.

5. Resilient to failure

A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.

We use the tcpdump command to check what really happens in the background.

First, we set up a Hadoop cluster.
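As a rough sketch (the hostname namenode.example.com, the port, and the paths are illustrative assumptions, not taken from the original setup), every node only needs to know where the NameNode lives:

```bash
# Minimal illustrative core-site.xml pointing all nodes at the NameNode.
# namenode.example.com:9000 is an assumed address; replace with your own.
cat > "$HADOOP_HOME/etc/hadoop/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
EOF

# Format the NameNode once, then start the daemons (Hadoop 3.x syntax).
hdfs namenode -format          # master node only
hdfs --daemon start namenode   # master node
hdfs --daemon start datanode   # on each DataNode
```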

Second, we upload a file from the client.
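For example (the file name and HDFS path below are assumptions for illustration):

```bash
# Push a local file into HDFS from the client machine.
hdfs dfs -put bigfile.txt /user/hadoop/bigfile.txt

# Verify the upload.
hdfs dfs -ls /user/hadoop/
```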

At the same time, we monitor the DataNodes using the tcpdump command to get the packet details.
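A capture along these lines works on each DataNode; note that the interface eth0 and port 9866 (the default DataNode data-transfer port in Hadoop 3.x; older releases used 50010) are assumptions that may differ on your cluster:

```bash
# Run on a DataNode while the client uploads the file.
# -nn prints raw IPs/ports so the source and destination are easy to read.
sudo tcpdump -i eth0 -nn port 9866
```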

We examine packet details such as the source and destination IP addresses and the payload data, and use them to reach a final conclusion.

From the captured packets, it is clear that the client uploads the data directly to the DataNode without going through the NameNode (the master node).

We repeat the same process while reading the data from the client and find that the client reads the data directly from the DataNode, again without going through the NameNode.
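A simple way to reproduce the read test, reusing the illustrative path from above while tcpdump is still running on the DataNodes:

```bash
# Read the file back on the client; the capture shows data flowing
# DataNode -> client, with only metadata traffic touching the NameNode.
hdfs dfs -cat /user/hadoop/bigfile.txt > /dev/null
```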

The final verdict:

We found that the client uploads data only to the first DataNode; the remaining replications are done by the DataNodes themselves. That is, while the client streams the file to the first DataNode, the first DataNode forwards a replica to the second DataNode, and the second DataNode forwards a replica to the third. This pipeline continues according to the replication factor.
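This pipeline can be confirmed with hdfs fsck, which lists every block of the file and the DataNodes holding each replica (the path is the illustrative one used earlier):

```bash
# Show each block, its replication count, and the DataNodes storing it.
hdfs fsck /user/hadoop/bigfile.txt -files -blocks -locations
```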

We also noticed that the first DataNode reports back to the NameNode to confirm that it is alive, the second DataNode reports to the first DataNode that it is alive, and so on along the pipeline.

The main takeaway is that Hadoop does not use parallelism for uploading files: replicas are written sequentially along a DataNode pipeline rather than in parallel by the client.
