Everything you need to know about Hadoop Shell


Everything About Hadoop Shell

Hadoop Shell is a Linux-like terminal utility used to interact with Hadoop's distributed file system (HDFS). Linux users will find its interface and functionality familiar; the only difference lies in the operations performed behind the scenes, because the Hadoop shell runs every command on a distributed cluster instead of a single computer. Whether we have set up Hadoop on a single machine, a laptop, desktop, or MacBook, Hadoop's core functionality remains the same across operating systems, and so does the user experience.

 

If you haven't set up Apache Hadoop on your computer yet, you should check out our related posts: How to Quickly Setup Apache Hadoop on Windows PC for Windows users, and How to setup Apache Hadoop Cluster on a Mac or Linux Computer for MacBook and Linux users.

 


Next, we will dig deeper into the various commands that the Hadoop shell offers. As of this article, we have used Hadoop version 2.7.3 for the list of featured commands. Again, remember that each Hadoop shell command runs on a distributed network of machines, not a single machine. Even if you have installed Hadoop on your personal computer, internally Hadoop will run commands much the same way it would on a full Hadoop cluster. If you are new to Hadoop, we highly recommend checking the articles suggested above (especially the ones with Hadoop in the title).


Let's move on to specific commands in the Hadoop shell. This lesson will be more effective if you try these commands as you go. If you haven't set up Hadoop yet or need help setting it up on your personal computer, please check the step-by-step guides suggested above for Mac, Linux, or Windows PCs.

If you want to learn more about how Hadoop works, we recommend Hadoop – The Definitive Guide.

 

Getting Help in Hadoop Shell

There are many different releases of Hadoop available, which you can find on the Apache Hadoop releases page. As the Hadoop community keeps adding new features regularly, it's advisable to look at recent release notes to get familiar with the latest updates and commands.

 

The most basic operation in any tool is getting help when you need it. The Hadoop shell, too, comes with a help feature. There are two ways to get help about a command in the Hadoop shell:

 

  1. using the help command
  2. using the usage command

 

In its simplest form, the help command prints the entire help documentation of the Hadoop shell: the full list of available commands along with all the optional parameters they support.

COMMAND: help
Syntax

hadoop fs -help

 

If we do not want to dump the entire help documentation but are interested only in a specific command, we can run it as:

$> hadoop fs -help ls

 

COMMAND: usage
Syntax

hadoop fs -usage <command>

 

Let's try it in the terminal:

$> hadoop fs -usage ls

 

COMMAND: mkdir

The mkdir command is similar to the Linux shell's mkdir command; it is used to create directories on the Hadoop file system.

Syntax

hadoop fs -mkdir [-p] <paths>

 

Where <paths> is the absolute path of one or more HDFS directories that we would like to create. The -p option creates any missing parent directories along the path, so an entire hierarchy can be made in a single command.
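
For example, a minimal sketch (the path below is illustrative):

hadoop fs -mkdir -p /learn/hdfs/admin   # with -p, /learn and /learn/hdfs are created too if they don't exist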

 

COMMAND: chmod

The command -chmod is used to change the permissions of a file or directory. The -R option recursively applies the permissions to all files and subdirectories.

Syntax

hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
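
For example, to give the owner full access and everyone else read and execute permissions, recursively (the path is illustrative):

hadoop fs -chmod -R 755 /learn/hdfs   # 755 = rwxr-xr-x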

 

COMMAND: chown

The command -chown is used to change the ownership of a file or directory. The -R option recursively applies the ownership to all files and subdirectories.

Syntax

hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
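
For example, a sketch assuming a user root and a group dev exist (the names and the path are illustrative):

hadoop fs -chown -R root:dev /learn/hdfs   # owner root, group dev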

 

COMMAND: chgrp

The command -chgrp is used to change the group of a file or directory. The -R option recursively applies the group to all files and subdirectories.

Syntax

hadoop fs -chgrp [-R] GROUP URI [URI ...]
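
For example (the group name and path are illustrative):

hadoop fs -chgrp -R dev /learn/hdfs   # equivalent to hadoop fs -chown -R :dev /learn/hdfs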

COMMAND: copyFromLocal
Syntax

hadoop fs -copyFromLocal <localsrc> URI

The command -copyFromLocal is used to copy files from the local file system to Hadoop's distributed file system. Here, URI is an absolute path on HDFS.
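
For example, a minimal sketch (both paths are illustrative):

hadoop fs -copyFromLocal fs.txt /learn/hdfs/fs.txt   # local source on the left, HDFS destination on the right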

 

COMMAND: copyToLocal
Syntax

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

The command -copyToLocal is used to copy files from Hadoop's distributed file system to the local file system.
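
For example, a sketch that copies an HDFS file into the current local directory (the path is illustrative):

hadoop fs -copyToLocal /learn/hdfs/fs.txt .   # HDFS source on the left, local destination on the right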

 

COMMAND: cp
Syntax

hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

The command -cp is used to copy files between locations within Hadoop's distributed file system.

 

A key feature of the -cp command in the Hadoop shell is that it can retain the original file's attributes after copying. For instance, the file's timestamps, ownership, permission, ACL, and XAttr attributes are retained on the destination file when the -p[topax] option is provided. There is also a -f (force) option, which allows overwriting an existing destination file without throwing an error, as it would if -f were not provided.
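
For example, a sketch that copies a file into another HDFS directory while preserving its timestamps, ownership, and permissions (paths are illustrative):

hadoop fs -cp -p /learn/hdfs/hadoop.txt /learn/hdfs/data/   # add -f to overwrite an existing destination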

 

COMMAND: put
Syntax

hadoop fs -put <localsrc> ... <dst>

The command -put is similar to the copyFromLocal command; it copies files from the local file system to an HDFS destination.
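
For example, a sketch uploading two local files at once (all paths are illustrative); multiple sources are allowed when the destination is a directory:

hadoop fs -put fs.txt hadoop.txt /learn/hdfs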

 

COMMAND: get
Syntax

hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

The command -get is similar to the copyToLocal command; it copies files from HDFS to a local destination.
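
For example (paths are illustrative):

hadoop fs -get /learn/hdfs/hadoop.txt ./hadoop.txt   # HDFS source on the left, local destination on the right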

 

COMMAND: moveFromLocal
Syntax

hadoop fs -moveFromLocal <localsrc> <dst>

The command -moveFromLocal is used to move files from the local file system to HDFS targets. It is similar to -put, except that the local source is deleted once the copy succeeds.
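
For example, a sketch (paths are illustrative); note the local copy is gone after this succeeds:

hadoop fs -moveFromLocal fs.txt /learn/hdfs/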

 

COMMAND: moveToLocal
Syntax

hadoop fs -moveToLocal [-crc] <src> <dst>

The command -moveToLocal is meant to move files from Hadoop's distributed file system to local drives. As of the Hadoop 2.7.3 release, this command is just a placeholder and is not implemented yet; running it displays a "Not implemented yet" message. Still, it's good to have some idea of the features Hadoop plans to deliver in the future.

 

COMMAND: mv
Syntax

hadoop fs -mv URI [URI ...] <dest>

The command -mv is used to move files from one HDFS location to another. Multiple sources are allowed, in which case the destination must be a directory. Moving files across file systems is not permitted.
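
For example, a sketch that moves a file into another HDFS directory (paths are illustrative):

hadoop fs -mv /learn/hdfs/hadoop.txt /learn/hdfs/data/   # both source and destination live on HDFS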

 

COMMAND: cat
Syntax

hadoop fs -cat URI [URI ...]

The command -cat is used to view the contents of file(s) stored on Hadoop's distributed file system and print them to the screen.
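
For example (the path is illustrative):

hadoop fs -cat /learn/hdfs/fs.txt   # prints the file's contents to stdout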

 

COMMAND: tail
Syntax

hadoop fs -tail [-f] URI

The command -tail displays the last kilobyte of an HDFS file. The -f option keeps streaming new content as it is appended to the file.
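
For example (the path is illustrative):

hadoop fs -tail -f /learn/hdfs/hadoop.txt   # -f keeps streaming content as it is appended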

 

COMMAND: text
Syntax

hadoop fs -text <src>

The command -text is used to view the contents of encoded files on HDFS, such as compressed ones, by printing them in decoded text form.
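
For example, a sketch using a gzipped file like the one in the checksum section below (the path is illustrative):

hadoop fs -text /learn/hdfs/admin/compressed.txt.gz   # prints the decoded contents to stdout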

 

COMMAND: checksum
Syntax

hadoop fs -checksum URI

The command -checksum is used to compute the checksum value of a file stored on Hadoop's distributed file system.
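
For example (the path is illustrative):

hadoop fs -checksum /learn/hdfs/hadoop.txt   # prints the checksum algorithm and value for the file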

WHAT IS CHECKSUM

A checksum is basically a number computed by running mathematical operations over the digital content of a file. If a file's content changes or differs from the source file, recalculating its checksum yields a different number. That is how files can be validated for accuracy after transmission: by comparing the source file's checksum with the destination file's checksum.

An important point to understand here is that the -checksum command works on HDFS files only and uses a different algorithm than the one used in a Linux shell. In other words, if you copy a file from the local file system to HDFS and expect their checksums to match, they won't: Linux's cksum command uses a different algorithm to compute the checksum value than HDFS does.


HDFS uses MD5-of-0MD5-of-512CRC32C as of the 2.7.3 release, while Linux uses its native utilities such as md5sum and cksum.

But it is still possible to compare the checksum values of a file between the local FS and HDFS using a simple trick. Here's how you can do it:

hadoop fs -cat /learn/hdfs/admin/compressed.txt.gz | cksum

 

Although this trick works in most cases, it's not recommended for large files: you are streaming the entire contents of the HDFS file (via the -cat command) into the Linux shell, which transmits all the data between systems and consumes network bandwidth. So use this approach only for small files if you have to. If the files are large, it's better to enhance the end application to perform validation as it reads them.

 

COMMAND: appendToFile
Syntax

hadoop fs -appendToFile <localsrc> ... <dst>

The command -appendToFile is used to append new content to existing files stored on HDFS. It can also read input from stdin when the source is given as -.
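
For example, a small sketch appending a line from stdin (the path is illustrative):

echo "one more line" | hadoop fs -appendToFile - /learn/hdfs/hadoop.txt   # '-' tells the command to read from stdin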

 

COMMAND: ls
Syntax

hadoop fs -ls [-d] [-h] [-R] [-t] [-S] [-r] [-u] <args>

The command -ls is used to list the files and directories stored in HDFS.

Options:

  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (e.g., 64.0m instead of 67108864).
  • -R: Recursively list subdirectories encountered.
  • -t: Sort output by modification time (most recent first).
  • -S: Sort output by file size.
  • -r: Reverse the sort order.
  • -u: Use access time rather than modification time for display and sorting.
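
For example, a sketch listing everything under a directory recursively with human-readable sizes (the path is illustrative):

hadoop fs -ls -R -h /learn/hdfs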

 

COMMAND: find
Syntax

hadoop fs -find <path> ... <expression> ...

The command -find is used to locate files on Hadoop's distributed file system.

Many people struggle with the -find command in Hadoop, but it's actually very simple. Here's an example.

hadoop fs -find /learn/hdfs -name "*hadoop*" -print   # quotes keep the local shell from expanding the glob

Here, we search all files under the /learn/hdfs location whose names contain "hadoop". The match is case-sensitive, so files with "Hadoop" instead of "hadoop" in the name won't be shown in the result; use -iname instead of -name for a case-insensitive match.

 

COMMAND: getmerge
Syntax

hadoop fs -getmerge <src> <localdst> [addnl]

The command -getmerge is used to combine multiple input files into a single file at the destination. It works from HDFS to the local drive only, not vice versa or HDFS to HDFS, as of the current Hadoop release. The optional addnl argument adds a newline character at the end of each merged file.
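
For example, a sketch that merges every file under an HDFS directory into one local file (paths are illustrative):

hadoop fs -getmerge /learn/hdfs/development merged.txt   # HDFS source directory, local destination file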

 

COMMAND: stat
Syntax

hadoop fs -stat [format] <path> ...

The command -stat prints statistics about files or directories on the Hadoop distributed file system.

The optional [format] parameter defines the layout of the desired report. Here are the various [format] specifiers:

  • %b: file size in blocks
  • %F: type
  • %g: group name of owner
  • %n: name
  • %o: block size
  • %r: replication
  • %u: user name of owner
  • %y or %Y: modification date. %y shows the UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.

 

This is a lesser-known and rarely used command, but it can be greatly useful for administrators. Here's an example of it.

 

hadoop fs -stat "%F,%u,%g,%b,%r,%y,%n" /learn/hdfs

It will generate a report like this:

 

Type,User,Group,Size,Replication,Modification Date,Item

regular file,root,dev,8497,1,2018-02-04 18:56:56,Hadoop.txt

directory,root,dev,0,0,2018-02-04 20:06:33,data

regular file,root,dev,3872,1,2018-01-30 23:38:21,fs.txt

regular file,root,dev,8497,1,2018-01-30 20:23:04,hadoop.txt

regular file,root,dev,8497,1,2018-01-30 22:50:15,hadoop2.txt

regular file,root,dev,8497,1,2018-01-30 22:51:04,hadoop3.txt

 

COMMAND: count
Syntax

hadoop fs -count [-q] [-h] [-v] <paths>

The command -count is used to count files and directories on Hadoop's distributed file system.


When used with the -q option, it reports quota information in the following format:

 

QUOTA  REM_QUOTA  SPACE_QUOTA REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE PATHNAME

The file sizes are easier to read with the -h option, which converts bytes to MB, GB, or TB.
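
For example (the path is illustrative):

hadoop fs -count -q -h /learn/hdfs   # quota report with human-readable sizes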

 

COMMAND: df
Syntax

hadoop fs -df [-h] URI [URI ...]

The command -df is used to display the free space available on the Hadoop distributed file system.
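
For example, querying the default file system with human-readable sizes:

hadoop fs -df -h /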

A sample report would look like this

 

Filesystem                               Size    Used   Available  Use%
hdfs://sandbox-hdp.hortonworks.com:8020  41.6 G  1.6 G  24.1 G     4%

COMMAND: du
Syntax

hadoop fs -du [-s] [-h] URI [URI ...]

The command -du is used to display disk usage statistics for files and directories on the Hadoop distributed file system.

When the -du command is used with the -s option, it provides an aggregated summary of all files in a directory rather than per-file figures. It's better to combine it with the -h option to display the size column in human-readable format.
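
For example, per-item usage in human-readable form (the path is illustrative):

hadoop fs -du -h /learn/hdfs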

Here’s a sample report

866 /learn/hdfs/admin

73.9K /learn/hdfs/development

29.5K /learn/hdfs/operations

 

COMMAND: rm
Syntax

hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI ...]

The command -rm is used to delete files or directories from HDFS.

Hadoop provides an optional -skipTrash parameter which, as the name suggests, permanently deletes the file and skips the step of moving it to the trash directory. Otherwise, all removed files are moved to the .Trash directory.
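
For example, a sketch that permanently deletes a directory tree, bypassing the trash (the path is illustrative):

hadoop fs -rm -r -skipTrash /learn/hdfs/data   # -r recurses into directories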

 

COMMAND: expunge
Syntax

hadoop fs -expunge

The command -expunge is used to empty the .Trash directory on HDFS.

 

COMMAND: test
Syntax

hadoop fs -test -[defsz] URI

The command -test is used to check properties of a file or directory on HDFS; the result is reported through the command's exit code.

Options:

  • -d: if the path is a directory, return 0.
  • -e: if the path exists, return 0.
  • -f: if the path is a file, return 0.
  • -s: if the path is not empty, return 0.
  • -z: if the file is zero length, return 0.
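
Because -test reports its result through the exit code rather than printed output, you can check $? or chain commands on it. A small sketch (the path is illustrative):

hadoop fs -test -e /learn/hdfs/hadoop.txt && echo "file exists"   # prints only if the path exists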

 

COMMAND: touchz
Syntax

hadoop fs -touchz URI [URI ...]

The command -touchz is used to create empty (zero-length) files directly at HDFS locations.
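
For example (the path is illustrative):

hadoop fs -touchz /learn/hdfs/empty.txt   # creates a zero-length file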

 

COMMAND: truncate
Syntax

hadoop fs -truncate [-w] <length> <paths>

The command -truncate is used to reduce a file to a particular size on HDFS.

Hadoop provides an optional -w parameter for -truncate that makes the command wait for the truncation to complete before returning; this is fine for smaller files but can take a while for larger ones. Also remember that this is a dangerous command: truncated data is not recoverable and is lost forever, so be careful when using it.
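
For example, a sketch that keeps only the first 100 bytes of a file and waits for the operation to complete (the path is illustrative):

hadoop fs -truncate -w 100 /learn/hdfs/hadoop.txt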

COMMAND: setrep
Syntax

hadoop fs -setrep [-R] [-w] <numReplicas> <path>

The command -setrep is used to change the replication factor of a file or directory on HDFS.

When -setrep is combined with the -w option, it waits for the replication to finish, which may take a very long time depending on the number and size of the files; so avoid -w unless it's really necessary. If the replication factor is increased (say from 1 to 3), Hadoop creates two additional copies of the data and transfers them over the network to different DataNodes, which may take a while if many files are affected.
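
For example, a sketch that raises a file's replication factor to 3 and waits for the extra copies to be created (the path is illustrative):

hadoop fs -setrep -w 3 /learn/hdfs/hadoop.txt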


This concludes all the shell commands that we wanted to cover in this article.

We hope this post proves immensely helpful to you and your organization. In this article, we featured everything you need to know about the Hadoop shell, and we believe it will help you start your career as a big data engineer.

If you liked this article, do share it with your colleagues and friends. Do you have any questions or suggestions for us? Please leave them in the comments section below.

Do not forget to sign up for our free newsletter.

 
