Everything you need to know about Hadoop Shell


Everything About Hadoop Shell

Hadoop Shell is a Linux-like terminal utility used to interact with Hadoop's distributed file system (HDFS). Linux users will find its interface and functionality familiar; the only difference lies in the operations performed behind the scenes, because the Hadoop shell runs every command on a distributed cluster instead of a single computer. Whether we have set up Hadoop on a single machine, a laptop, desktop, or MacBook, Hadoop's core functionality remains the same across operating systems, and so does the user experience.

 

If you haven't set up Apache Hadoop on your computer yet, you should check out our related posts: How to Quickly Setup Apache Hadoop on Windows PC for Windows users, and How to setup Apache Hadoop Cluster on a Mac or Linux Computer for MacBook and Linux users.

 


Next, we will dig deeper into the various commands that the Hadoop shell offers. As of this article, we have used Hadoop version 2.7.3 for the list of featured commands. Again, remember that each Hadoop shell command runs on a distributed network of machines, not a single machine. Even if you have installed Hadoop on your personal computer, internally Hadoop will run commands much the same way it would on a full Hadoop cluster. If you are new to Hadoop, we highly recommend checking the articles suggested above (especially the ones with Hadoop in the title).


Let's move on to specific commands in the Hadoop shell. This lesson will be more effective if you try these commands as you go. If you haven't set up Hadoop yet or need help setting it up on your personal computer, please check the step-by-step guides suggested above for Mac, Linux, or Windows PCs.

If you want to learn more about how Hadoop works, we recommend Hadoop – The Definitive Guide.

 

Getting Help in Hadoop Shell

There are many different releases of Hadoop available, which you can find on the Apache Hadoop releases page. As the Hadoop community keeps adding new features regularly, it's advisable to look at recent release notes to get familiar with the latest updates and commands.

 

The most basic operation in any tool is getting help when you need it. The Hadoop shell, too, comes with a help feature. There are two ways to get help about a command in the Hadoop shell:

 

  1. using the help command
  2. using the usage command

 

In its simplest form, the help command prints the entire help documentation of the Hadoop shell: the full list of available commands along with all the optional parameters they support.

COMMAND: help
Syntax

hadoop fs -help

 

If we do not want to dump the entire help documentation but are interested only in a specific command, we can run it as:

$> hadoop fs -help ls

 

COMMAND: usage
Syntax

hadoop fs -usage <command>

 

Let's try it in the terminal:

$> hadoop fs -usage ls

 

COMMAND: mkdir

The mkdir command is similar to the Linux shell's mkdir command; it is used to create directories on the Hadoop file system.

Syntax

hadoop fs -mkdir [-p] <paths>

 

Where <paths> is the absolute path of one or more HDFS directories that we would like to create. The -p option creates any missing parent directories along the path, so an entire hierarchy can be made in a single command.
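
For example, a minimal sketch (the path below is illustrative):

hadoop fs -mkdir -p /learn/hdfs/admin   # with -p, /learn and /learn/hdfs are created too if they don't exist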

 

COMMAND: chmod

The command -chmod is used to change the permissions of a file or directory. The -R option recursively applies the permissions to all files and subdirectories.

Syntax

hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
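
For example, to give the owner full access and everyone else read and execute permissions, recursively (the path is illustrative):

hadoop fs -chmod -R 755 /learn/hdfs   # 755 = rwxr-xr-x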

 

COMMAND: chown

The command -chown is used to change the ownership of a file or directory. The -R option recursively applies the ownership to all files and subdirectories.

Syntax

hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
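
For example, a sketch assuming a user root and a group dev exist (the names and the path are illustrative):

hadoop fs -chown -R root:dev /learn/hdfs   # owner root, group dev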

 

COMMAND: chgrp

The command -chgrp is used to change the group of a file or directory. The -R option recursively applies the group to all files and subdirectories.

Syntax

hadoop fs -chgrp [-R] GROUP URI [URI ...]
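
For example (the group name and path are illustrative):

hadoop fs -chgrp -R dev /learn/hdfs   # equivalent to hadoop fs -chown -R :dev /learn/hdfs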

COMMAND: copyFromLocal
Syntax

hadoop fs -copyFromLocal <localsrc> URI

The command -copyFromLocal is used to copy files from the local file system to Hadoop's distributed file system. Here, URI is an absolute path on HDFS.
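
For example, a minimal sketch (both paths are illustrative):

hadoop fs -copyFromLocal fs.txt /learn/hdfs/fs.txt   # local source on the left, HDFS destination on the right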

 

COMMAND: copyToLocal
Syntax

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

The command -copyToLocal is used to copy files from Hadoop's distributed file system to the local file system.
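
For example, a sketch that copies an HDFS file into the current local directory (the path is illustrative):

hadoop fs -copyToLocal /learn/hdfs/fs.txt .   # HDFS source on the left, local destination on the right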

 

COMMAND: cp
Syntax

hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

The command -cp is used to copy files between locations within Hadoop's distributed file system.

 

A key feature of the -cp command in the Hadoop shell is that it can retain the original file's attributes after copying. For instance, the file's timestamps, ownership, permission, ACL, and XAttr attributes are retained on the destination file when the -p[topax] option is provided. There is also a -f (force) option, which allows overwriting an existing destination file without throwing an error, as it would if -f were not provided.
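
For example, a sketch that copies a file into another HDFS directory while preserving its timestamps, ownership, and permissions (paths are illustrative):

hadoop fs -cp -p /learn/hdfs/hadoop.txt /learn/hdfs/data/   # add -f to overwrite an existing destination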

 

COMMAND: put
Syntax

hadoop fs -put <localsrc> ... <dst>

The command -put is similar to the copyFromLocal command; it copies files from the local file system to an HDFS destination.
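
For example, a sketch uploading two local files at once (all paths are illustrative); multiple sources are allowed when the destination is a directory:

hadoop fs -put fs.txt hadoop.txt /learn/hdfs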

 

COMMAND: get
Syntax

hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

The command -get is similar to the copyToLocal command; it copies files from HDFS to a local destination.
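
For example (paths are illustrative):

hadoop fs -get /learn/hdfs/hadoop.txt ./hadoop.txt   # HDFS source on the left, local destination on the right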

 

COMMAND: moveFromLocal
Syntax

hadoop fs -moveFromLocal <localsrc> <dst>

The command -moveFromLocal is used to move files from the local file system to HDFS targets. It is similar to -put, except that the local source is deleted once the copy succeeds.
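
For example, a sketch (paths are illustrative); note the local copy is gone after this succeeds:

hadoop fs -moveFromLocal fs.txt /learn/hdfs/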

 

COMMAND: moveToLocal
Syntax

hadoop fs -moveToLocal [-crc] <src> <dst>

The command -moveToLocal is meant to move files from Hadoop's distributed file system to local drives. As of the Hadoop 2.7.3 release, this command is just a placeholder and is not implemented yet; running it displays a "Not implemented yet" message. Still, it's good to have some idea of the features Hadoop plans to deliver in the future.

 

COMMAND: mv
Syntax

hadoop fs -mv URI [URI ...] <dest>

The command -mv is used to move files from one HDFS location to another. Multiple sources are allowed, in which case the destination must be a directory. Moving files across file systems is not permitted.
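
For example, a sketch that moves a file into another HDFS directory (paths are illustrative):

hadoop fs -mv /learn/hdfs/hadoop.txt /learn/hdfs/data/   # both source and destination live on HDFS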

 

COMMAND: cat
Syntax

hadoop fs -cat URI [URI ...]

The command -cat is used to view the contents of file(s) stored on Hadoop's distributed file system and print them to the screen.
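
For example (the path is illustrative):

hadoop fs -cat /learn/hdfs/fs.txt   # prints the file's contents to stdout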

 

COMMAND: tail
Syntax

hadoop fs -tail [-f] URI

The command -tail displays the last kilobyte of an HDFS file. The -f option keeps streaming new content as it is appended to the file.
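
For example (the path is illustrative):

hadoop fs -tail -f /learn/hdfs/hadoop.txt   # -f keeps streaming content as it is appended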

 

COMMAND: text
Syntax

hadoop fs -text <src>

The command -text is used to view the contents of encoded files on HDFS, such as compressed ones, by printing them in decoded text form.
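
For example, a sketch using a gzipped file like the one in the checksum section below (the path is illustrative):

hadoop fs -text /learn/hdfs/admin/compressed.txt.gz   # prints the decoded contents to stdout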

 

COMMAND: checksum
Syntax

hadoop fs -checksum URI

The command -checksum is used to compute the checksum value of a file stored on Hadoop's distributed file system.
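
For example (the path is illustrative):

hadoop fs -checksum /learn/hdfs/hadoop.txt   # prints the checksum algorithm and value for the file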

WHAT IS CHECKSUM

A checksum is basically a number computed by running mathematical operations over the digital content of a file. If a file's content changes or differs from the source file, recalculating its checksum yields a different number. That is how files can be validated for accuracy after transmission: by comparing the source file's checksum with the destination file's checksum.

An important point to understand here is that the -checksum command works on HDFS files only and uses a different algorithm than the one used in a Linux shell. In other words, if you copy a file from the local file system to HDFS and expect their checksums to match, they won't: Linux's cksum command uses a different algorithm to compute the checksum value than HDFS does.


HDFS uses MD5-of-0MD5-of-512CRC32C as of the 2.7.3 release, while Linux uses its native utilities such as md5sum and cksum.

But it is still possible to compare the checksum values of a file between the local FS and HDFS using a simple trick. Here's how you can do it:

hadoop fs -cat /learn/hdfs/admin/compressed.txt.gz | cksum

 

Although this trick works in most cases, it's not recommended for large files: you are streaming the entire contents of the HDFS file (via the -cat command) into the Linux shell, which transmits all the data between systems and consumes network bandwidth. So use this approach only for small files if you have to. If the files are large, it's better to enhance the end application to perform validation as it reads them.

 

COMMAND: appendToFile
Syntax

hadoop fs -appendToFile <localsrc> ... <dst>

The command -appendToFile is used to append new content to existing files stored on HDFS. It can also read input from stdin when the source is given as -.
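
For example, a small sketch appending a line from stdin (the path is illustrative):

echo "one more line" | hadoop fs -appendToFile - /learn/hdfs/hadoop.txt   # '-' tells the command to read from stdin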

 

COMMAND: ls
Syntax

hadoop fs -ls [-d] [-h] [-R] [-t] [-S] [-r] [-u] <args>

The command -ls is used to list the files and directories stored in HDFS.

Options:

  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (e.g., 64.0m instead of 67108864).
  • -R: Recursively list subdirectories encountered.
  • -t: Sort output by modification time (most recent first).
  • -S: Sort output by file size.
  • -r: Reverse the sort order.
  • -u: Use access time rather than modification time for display and sorting.
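
For example, a sketch listing everything under a directory recursively with human-readable sizes (the path is illustrative):

hadoop fs -ls -R -h /learn/hdfs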

 

COMMAND: find
Syntax

hadoop fs -find <path> ... <expression> ...

The command -find is used to locate files on Hadoop's distributed file system.

Many people struggle with the -find command in Hadoop, but it's actually very simple. Here's an example.

hadoop fs -find /learn/hdfs -name "*hadoop*" -print   # quotes keep the local shell from expanding the glob

Here, we search all files under the /learn/hdfs location whose names contain "hadoop". The match is case-sensitive, so files with "Hadoop" instead of "hadoop" in the name won't be shown in the result; use -iname instead of -name for a case-insensitive match.

 

COMMAND: getmerge
Syntax

hadoop fs -getmerge <src> <localdst> [addnl]

The command -getmerge is used to combine multiple input files into a single file at the destination. It works from HDFS to the local drive only, not vice versa or HDFS to HDFS, as of the current Hadoop release. The optional addnl argument adds a newline character at the end of each merged file.
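
For example, a sketch that merges every file under an HDFS directory into one local file (paths are illustrative):

hadoop fs -getmerge /learn/hdfs/development merged.txt   # HDFS source directory, local destination file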

 

COMMAND: stat
Syntax

hadoop fs -stat [format] <path> ...

The command -stat prints statistics about files or directories on the Hadoop distributed file system.

The optional [format] parameter defines the layout of the desired report. Here are the various [format] specifiers:

  • %b: file size in blocks
  • %F: type
  • %g: group name of owner
  • %n: name
  • %o: block size
  • %r: replication
  • %u: user name of owner
  • %y or %Y: modification date. %y shows the UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.

 

This is a lesser-known and rarely used command, but it can be greatly useful for administrators. Here's an example of it.

 

hadoop fs -stat "%F,%u,%g,%b,%r,%y,%n" /learn/hdfs

It will generate a report like this:

 

Type,User,Group,Size,Replication,Modification Date,Item

regular file,root,dev,8497,1,2018-02-04 18:56:56,Hadoop.txt

directory,root,dev,0,0,2018-02-04 20:06:33,data

regular file,root,dev,3872,1,2018-01-30 23:38:21,fs.txt

regular file,root,dev,8497,1,2018-01-30 20:23:04,hadoop.txt

regular file,root,dev,8497,1,2018-01-30 22:50:15,hadoop2.txt

regular file,root,dev,8497,1,2018-01-30 22:51:04,hadoop3.txt

 

COMMAND: count
Syntax

hadoop fs -count [-q] [-h] [-v] <paths>

The command -count is used to count files and directories on Hadoop's distributed file system.


When used with the -q option, it reports quota information in the following format:

 

QUOTA  REM_QUOTA  SPACE_QUOTA REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE PATHNAME

The file sizes are easier to read with the -h option, which converts bytes to MB, GB, or TB.
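
For example (the path is illustrative):

hadoop fs -count -q -h /learn/hdfs   # quota report with human-readable sizes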

 

COMMAND: df
Syntax

hadoop fs -df [-h] URI [URI ...]

The command -df is used to display the free space available on the Hadoop distributed file system.
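
For example, querying the default file system with human-readable sizes:

hadoop fs -df -h /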

A sample report would look like this

 

Filesystem                               Size    Used   Available  Use%
hdfs://sandbox-hdp.hortonworks.com:8020  41.6 G  1.6 G  24.1 G     4%

COMMAND: du
Syntax

hadoop fs -du [-s] [-h] URI [URI ...]

The command -du is used to display disk usage statistics for files and directories on the Hadoop distributed file system.

When the -du command is used with the -s option, it provides an aggregated summary of all files in a directory rather than per-file figures. It's better to combine it with the -h option to display the size column in human-readable format.
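
For example, per-item usage in human-readable form (the path is illustrative):

hadoop fs -du -h /learn/hdfs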

Here’s a sample report

866 /learn/hdfs/admin

73.9K /learn/hdfs/development

29.5K /learn/hdfs/operations

 

COMMAND: rm
Syntax

hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI ...]

The command -rm is used to delete files or directories from HDFS.

Hadoop provides an optional -skipTrash parameter which, as the name suggests, permanently deletes the file and skips the step of moving it to the trash directory. Otherwise, all removed files are moved to the .Trash directory.
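
For example, a sketch that permanently deletes a directory tree, bypassing the trash (the path is illustrative):

hadoop fs -rm -r -skipTrash /learn/hdfs/data   # -r recurses into directories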

 

COMMAND: expunge
Syntax

hadoop fs -expunge

The command -expunge is used to empty the .Trash directory on HDFS.

 

COMMAND: test
Syntax

hadoop fs -test -[defsz] URI

The command -test is used to check properties of a file or directory on HDFS; the result is reported through the command's exit code.

Options:

  • -d: if the path is a directory, return 0.
  • -e: if the path exists, return 0.
  • -f: if the path is a file, return 0.
  • -s: if the path is not empty, return 0.
  • -z: if the file is zero length, return 0.
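
Because -test reports its result through the exit code rather than printed output, you can check $? or chain commands on it. A small sketch (the path is illustrative):

hadoop fs -test -e /learn/hdfs/hadoop.txt && echo "file exists"   # prints only if the path exists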

 

COMMAND: touchz
Syntax

hadoop fs -touchz URI [URI ...]

The command -touchz is used to create empty (zero-length) files directly at HDFS locations.
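
For example (the path is illustrative):

hadoop fs -touchz /learn/hdfs/empty.txt   # creates a zero-length file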

 

COMMAND: truncate
Syntax

hadoop fs -truncate [-w] <length> <paths>

The command -truncate is used to reduce a file to a particular size on HDFS.

Hadoop provides an optional -w parameter for -truncate that makes the command wait for the truncation to complete before returning; this is fine for smaller files but can take a while for larger ones. Also remember that this is a dangerous command: truncated data is not recoverable and is lost forever, so be careful when using it.
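
For example, a sketch that keeps only the first 100 bytes of a file and waits for the operation to complete (the path is illustrative):

hadoop fs -truncate -w 100 /learn/hdfs/hadoop.txt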

COMMAND: setrep
Syntax

hadoop fs -setrep [-R] [-w] <numReplicas> <path>

The command -setrep is used to change the replication factor of a file or directory on HDFS.

When -setrep is combined with the -w option, it waits for the replication to finish, which may take a very long time depending on the number and size of the files; so avoid -w unless it's really necessary. If the replication factor is increased (say from 1 to 3), Hadoop creates two additional copies of the data and transfers them over the network to different DataNodes, which may take a while if many files are affected.
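
For example, a sketch that raises a file's replication factor to 3 and waits for the extra copies to be created (the path is illustrative):

hadoop fs -setrep -w 3 /learn/hdfs/hadoop.txt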


This concludes all the shell commands that we wanted to cover in this article.

We hope this post proves immensely helpful to you and your organization. In this article, we featured everything you need to know about the Hadoop shell, and we believe it will help you start your career as a big data engineer.

If you liked this article, do share it with your colleagues and friends. Do you have any questions or suggestions for us? Please leave them in the comments section below.

Do not forget to sign up for our free newsletter.

 
