Hadoop Shell is a Linux-like terminal utility used to interact with Hadoop's distributed file system (HDFS). Linux users will find the interface and functionality familiar; the only difference lies in the operations performed behind the scenes, because the Hadoop shell runs every command against a distributed cluster instead of a single computer. Whether you have set up Hadoop on a laptop, desktop, or MacBook, the core functionality of Hadoop, and the user experience, remains the same across operating systems.
If you haven't set up Apache Hadoop on your computer yet, check out our related posts: How to Quickly Setup Apache Hadoop on Windows PC for Windows users, and How to setup Apache Hadoop Cluster on a Mac or Linux Computer for Mac and Linux users.
Quick side note: here is a list of related posts that we recommend you read:
- 6 Reasons Why Hadoop is THE Best Choice for Big Data Applications – This article explains why Hadoop leads the big data market and will continue to do so for a long time.
- Why Large number of files on Hadoop is a problem and how to fix it? – Highly recommended for anyone working on Hadoop or planning to work with it in the future.
- Integrate ElasticSearch with Hadoop Technologies – This is an intro to a very comprehensive course on integrating Hadoop with ElasticSearch, one of the key skills for advancing your data engineering career today.
- How to Quickly Setup Apache Hadoop on Windows PC – If you are a Windows PC user and landed on this page, this post has detailed instructions on installing Apache Hadoop on a Windows PC.
- Installing Spark – Scala – SBT (S3) on Windows PC – If you want to learn Spark, this article will help you get started with it.
Next, we will dig deeper into the various commands that the Hadoop shell offers. The commands featured in this article are based on Hadoop version 2.7.3. Again, remember that every Hadoop shell command runs against a distributed network of machines, not a single machine. Even if you have installed Hadoop on your personal computer, internally Hadoop runs commands in much the same way it would on a full cluster. If you are new to Hadoop, we highly recommend checking the articles suggested above (especially the ones with Hadoop in the title).
Let's move on to specific commands in the Hadoop shell. This lesson will be more effective if you try these commands as you read. If you haven't set up Hadoop yet or need help setting it up on your personal computer, please check the step-by-step guides suggested above for Mac, Linux, or Windows PCs.
If you want to learn more about how Hadoop works, we recommend Hadoop – The Definitive Guide.
Getting Help in Hadoop Shell
There are many different releases of Hadoop available, which you can find here. Since the Hadoop community keeps adding new features regularly, it is advisable to review recent release notes to stay familiar with the latest updates and commands.
The most common first step in any program is understanding how to get help when you need it. The Hadoop shell also comes with a help feature. There are two ways to get help about a command in the Hadoop shell:
- using the help command
- using the usage command
In its simplest form, the help command prints the entire help documentation of the Hadoop shell: the full list of available commands along with all the optional parameters they support.
COMMAND: help
Syntax hadoop fs -help
If we do not want to dump the entire help documentation and are interested only in a specific command, we can run it as:
$> hadoop fs -help ls
COMMAND: usage
Syntax hadoop fs -usage <command>
Let's try it in the terminal:
$> hadoop fs -usage ls
COMMAND: mkdir
The mkdir command is similar to the Linux shell's mkdir command and is used to create directories on Hadoop's distributed file system.
Syntax
hadoop fs -mkdir [-p] <paths>
Where <paths> is the absolute path of one or more HDFS directories that we would like to create. The -p option creates any missing parent directories along the path in a single command.
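For example, the following (the path is just for illustration) creates a nested directory structure in one go, including any missing parent directories:
hadoop fs -mkdir -p /learn/hdfs/data/raw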
COMMAND: chmod
The command -chmod is used to change the permissions of a file or directory. The -R option recursively applies the same permissions to all child directories and files.
Syntax hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
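As an illustration with a hypothetical path, the following recursively gives the owner full access and everyone else read and execute access on a directory and everything under it:
hadoop fs -chmod -R 755 /learn/hdfs/data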
COMMAND: chown
The command -chown is used to change the ownership of a file or directory. The -R option recursively applies the same ownership to all child directories and files.
Syntax hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
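For example, assuming a user named hdfsuser and a group named dev exist on the cluster (both names and the path are hypothetical), the following recursively makes them the owner and group of a directory tree:
hadoop fs -chown -R hdfsuser:dev /learn/hdfs/data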
COMMAND: chgrp
The command -chgrp is used to change the group of a file or directory. The -R option recursively sets the same group for all child directories and files.
Syntax hadoop fs -chgrp [-R] GROUP URI [URI ...]
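For instance, the following (the group and path are hypothetical) recursively assigns the dev group to a directory and its contents:
hadoop fs -chgrp -R dev /learn/hdfs/data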
COMMAND: copyFromLocal
Syntax hadoop fs -copyFromLocal <localsrc> URI
The command -copyFromLocal is used to copy files from the local file system to Hadoop's distributed file system. Here URI is the absolute HDFS destination path.
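A typical invocation with illustrative paths looks like this, copying a local file into an HDFS directory:
hadoop fs -copyFromLocal /tmp/sales.csv /learn/hdfs/data/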
COMMAND: copyToLocal
Syntax hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
The command -copyToLocal is used to copy files from Hadoop's distributed file system to the local file system.
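For example, with hypothetical paths, the following pulls an HDFS file down to the local /tmp directory:
hadoop fs -copyToLocal /learn/hdfs/data/sales.csv /tmp/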
COMMAND: cp
Syntax hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>
The command -cp is used to copy files between locations within Hadoop's distributed file system.
The key feature of the -cp command in the Hadoop shell is that it can retain the original file's attributes after copying. For instance, the file's timestamps, ownership, permissions, ACLs, and XAttrs are preserved on the destination file when the -p[topax] option is provided. There is also a -f (force) option, which overwrites an existing destination file instead of failing with an error, as the command would do without -f.
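Putting this together with illustrative paths, the following copies a file to an archive directory while preserving its attributes, overwriting the destination if it already exists:
hadoop fs -cp -f -p /learn/hdfs/data/sales.csv /learn/hdfs/archive/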
COMMAND: put
Syntax hadoop fs -put <localsrc> ... <dst>
The command -put is similar to the copyFromLocal command; it is used to copy files from the local file system to an HDFS destination.
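For example (the paths are illustrative):
hadoop fs -put /tmp/sales.csv /learn/hdfs/data/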
COMMAND: get
Syntax hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
The command -get is similar to the copyToLocal command; it is used to copy files from HDFS to a local destination.
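For example (the paths are illustrative):
hadoop fs -get /learn/hdfs/data/sales.csv /tmp/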
COMMAND: moveFromLocal
Syntax hadoop fs -moveFromLocal <localsrc> <dst>
The command -moveFromLocal is used to move files from the local file system to Hadoop's distributed file system; the local copy is deleted once the transfer completes.
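For example, the following (with hypothetical paths) uploads a local file and removes the local copy after the transfer:
hadoop fs -moveFromLocal /tmp/sales.csv /learn/hdfs/data/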
COMMAND: moveToLocal
Syntax hadoop fs -moveToLocal [-crc] <src> <dst>
The command -moveToLocal is used to move files from Hadoop's distributed file system to local drives. As of the Hadoop 2.7.3 release this command is just a placeholder and is not implemented yet, but it is good to know what features Hadoop is planning to add in the future.
COMMAND: mv
Syntax hadoop fs -mv URI [URI ...] <dest>
The command -mv is used to move files from one HDFS location to another. Multiple sources are allowed, in which case the destination must be a directory. Moving files across file systems is not permitted.
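For example, the following (paths are illustrative) moves a file into an existing HDFS archive directory:
hadoop fs -mv /learn/hdfs/data/sales.csv /learn/hdfs/archive/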
COMMAND: cat
Syntax hadoop fs -cat URI [URI ...]
The command -cat is used to view the contents of one or more files stored on Hadoop's distributed file system and print them to the screen.
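For example, with an illustrative path, the following prints a file to the terminal; piping it through the local head command is a handy way to peek at just the first few lines of a large file:
hadoop fs -cat /learn/hdfs/data/sales.csv | head -n 20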
COMMAND: tail
Syntax hadoop fs -tail [-f] URI
The command -tail is used to display the last kilobyte of an HDFS file. The -f option keeps the output open and streams new content as it is appended to the file.
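For example, assuming a hypothetical log file path, the following keeps printing new lines as they are appended:
hadoop fs -tail -f /learn/hdfs/logs/app.log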
COMMAND: text
Syntax hadoop fs -text <src>
The command -text is used to view the contents of files stored on HDFS in non-plain formats, such as compressed files, as readable text.
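For example, with an illustrative path, the following prints a gzip-compressed file as readable text:
hadoop fs -text /learn/hdfs/data/sales.csv.gz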
COMMAND: checksum
Syntax hadoop fs -checksum URI
The command -checksum is used to compute the checksum of a file stored on Hadoop's distributed file system.
WHAT IS CHECKSUM
A checksum is a number computed by running a mathematical operation over the digital content of a file. If the content of a file changes or differs from the source file, recalculating its checksum yields a different number. That is how files can be validated after transmission: by comparing the source file's checksum with the destination file's checksum.
An important point to understand here is that the -checksum command works on HDFS files only and uses a different algorithm from the ones used in the Linux shell. In other words, if you copy a file from the local file system to HDFS and expect the two checksums to match, they won't, because Linux tools such as cksum and md5sum compute their checksums differently from HDFS.
HDFS uses MD5-of-0MD5-of-512CRC32C as of the 2.7.3 release, while Linux relies on its own native algorithms.
But it is still possible to compare the checksum of a file between the local file system and HDFS using a simple trick. Here's how you can do it.
hadoop fs -cat /learn/hdfs/admin/compressed.txt.gz | cksum
Although this trick works in most cases, it is not recommended for large files because you are streaming (via the -cat command) the entire contents of the HDFS file to the Linux shell, which transfers all of the data over the network and consumes bandwidth. So use this approach only for small files if you have to. For large files it is better to have the consuming application perform validation when it reads them.
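To complete the comparison, run cksum against the local copy of the same file (the local path below is hypothetical) and check that both commands print the same checksum and byte count:
hadoop fs -cat /learn/hdfs/admin/compressed.txt.gz | cksum
cksum /tmp/compressed.txt.gz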
COMMAND: appendToFile
Syntax hadoop fs -appendToFile <localsrc> ... <dst>
The command -appendToFile is used to append new content to existing files stored on HDFS.
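For example, the following (with illustrative paths) appends the contents of a local file to an existing HDFS file:
hadoop fs -appendToFile /tmp/new_records.csv /learn/hdfs/data/sales.csv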
COMMAND: ls
Syntax hadoop fs -ls [-d] [-h] [-R] [-t] [-S] [-r] [-u] <args>
The command -ls is used to list the files and directories stored in HDFS. An example follows the list of options below.
Options:
- -d: Directories are listed as plain files.
- -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
- -R: Recursively list subdirectories encountered.
- -t: Sort output by modification time (most recent first).
- -S: Sort output by file size.
- -r: Reverse the sort order.
- -u: Use access time rather than modification time for display and sorting.
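For example, the following (the path is illustrative) recursively lists everything under a directory with human-readable sizes:
hadoop fs -ls -R -h /learn/hdfs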
COMMAND: find
Syntax hadoop fs -find <path> ... <expression> ...
The command -find is used to locate files on Hadoop's distributed file system.
Many people struggle with the -find command in Hadoop, but it is actually very simple. Here's an example.
hadoop fs -find /learn/hdfs -name "*hadoop*" -print
Here, we search for all files under the /learn/hdfs location whose names contain "hadoop" (the pattern is quoted so the local shell does not expand it). The match is case-sensitive, so files with "Hadoop" instead of "hadoop" in the name will not appear in the results.
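If you need a case-insensitive match, the -iname option can be used in place of -name (same illustrative path as above):
hadoop fs -find /learn/hdfs -iname "*hadoop*" -print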
COMMAND: getmerge
Syntax hadoop fs -getmerge <src> <localdst> [addnl]
The command -getmerge is used to combine multiple input files into a single file at the destination. It works from HDFS to the local file system only, not the other way around or from HDFS to HDFS, as of the current Hadoop release.
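For example, the following (paths are illustrative) merges all the files under an HDFS output directory into a single local file:
hadoop fs -getmerge /learn/hdfs/output /tmp/merged_output.txt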
COMMAND: stat
Syntax hadoop fs -stat [format] <path> ...
The command -stat is used to print statistics about files and directories on Hadoop's distributed file system.
The optional [format] parameter defines the layout of the desired output. Here are the available options for the [format] parameter.
(%b) – file size in blocks
(%F) – type
(%g) – group name of owner
(%n) – name
(%o) – block size
(%r) – replication
(%u) – user name of owner
(%y or %Y) – modification date. %y shows UTC date as “yyyy-MM-dd HH:mm:ss” and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
This is a lesser-known and rarely used command, but it can be very useful for administrators. Here's an example.
hadoop fs -stat "%F,%u,%g,%b,%r,%y,%n" "/learn/hdfs/*"
It will generate a report like this (the columns are Type, User, Group, Size, Replication, Modification Date, and Name):
regular file,root,dev,8497,1,2018-02-04 18:56:56,Hadoop.txt
directory,root,dev,0,0,2018-02-04 20:06:33,data
regular file,root,dev,3872,1,2018-01-30 23:38:21,fs.txt
regular file,root,dev,8497,1,2018-01-30 20:23:04,hadoop.txt
regular file,root,dev,8497,1,2018-01-30 22:50:15,hadoop2.txt
regular file,root,dev,8497,1,2018-01-30 22:51:04,hadoop3.txt
COMMAND: count
Syntax hadoop fs -count [-q] [-h] [-v] <paths>
The command -count is used to count files and directories on Hadoop's distributed file system.
When used with the -q option it reports in the following format:
QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
The file sizes are easier to read with the -h option, which converts bytes into MBs, GBs, or TBs.
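For example, the following (the path is illustrative) reports quota and usage figures with human-readable sizes:
hadoop fs -count -q -h /learn/hdfs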
COMMAND: df
Syntax hadoop fs -df [-h] URI [URI ...]
The command -df is used to display the free space available on the Hadoop distributed file system.
A sample report would look like this:
Filesystem                               Size    Used   Available  Use%
hdfs://sandbox-hdp.hortonworks.com:8020  41.6 G  1.6 G  24.1 G     4%
hdfs://sandbox-hdp.hortonworks.com:8020  41.6 G  1.6 G  24.1 G     4%
hdfs://sandbox-hdp.hortonworks.com:8020  41.6 G  1.6 G  24.1 G     4%
COMMAND: du
Syntax hadoop fs -du [-s] [-h] URI [URI ...]
The command -du is used to display disk usage statistics for Hadoop's distributed file system.
When used with the -s option, -du shows an aggregated summary of the total size of the files in a directory instead of individual entries. It is a good idea to combine it with the -h option to display the size column in a human-readable format.
Here’s a sample report
866     /learn/hdfs/admin
73.9K   /learn/hdfs/development
29.5K   /learn/hdfs/operations
COMMAND: rm
Syntax hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI ...]
The command -rm is used to delete files or directories from HDFS.
Hadoop provides an optional -skipTrash parameter which, as the name suggests, permanently deletes the file and skips the step of moving it to the trash directory. Otherwise, removed files are moved to the .Trash directory.
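For example, the following (the path is illustrative) recursively deletes a temporary directory and bypasses the trash, so the data is gone immediately:
hadoop fs -rm -r -skipTrash /learn/hdfs/tmp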
COMMAND: expunge
Syntax hadoop fs -expunge
The command -expunge is used to empty the .Trash directory on HDFS.
COMMAND: test
Syntax hadoop fs -test -[defsz] URI
The command -test is used to perform various checks on files or directories in HDFS and returns an exit code that can be used in scripts. A usage example follows the list of options below.
Options:
- -d: if the path is a directory, return 0.
- -e: if the path exists, return 0.
- -f: if the path is a file, return 0.
- -s: if the path is not empty, return 0.
- -z: if the file is zero length, return 0.
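For example, the following (the path is illustrative) checks whether a file exists and prints a message based on the exit code:
hadoop fs -test -e /learn/hdfs/data/sales.csv && echo "file exists"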
COMMAND: touchz
Syntax hadoop fs -touchz URI [URI ...]
The command -touchz is used to create zero-length (empty) files directly at HDFS locations.
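For example, the following (the path is illustrative) creates an empty marker file:
hadoop fs -touchz /learn/hdfs/data/_SUCCESS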
COMMAND: truncate
Syntax hadoop fs -truncate [-w] <length> <paths>
The command -truncate is used to reduce a file to a specified length in HDFS.
Hadoop provides an optional -w parameter with -truncate that makes the command wait until the truncation completes; this is fine for smaller files but can block for a long time on larger ones. Also remember that this is a dangerous command: the truncated data is not recoverable and is lost forever, so be careful when using it.
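For example, the following (the path and length are illustrative) cuts a file down to its first 1024 bytes and waits for the operation to complete:
hadoop fs -truncate -w 1024 /learn/hdfs/data/sales.csv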
COMMAND: setrep
Syntax hadoop fs -setrep [-R] [-w] <numReplicas> <path>
The command -setrep is used to change the replication factor of a file or directory on HDFS.
When setrep is combined with the -w option, it waits for the replication to finish, which may take a very long time depending on the number and size of the files, so avoid -w unless it is really necessary. If the replication factor is increased (say from 1 to 3), Hadoop creates two additional copies of the data and transfers them over the network to different data nodes, which can take a while when many files are affected.
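For example, the following (the path is illustrative) raises the replication factor of a file to 3 and waits until the extra copies are in place:
hadoop fs -setrep -w 3 /learn/hdfs/data/sales.csv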
This concludes all the shell commands that we wanted to cover in this article.
We hope this post proves immensely helpful to you and your organization. In this article we have covered everything you need to know about the Hadoop shell, and we believe it will help you start your career as a big data engineer.
If you liked this Everything you need to know about Hadoop Shell article, do share it with your colleagues and friends. Do you have any questions or suggestions for us? Please leave them in the comments section below.
Do not forget to sign up for our free newsletter below.