
Sqoop Import Overwrite Table

This post may contain affiliate links. Please read our disclosure for more info.

This solution is part of our Apache Sqoop tutorial. In this exercise, you will learn how to run the sqoop import command and overwrite existing data in a Hadoop directory.

Often you will have to run the same table load multiple times. By default, Sqoop will fail such runs because it does not allow overwriting an existing directory in Hadoop.
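If the target directory is already present, the import job typically aborts with a "directory already exists" error from Hadoop's MapReduce layer. The exact wording depends on your Hadoop and Sqoop versions, but it generally looks something like this (illustrative output, not copied from a real run):

ERROR tool.ImportTool: Encountered IOException running import job:
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /sqoop/employees already exists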


But there is a way to overcome this and replace the existing data: the sqoop directive --delete-target-dir, used together with the --target-dir parameter.

This triggers a delete of the target HDFS directory before the sqoop import process starts.

Here's a word of caution. Be very careful when using the --delete-target-dir directive in your code, as it will wipe out any existing data, which you may not be able to recover later.
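If you would rather keep the delete step explicit instead of baking it into the import command, you can achieve the same effect by removing the directory yourself before running Sqoop. A minimal sketch, assuming the target directory used later in this post (/sqoop/employees):

# Remove the previous import output before re-running the import
# (-f suppresses the error if the directory does not exist yet)
hdfs dfs -rm -r -f /sqoop/employees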

To read more about Hadoop, see our post – 6 Reasons Why Hadoop is THE Best Choice for Big Data Applications.

Why should you learn Sqoop?

Apache Sqoop's architecture is designed to import data from relational databases such as Oracle and MySQL into Hadoop systems. Hadoop is ideal for batch processing of huge amounts of data and is an industry standard nowadays.

In real-world scenarios, you can use sqoop to transfer data from relational tables into Hadoop, then leverage Hadoop's parallel processing capabilities to process huge amounts of data and generate meaningful insights. The results of that processing can be stored back into relational tables using sqoop's export functionality.

As Sqoop is an integral part of big data Hadoop systems, it often comes up in big data job interviews. This post walks you through a sqoop import example that will help you with those sqoop interview questions.


 

How to install Hadoop

Before proceeding with this exercise, you must have Hadoop installed on your machine.

If you are using a Macbook or Linux computer, refer to this step-by-step guide on installing Hadoop on a Macbook.

If you are a Windows PC user, refer to this guide to install Apache Hadoop on your Windows PC.

The rest of this exercise assumes that you have already installed Apache Hadoop on your computer.


Next, you need to do two more installations before you are ready to run the sqoop import commands:

  1. Install MySQL – download MySQL first and then follow the instructions to install MySQL on Windows or a Macbook.
  2. Install Sqoop – here as well, download Sqoop first and then follow the instructions to install Sqoop on Windows or a Macbook.

The great news is that if you have followed the steps from our guide to install Apache Hadoop on a Windows PC, then you don't need to install MySQL or Sqoop at all. They come pre-built into the Hortonworks Data Platform Sandbox package, which makes life much easier.

But if you are a Macbook or Linux user, you can use the brew utility to install MySQL and Sqoop very easily too. Here are the steps for Mac/Linux users:

Install MySQL

To install MySQL on a Macbook or Linux machine, just run the command below in a terminal:

brew install mysql

It will automatically download the latest release of MySQL, including all its dependencies, and install it on your computer.
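After the installation finishes, you will typically want to start the MySQL service and create the sample database and table used later in this post. A minimal sketch, assuming a local root account; the database name (sqoop), table name (employees), columns, and rows are placeholders matching this tutorial, so adjust them to your setup:

# Start MySQL as a background service (Homebrew installs)
brew services start mysql

# Create the sample database and a small employees table, then add a couple of rows
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS sqoop;"
mysql -u root -p -e "CREATE TABLE IF NOT EXISTS sqoop.employees (id INT PRIMARY KEY, name VARCHAR(100), department VARCHAR(50)); INSERT INTO sqoop.employees VALUES (1, 'Alice', 'HR'), (2, 'Bob', 'IT');"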

 

Sqoop Installation

To install Sqoop on a Macbook or Linux machine, just run the command below at a terminal prompt:

brew install sqoop

This will automatically download Sqoop from the internet and install it on your machine.
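You can quickly confirm that the installation worked by asking Sqoop for its version and listing the available tools:

sqoop version
sqoop help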


 

Sqoop Import using target-dir

After following the installation steps above, you are ready to proceed with the sqoop import process. In this exercise, we will use the --target-dir directive from Apache Sqoop. Using --target-dir, we can import all records from a MySQL table into flat files in a specific directory on Hadoop's distributed file system (HDFS).

We will also add another sqoop directive, --delete-target-dir, to clean up any existing data before the sqoop import is invoked.

Let's look at the command first; we will go through it in detail next.

 

sqoop import \
 --connect jdbc:mysql://sandbox-hdp.hortonworks.com/sqoop \
 --username root \
 --password hadoop \
 --table employees \
 --target-dir /sqoop/employees \
 --driver com.mysql.jdbc.Driver \
 --delete-target-dir

How sqoop import command works

In the above sqoop import command, we are importing all records from the 'employees' table in a MySQL database called 'sqoop'.

--connect

To connect to the database, we use the --connect directive followed by the JDBC connection string for the MySQL database.

--connect jdbc:mysql://sandbox-hdp.hortonworks.com/sqoop

For Mac/Linux users, the mysql connection string will look something like:

--connect jdbc:mysql://127.0.0.1:3306/sqoop

--username

This is the username for the MySQL database.

--password

This is the password for the MySQL database.
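Passing the password on the command line is fine for a local sandbox, but it ends up in your shell history. Sqoop also lets you prompt for the password interactively with -P, or read it from a protected file with --password-file; a short sketch, where the password file path is just a placeholder:

# Prompt for the password instead of typing it into the command
sqoop import \
 --connect jdbc:mysql://127.0.0.1:3306/sqoop \
 --username root \
 -P \
 --table employees \
 --target-dir /sqoop/employees

# Or read the password from a file with restricted permissions
sqoop import \
 --connect jdbc:mysql://127.0.0.1:3306/sqoop \
 --username root \
 --password-file /user/yourname/mysql.password \
 --table employees \
 --target-dir /sqoop/employees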

--table

This is the table that you want to import from the MySQL database. In our case, we are importing the 'employees' table.

--target-dir

This is the target directory on the Hadoop distributed file system (HDFS) where you would like the data to be copied. You can pass the full HDFS URI or just the absolute path from the root directory on HDFS.

In our case, it is set to /sqoop/employees.
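For example, both of the following point at the same location; the first uses the full HDFS URI (the NameNode host and port shown are placeholders for the sandbox), the second a plain absolute path:

--target-dir hdfs://sandbox-hdp.hortonworks.com:8020/sqoop/employees
--target-dir /sqoop/employees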

--driver

This directive specifies the driver class to be used for making the JDBC connection to the MySQL database.


You can set it to com.mysql.jdbc.Driver.
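Keep in mind that --driver only names the driver class; the MySQL JDBC connector jar itself must be on Sqoop's classpath. A minimal sketch, assuming a Homebrew Sqoop install; the jar file name, version, and lib path are placeholders for your actual setup:

# Copy the MySQL JDBC connector jar into Sqoop's lib directory
cp mysql-connector-java-8.0.xx.jar /usr/local/opt/sqoop/libexec/lib/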


--delete-target-dir

This directive will delete any existing data in the target directory before starting the sqoop import process.

P.S. Be very careful while using this directive; once data is deleted, it may not be recoverable, depending on the Hadoop cluster setup.

 

Run Sqoop Command

To run this command, open a terminal on your computer, paste the above sqoop import command into it, and press enter. When you do that, Sqoop will first run a clean-up command to delete any existing data in the target directory on HDFS. Once that is complete, Sqoop will launch 4 map tasks by default. Each map task will create one output file on HDFS, and the data in the MySQL 'employees' table will be split across the 4 map tasks.
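Once the job finishes, you can list the target directory on HDFS. With the default 4 map tasks you would typically see one part file per mapper plus a _SUCCESS marker (the file names below follow the usual MapReduce naming pattern):

# List the imported files and peek at a few records
hdfs dfs -ls /sqoop/employees
hdfs dfs -cat /sqoop/employees/part-m-00000 | head

If you want more or fewer output files, the number of mappers can be changed with Sqoop's -m (or --num-mappers) option.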

 

We hope you enjoyed this sqoop lesson on how to run the sqoop import command to copy data from a MySQL table and overwrite any existing data inside the target directory on the Hadoop Distributed File System (HDFS).

 

