Installing Spark – Scala – SBT (S3) on Windows PC


As you might already be aware, Apache Spark, Scala, and SBT are among the hottest technologies in the world today. Together, these three packages enable anyone to develop really fast, distributed applications for big data processing. After spending many years playing around with these technologies, I have often been asked how to install them on, say, a local Windows PC. Not everybody can afford to spend $2,000 on a MacBook! So, I decided to write this post about installing Spark, Scala, and SBT on a Windows PC in 15 minutes. Let's get started.

Pre-requisites

Before we can install these packages, they must be downloaded from their respective websites. So, let's go ahead and download the following packages:

1. Spark

For the Spark package, choose a pre-built version of Spark instead of the source code. You can download the source as well, but it will need to be built before it can be used, so to keep this article short I will only cover the pre-built version. Make sure you download the package built for your operating system. Also note that the version might change in the future, so any stable version of Spark should be fine.

http://spark.apache.org/downloads.html

For this exercise, I am using Spark 2.1.0 (pre-built for Hadoop 2.7).
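Optionally, you can verify the integrity of the download against the checksum published on the downloads page. On Windows, certutil can compute the hash for you (match the algorithm to whichever checksum file the page provides):

> certutil -hashfile spark-2.1.0-bin-hadoop2.7.tgz SHA512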

If you are interested in learning Spark in depth, we recommend checking out High Performance Spark by Holden Karau.

 

2. Scala

Scala is a functional programming language that Spark supports natively, alongside Java and Python. In this article we will set up a Scala programming environment, so download it from the link below:

https://www.scala-lang.org/download/

For this exercise, I am using Scala 2.10.5. (Note that the pre-built Spark 2.1.0 binaries are compiled against Scala 2.11, so if you plan to build your own Spark applications you may prefer a 2.11.x release.)

3. SBT

SBT, originally short for "Simple Build Tool", is similar to Maven but more concise. So, go ahead and download the package appropriate for your operating system from the link below. SBT is needed if you are planning to build your own Spark applications or to build the Spark package from its source code.

http://www.scala-sbt.org/download.html

For this exercise, I am using sbt-0.13.13.1

If your Scala application requires external libraries, be warned that as of SBT 0.13 it is not trivial to bundle them into a single Scala application, but it is possible. You can find out how to bundle external libraries into a Scala application here.
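For reference, here is what a minimal build.sbt for a Spark 2.1.0 application could look like. This is a sketch, not a drop-in file: the project name is made up, and the Scala version is set to 2.11.x because the pre-built Spark 2.1.0 binaries are compiled against Scala 2.11.

name := "my-spark-app"

version := "0.1"

scalaVersion := "2.11.8"

// "provided" keeps Spark's own jars out of your packaged application,
// since spark-submit and spark-shell supply them at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"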


4. Java

Apart from the above packages, you will also need Java installed on your machine (Java 7 or later for Spark 2.x; Java 8 is recommended), with the JAVA_HOME and PATH environment variables set accordingly. You can install the latest Java version from:

https://java.com/en/download/

For this exercise, I have Java 8 installed.
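You can quickly confirm that Java is wired up correctly from a command prompt:

> java -version
> echo %JAVA_HOME%

The first command should print the installed Java version, and the second should print the folder where Java is installed.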

5. winutils.exe

Spark doesn't need a full Hadoop installation in order to run, but when you run the 'spark-shell' command (covered later), it will try to locate a /tmp/hive directory as it would on HDFS. winutils.exe lets you designate a Windows folder for this purpose and suppresses the errors Spark would otherwise throw about missing directories. So, download it from the link below.

https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe

 


Set up your working environment

Now that we have all the prerequisite tools and packages downloaded on our machine, let's place them in an organized manner so that the rest of the applications and files on the machine aren't affected. Please follow the steps below:


1. Create a new folder as:

c:\sandbox

This will be our sandbox folder holding all required installations and packages together.

2. Extract the spark-2.1.0-bin-hadoop2.7.tgz file downloaded earlier, using WinZip or an equivalent tool.

3. Rename the extracted folder to 'spark' instead of 'spark-2.1.0-bin-hadoop2.7'. This makes it easy to refer to the folder without remembering the long name. Your folder should look like this:

(screenshot: directory structure)

4. Move the 'spark' folder under the c:\sandbox directory we created in step 1.

5. Install the Scala programming language by running the downloaded installer. In my case it is 'scala-2.10.5.exe'.

6. You can set its location to c:\sandbox\scala or leave it at the default location, c:\Program Files (x86)\scala.

7. Install SBT by running the downloaded installer. In my case it is 'sbt-0.13.13.1.msi'.

8. Here again, you can set its location to c:\sandbox\sbt or leave it at the default location, c:\Program Files (x86)\sbt.

9. Create two new folders:

c:\sandbox\tmp

c:\sandbox\tmp\hive

After you have created the above folders, your directory structure should look like this:

(screenshot: directory structure)

10. Now create two more new folders:

c:\sandbox\winutils 

c:\sandbox\winutils\bin
(screenshot: directory structure)

11. Move the downloaded winutils.exe file under c:\sandbox\winutils\bin.
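If you prefer the command prompt, steps 9 through 11 can also be done like this (assuming winutils.exe landed in your Downloads folder; adjust the source path if not):

> mkdir c:\sandbox\tmp\hive
> mkdir c:\sandbox\winutils\bin
> move "%USERPROFILE%\Downloads\winutils.exe" c:\sandbox\winutils\bin\

Note that mkdir creates the intermediate folders (tmp, winutils) automatically.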

We are done organizing the environment!!! Take a walk at this point.


Alright, we are back after a short walk.

Now we have everything placed where we need it. Let's configure the environment variables so everything works together.

1. Go to the desktop, right-click on "My Computer" and select "Properties".

2. In the pop-up window, select "Advanced System Settings".

(screenshot: Advanced System Settings)

3. Then click on the "Advanced" tab and the "Environment Variables" button.

(screenshot: Environment Variables)

4. In the pop-up window called Environment Variables, add the following new variables:

SPARK_HOME = c:\sandbox\spark
SCALA_HOME = c:\sandbox\scala
SBT_HOME = c:\sandbox\sbt
HADOOP_HOME = c:\sandbox\winutils

And append each of these bin folders to the existing PATH variable:

%SPARK_HOME%\bin
%SCALA_HOME%\bin
%SBT_HOME%\bin
%HADOOP_HOME%\bin

Make sure these entries are appended to the existing PATH value, with a semicolon (;) between the different values. Your final PATH variable should look something like:

PATH;%SCALA_HOME%\bin;%SBT_HOME%\bin;%SPARK_HOME%\bin;%SPARK_HOME%;%HADOOP_HOME%\bin
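To verify, open a new command prompt (environment variable changes only apply to windows opened afterwards) and spot-check a few values:

> echo %SPARK_HOME%
> where spark-shell
> where winutils.exe

The where commands should print paths under c:\sandbox if everything is set up correctly.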

We are done with all the installations, and now it's time to test whether everything is working!!!



Testing

Finally, you want to test whether everything is installed properly and working as expected. Let's first test Scala, as follows:

1. Go to the Start menu, click on "Run", and type cmd to open a command prompt.

2. Type scala -version at the prompt and you should get output like the one below:
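The output should look something like this (the exact copyright years depend on your build):

Scala code runner version 2.10.5 -- Copyright 2002-2013, LAMP/EPFL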

If you got this far, it means your Scala package is installed and configured correctly.
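While you are at it, you can also confirm SBT is on the PATH:

> sbt sbtVersion

The first run may take a few minutes, since SBT downloads its own dependencies.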

Now let's test the Spark package.

1. First, give Spark full access to the \tmp\hive directory (the stand-in HDFS location we created earlier) by typing the following at the command prompt:

> winutils.exe chmod 777 c:\sandbox\tmp\hive

You shouldn’t get any errors or warnings here.
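If you want to double-check that the permissions took effect, winutils can list them too:

> winutils.exe ls c:\sandbox\tmp\hive

You should see drwxrwxrwx in the output.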

2. Now run the Spark shell:
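> spark-shell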

You might get some warnings, which you can safely ignore. Eventually you should see the Spark banner followed by a scala> prompt, which confirms that the installation worked as expected.

You can also type some Scala commands and run them on Spark, as follows:
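For example, here is a quick sanity check you can paste at the scala> prompt; it builds a small RDD and sums it (sc, the SparkContext, is pre-created by spark-shell):

scala> val numbers = sc.parallelize(1 to 100)
scala> numbers.sum()   // should print 5050.0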


Finally, open the Spark UI in your browser at http://localhost:4040/ to track your jobs on the web dashboard (the UI is available while the shell or an application is running).

If you have reached this far, then you are all set to write your own Spark applications and build and run them locally on your machine. Enjoy!!!


We hope you liked this post. If you have any questions, please ask us in the comments below.

