Installing Spark – Scala – SBT (S3) on Windows PC


As you might already be aware, Apache Spark, Scala, and SBT are among the hottest technologies in the world today. Together, these three packages enable anyone to develop really fast, distributed applications for big data processing. After spending many years playing around with these technologies, I have often been asked how to install them on, say, a local Windows PC. Not everybody can afford to spend $2,000 on a MacBook! So, I decided to write this post about installing Spark, Scala, and SBT on a Windows PC in 15 minutes. Let's get started.

Pre-requisites

Before we can install these packages, they must be downloaded from their respective websites. So, let's go ahead and download the following packages:

1. Spark

For the Spark package, choose a pre-built version of Spark instead of the source code. You can download the source as well, but it will need to be built before it can be used, so to keep this article short I will only cover the pre-built version. Make sure you download the package built for your operating system. Also note that the version might change in the future, so any stable version of Spark should be fine.

http://spark.apache.org/downloads.html

For this exercise, I am using Spark 2.1.0 (pre-built for Hadoop 2.7).
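Optionally, you can verify the integrity of the download against the checksum published on the downloads page. On Windows, certutil can compute the hash for you (match the algorithm to whichever checksum file the page provides):

> certutil -hashfile spark-2.1.0-bin-hadoop2.7.tgz SHA512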

If you are interested in learning Spark in depth, we recommend checking out High Performance Spark by Holden Karau.

 

2. Scala

Scala is a functional programming language that Spark supports natively, alongside Java and Python. In this article we will set up a Scala programming environment, so download it from the link below:

https://www.scala-lang.org/download/

For this exercise, I am using Scala 2.10.5. (Note that the pre-built Spark 2.1.0 binaries are compiled against Scala 2.11, so if you plan to build your own Spark applications you may prefer a 2.11.x release.)

3. SBT

SBT, originally short for "Simple Build Tool", is similar to Maven but more concise. So, go ahead and download the package appropriate for your operating system from the link below. SBT is needed if you are planning to build your own Spark applications or to build the Spark package from its source code.

http://www.scala-sbt.org/download.html

For this exercise, I am using sbt-0.13.13.1

If your Scala application requires external libraries, be warned that as of SBT 0.13 it is not trivial to bundle them into a single Scala application, but it is possible. You can find out how to bundle external libraries into a Scala application here.
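For reference, here is what a minimal build.sbt for a Spark 2.1.0 application could look like. This is a sketch, not a drop-in file: the project name is made up, and the Scala version is set to 2.11.x because the pre-built Spark 2.1.0 binaries are compiled against Scala 2.11.

name := "my-spark-app"

version := "0.1"

scalaVersion := "2.11.8"

// "provided" keeps Spark's own jars out of your packaged application,
// since spark-submit and spark-shell supply them at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"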


4. Java

Apart from the above packages, you will also need Java installed on your machine (Java 7 or later for Spark 2.x; Java 8 is recommended), with the JAVA_HOME and PATH environment variables set accordingly. You can install the latest Java version from:

https://java.com/en/download/

For this exercise, I have Java 8 installed.
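You can quickly confirm that Java is wired up correctly from a command prompt:

> java -version
> echo %JAVA_HOME%

The first command should print the installed Java version, and the second should print the folder where Java is installed.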

5. winutils.exe

Spark doesn't need a full Hadoop installation in order to run, but when you run the 'spark-shell' command (covered later), it will try to locate a /tmp/hive directory as it would on HDFS. winutils.exe lets you designate a Windows folder for this purpose and suppresses the errors Spark would otherwise throw about missing directories. So, download it from the link below.

https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe

 


Set up your working environment

Now that we have all the prerequisite tools and packages downloaded on our machine, let's place them in an organized manner so that the rest of the applications and files on the machine aren't affected. Please follow the steps below:


1. Create a new folder as:

c:\sandbox

This will be our sandbox folder holding all required installations and packages together.

2. Extract the spark-2.1.0-bin-hadoop2.7.tgz file downloaded earlier, using WinZip or an equivalent tool.

3. Rename the extracted folder to 'spark' instead of 'spark-2.1.0-bin-hadoop2.7'. This makes it easy to refer to the folder without remembering the long name. Your folder should look like this:

(screenshot: directory structure)

4. Move the 'spark' folder under the c:\sandbox directory we created in step 1.

5. Install the Scala programming language by running the downloaded installer. In my case it is 'scala-2.10.5.exe'.

6. You can set its location to c:\sandbox\scala or leave it at the default location, c:\Program Files (x86)\scala.

7. Install SBT by running the downloaded installer. In my case it is 'sbt-0.13.13.1.msi'.

8. Here again, you can set its location to c:\sandbox\sbt or leave it at the default location, c:\Program Files (x86)\sbt.

9. Create two new folders:

c:\sandbox\tmp

c:\sandbox\tmp\hive

After you have created the above folders, your directory structure should look like this:

(screenshot: directory structure)

10. Now create two more new folders:

c:\sandbox\winutils 

c:\sandbox\winutils\bin
(screenshot: directory structure)

11. Move the downloaded winutils.exe file under c:\sandbox\winutils\bin.
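If you prefer the command prompt, steps 9 through 11 can also be done like this (assuming winutils.exe landed in your Downloads folder; adjust the source path if not):

> mkdir c:\sandbox\tmp\hive
> mkdir c:\sandbox\winutils\bin
> move "%USERPROFILE%\Downloads\winutils.exe" c:\sandbox\winutils\bin\

Note that mkdir creates the intermediate folders (tmp, winutils) automatically.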

We are done organizing the environment!!! Take a walk at this point.


Alright, we are back after a short walk.

Now we have everything placed where we need it. Let's configure the environment variables so everything works together.

1. Go to the desktop, right-click on "My Computer" and select "Properties".

2. In the pop-up window, select "Advanced System Settings".

(screenshot: Advanced System Settings)

3. Then click on the "Advanced" tab and the "Environment Variables" button.

(screenshot: Environment Variables)

4. In the pop-up window called Environment Variables, add the following new variables:

SPARK_HOME = c:\sandbox\spark
SCALA_HOME = c:\sandbox\scala
SBT_HOME = c:\sandbox\sbt
HADOOP_HOME = c:\sandbox\winutils

And append each of these bin folders to the existing PATH variable:

%SPARK_HOME%\bin
%SCALA_HOME%\bin
%SBT_HOME%\bin
%HADOOP_HOME%\bin

Make sure these entries are appended to the existing PATH value, with a semicolon (;) between the different values. Your final PATH variable should look something like:

PATH;%SCALA_HOME%\bin;%SBT_HOME%\bin;%SPARK_HOME%\bin;%SPARK_HOME%;%HADOOP_HOME%\bin
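To verify, open a new command prompt (environment variable changes only apply to windows opened afterwards) and spot-check a few values:

> echo %SPARK_HOME%
> where spark-shell
> where winutils.exe

The where commands should print paths under c:\sandbox if everything is set up correctly.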

We are done with all the installations, and now it's time to test whether everything is working!!!



Testing

Finally, you want to test whether everything is installed properly and working as expected. Let's first test Scala, as follows:

1. Go to the Start menu, click on "Run", and type cmd to open a command prompt.

2. Type scala -version at the prompt and you should get output like the one below:
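The output should look something like this (the exact copyright years depend on your build):

Scala code runner version 2.10.5 -- Copyright 2002-2013, LAMP/EPFL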

If you got this far, it means your Scala package is installed and configured correctly.
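While you are at it, you can also confirm SBT is on the PATH:

> sbt sbtVersion

The first run may take a few minutes, since SBT downloads its own dependencies.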

Now let's test the Spark package.

1. First, give Spark full access to the \tmp\hive directory (the stand-in HDFS location we created earlier) by typing the following at the command prompt:

> winutils.exe chmod 777 c:\sandbox\tmp\hive

You shouldn’t get any errors or warnings here.
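If you want to double-check that the permissions took effect, winutils can list them too:

> winutils.exe ls c:\sandbox\tmp\hive

You should see drwxrwxrwx in the output.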

2. Now run the Spark shell:
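> spark-shell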

You might get some warnings, which you can safely ignore. Eventually you should see the Spark banner followed by a scala> prompt, which confirms that the installation worked as expected.

You can also type some Scala commands and run them on Spark, as follows:
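For example, here is a quick sanity check you can paste at the scala> prompt; it builds a small RDD and sums it (sc, the SparkContext, is pre-created by spark-shell):

scala> val numbers = sc.parallelize(1 to 100)
scala> numbers.sum()   // should print 5050.0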


Finally, open the Spark UI in your browser at http://localhost:4040/ to track your jobs on the web dashboard (the UI is available while the shell or an application is running).

If you have reached this far, then you are all set to write your own Spark applications and build and run them locally on your machine. Enjoy!!!


We hope you liked this post. If you have any questions, please ask us in the comments below.

