You might already be aware of some of the hottest technologies in the world today. Yes, if you thought of Apache Spark, Scala, and SBT, then you are right. Together, these three packages enable anyone to develop really fast, distributed applications for big data processing. After spending many years playing with these technologies, I have often been asked how to install them on, say, a local Windows PC. Not everybody can afford to spend $2000 on a MacBook!!! So, I decided to write this post about installing Spark, Scala, and SBT on a Windows PC in 15 minutes. So, let's get started.
Pre-requisites
Before we can install these packages, they must be downloaded from their respective websites. So, let's go ahead and download the following packages:
1. Spark
For the Spark package, choose a pre-built version of Spark instead of the source. You can download the source files as well, but they will need to be built before they can be used, so to keep this article short I will only cover the pre-built version of Spark. Make sure you download the package that matches the Hadoop version you want to use; the pre-built archive itself works on any operating system. Also note that the version might change in the future, so any stable version of Spark should be fine.
http://spark.apache.org/downloads.html
For this exercise, I am using Spark 2.1.0
If you are interested in learning Spark, then we recommend you check out High Performance Spark by Holden Karau.
2. Scala
Scala is a functional programming language that Spark supports natively, alongside Java and Python. In this article, we will set up a Scala programming environment, so download it from the link below:
https://www.scala-lang.org/download/
For this exercise, I am using Scala 2.10.5. (Note that Spark 2.1.0 ships pre-built for Scala 2.11, so if you plan to compile your own Spark applications against it, a 2.11.x install may be a safer match.)
3. SBT
SBT, originally short for "Simple Build Tool", is similar to Maven but more concise and tight. So, go ahead and download a package appropriate for your operating system from the link below. SBT is needed if you are planning to build your own Spark applications or build the Spark package from its source code.
http://www.scala-sbt.org/download.html
For this exercise, I am using sbt-0.13.13.1
If your Scala application requires some external libraries, then I warn you that as of SBT 0.13 it is not easy to bundle them into a single Scala application, but it is possible. You can find out how to compile external libraries into a Scala application here.
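To give you a feel for what an SBT project looks like, here is a minimal build.sbt sketch for a Spark application. The project name and versions are illustrative; match the spark-core version to the Spark you installed:

name := "my-spark-app"

version := "0.1"

scalaVersion := "2.11.8"

// The %% operator appends the Scala binary version (_2.11) to the
// artifact name, so scalaVersion and the Spark artifact must agree.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"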
4. Java
Apart from the above packages, you will also need Java v6 or later installed on your machine, with the JAVA_HOME and PATH environment variables set accordingly. You can install the latest Java version from Oracle's Java download page.
For this exercise, I have Java 8+ installed
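A quick way to confirm that Java is installed and visible on your PATH is to run the following at a command prompt; it should print the installed Java version (if it doesn't, revisit your JAVA_HOME and PATH settings):

> java -version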
5. winutils.exe
Although Spark doesn't need Hadoop installed on the machine in order to run, when you run the 'spark-shell' command (covered later) it will try to locate a /tmp/hive directory as if it were on HDFS. winutils.exe will help you designate a Windows folder as that directory and avoid the errors Spark would otherwise throw for missing directories and permissions. So, download it from the link below.
https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe
Set up your working environment
Now we have all the pre-requisite tools/packages downloaded on our machine. Let's place them in an organized manner so the rest of the applications and files on the machine aren't affected. So, please follow the steps below:
1. Create a new folder as:
c:\sandbox
This will be our sandbox folder holding all required installations and packages together.
2. Extract the spark-2.1.0-bin-hadoop2.7.tgz file downloaded earlier, using WinZip, 7-Zip, or any equivalent tool.
3. Rename the extracted folder to 'spark' instead of 'spark-2.1.0-bin-hadoop2.7'. This makes it easy to refer to this folder without remembering the long name.
4. Move the 'spark' folder under the c:\sandbox directory we created in step 1.
5. Install the Scala programming language by running the downloaded installer. In my case it is 'scala-2.10.5.exe'.
6. You can set its location to c:\sandbox\scala or to the default location c:\program files (x86)\scala.
7. Install SBT by running the downloaded installer. In my case it is 'sbt-0.13.13.1'.
8. Here again, you can set its location to c:\sandbox\sbt or to the default location c:\program files (x86)\sbt.
9. Create 2 new folders as:
c:\sandbox\tmp
c:\sandbox\tmp\hive
After you have created the above folders, your directory structure should look like:
c:\sandbox\spark
c:\sandbox\tmp
c:\sandbox\tmp\hive
10. Now create 2 more new folders as:
c:\sandbox\winutils
c:\sandbox\winutils\bin
11. Move the downloaded winutils.exe file under c:\sandbox\winutils\bin
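At this point your sandbox should look roughly like the listing below (scala and sbt appear under c:\sandbox only if you pointed their installers there instead of the default Program Files location):
c:\sandbox\spark
c:\sandbox\tmp\hive
c:\sandbox\winutils\bin\winutils.exe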
We are done with organizing the environment!!! Take a walk at this point.
Alright, we are back after a short walk.
Now we have everything placed as we need it. Let's configure the environment properties so everything works together.
1. Go to desktop and right-click on “My Computer”. Select “Properties”.
2. In the pop-up window, select "Advanced System Settings".
3. Then click on the "Advanced" tab and the "Environment Variables" button.
4. In the pop-up window called Environment Variables, let's add the following new variables:
SPARK_HOME = c:\sandbox\spark
PATH = %SPARK_HOME%\bin
SCALA_HOME = c:\sandbox\scala
PATH = %SCALA_HOME%\bin
SBT_HOME = c:\sandbox\sbt
PATH = %SBT_HOME%\bin
HADOOP_HOME = c:\sandbox\winutils
PATH = %HADOOP_HOME%\bin
Make sure the values are appended to the existing PATH variable (not replacing it), with a semicolon (;) between different values. Finally, your PATH variable should look something like:
PATH;%SCALA_HOME%\bin;%SBT_HOME%\bin;%SPARK_HOME%\bin;%SPARK_HOME%;%HADOOP_HOME%\bin
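To confirm the variables took effect, open a new command prompt (already-open windows won't pick up the changes) and run:

> echo %SPARK_HOME%
> where spark-shell

The first should print c:\sandbox\spark and the second should resolve to the spark-shell scripts under c:\sandbox\spark\bin.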
We are done with all the installations needed, and now it's time to test if everything is working!!!
Testing
Finally, you want to test if everything is installed properly and working as expected. Let's first test Scala as follows:
1. Go to the Start Menu, click on "Run", and then type cmd to open a command prompt.
2. Type scala -version at the prompt and you should see the installed Scala version printed (2.10.5 in my case).
If you got this far, it means your Scala package is installed and configured correctly.
Now let's test the Spark package.
1. First, open up the \tmp\hive directory (which Spark will treat as an HDFS-style scratch location) by granting it full permissions at the command prompt:
> winutils.exe chmod 777 c:\sandbox\tmp\hive
You shouldn’t get any errors or warnings here.
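If you want to double-check the result, winutils also provides an ls command you can use to inspect the permissions on that folder; its output should show the full rwxrwxrwx permissions corresponding to the 777 we just set:

> winutils.exe ls c:\sandbox\tmp\hive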
2. Now run the Spark shell as:
> spark-shell
You might get some warnings, which you can safely ignore. Finally, you should see the Spark welcome banner and a scala> prompt, which confirms that the installation worked as expected.
You can also type some Scala commands and run them on Spark.
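Here is a minimal smoke test you can paste at the scala> prompt. It uses the pre-built SparkContext (sc) and the README.md that ships inside the Spark package; any text file path will do:

// load a local text file as an RDD using the pre-built SparkContext
val lines = sc.textFile("c:/sandbox/spark/README.md")
// count() triggers an actual Spark job and returns the number of lines
lines.count()
// a simple transformation: count only the lines that mention Spark
lines.filter(line => line.contains("Spark")).count()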
Finally, open the Spark UI in your browser at http://localhost:4040/ to track your jobs on the web dashboard.
If you have reached this far, then you are all set to write your own Spark applications and build and run Spark locally on your machine. Enjoy!!!
You might be interested in knowing why cloud computing is one of the most in-demand skills today and why there's a huge shortage of engineers with AWS certifications. We recommend checking out these courses, which can help you get AWS certified in no time.
We hope you liked this post. If you have any questions, please ask us in the comments below.