This guide is for beginners who want to install Apache Spark on a Windows machine. I will assume that you have a 64-bit version of Windows and that you already know how to add environment variables on Windows.
Note: you don’t need any prior knowledge of the Spark framework to follow this guide.
1. Install Java
First, we need to install Java to execute Spark applications. Note that the JRE is enough if you only want to run Spark applications and won't develop new ones in Java; however, in this guide we will install the JDK.
To do that, go to this page and download the latest version of the JDK. After you install it, add the JAVA_HOME variable to your System Variables and make sure that its value points to the JDK parent folder (see figure below for a demonstration).
After you add this variable, it's time to modify the Path system variable and add a new entry like this: %JAVA_HOME%\bin. This lets the Windows command line recognize Java commands.
Now, start the Command Line and type:
java -version
to check that Java was installed correctly.
2. Install Scala
Download the Scala Windows installer from this page: scroll down to the “Other resources” section and download the MSI file for Windows (see figure below). Install it, then add a new variable to your System Variables named SCALA_HOME which points to the parent folder of Scala. Finally, add %SCALA_HOME%\bin to the Path system variable.
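Once Java and Scala are installed, you can sanity-check both from the Scala REPL (type `scala` in a new Command Line window). The snippet below is a small sketch, assuming JAVA_HOME and SCALA_HOME were set as described above; it prints the versions and the variables:

```scala
// Print the Java version the JVM is running on
println("Java version:  " + System.getProperty("java.version"))

// Print the Scala library version
println("Scala version: " + scala.util.Properties.versionNumberString)

// Check that the variables from the steps above are visible
// (getOrElse avoids an exception if a variable is not set yet)
println("JAVA_HOME:  " + sys.env.getOrElse("JAVA_HOME", "<not set>"))
println("SCALA_HOME: " + sys.env.getOrElse("SCALA_HOME", "<not set>"))
```

If either variable prints as `<not set>`, close and reopen the Command Line window: environment variables are only read when a process starts.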
3. Spark binaries
Since it's not easy to build Spark from source, we will download a pre-built package that contains all the Spark binaries needed to run it. Go to this page and choose the latest stable version pre-built for Hadoop 2.7 and later (see figure below). Extract the compressed file to any location you choose and make sure that the path to this location doesn't contain any spaces. I suggest placing the Spark folder directly at the root of a partition (C: for example).
Add a new variable to your system variables and name it SPARK_HOME. This variable holds the path to the Spark parent directory (C:\spark-2.2.0-bin-hadoop2.7 for example). After that, add %SPARK_HOME%\bin to the Path system variable.
4. Hadoop WinUtils
Since we are using pre-built Spark binaries for Hadoop, we also need additional binary files to run Spark on Windows. Create a new folder named “WinUtils” at the root of any partition (C:\WinUtils for example). Then go to this page and download the repository by clicking the green button on the right and choosing the Download ZIP option. After you download the ZIP file, extract it and copy the contents of the “hadoop-2.7.1” folder into the WinUtils folder (don't copy the whole directory, just its content, the bin folder).
Note that you can use another folder/location for the Hadoop Windows binaries, but we used this layout to keep things simple and organized.
Now, add the HADOOP_HOME variable to your system variables and make it point to the WinUtils folder (C:\WinUtils in this case).
Note: make sure that the variables we added point to parent directories, not to bin folders!
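A quick way to verify the “parent directory, not bin folder” rule is to check that each home directory actually contains a bin subfolder. The helper below is my own sketch (the function name `hasBinFolder` is hypothetical, not part of any library; the variable names match the ones we created above) and can be pasted into a Scala REPL:

```scala
import java.io.File

// Returns true when `path` looks like a parent directory, i.e. it
// contains a `bin` subfolder (the layout Spark and winutils expect)
def hasBinFolder(path: String): Boolean =
  new File(path, "bin").isDirectory

// Check every variable we created in the previous steps
for (name <- Seq("JAVA_HOME", "SCALA_HOME", "SPARK_HOME", "HADOOP_HOME")) {
  sys.env.get(name) match {
    case Some(p) if hasBinFolder(p) => println(s"$name OK: $p")
    case Some(p) => println(s"$name is set to $p, but no bin folder was found there")
    case None    => println(s"$name is not set")
  }
}
```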
5. Run Spark shell
Run Command Line as an administrator and type:
spark-shell
If things work well, spark-shell will print Spark's welcome banner and leave you at a scala> prompt, with output like this:
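The spark-shell prompt is a full Scala REPL, so any Scala expression works there. As a first smoke test you can paste an ordinary expression like the one below; it uses only the Scala standard library, so it behaves the same in spark-shell as in plain scala:

```scala
// Sum the integers 1 to 100 -- should print 5050
val total = (1 to 100).reduce(_ + _)
println(total)
```

Inside spark-shell, the same computation can also be distributed through the pre-created SparkContext `sc` (for example `sc.parallelize(1 to 100).reduce(_ + _)`), which makes a good first check that Spark itself is wired up correctly.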
6. Run a sample Spark application
Spark comes with various examples that you can run directly from the command line using this command:
run-example
Let's run a sample app that computes an approximation of pi:
run-example SparkPi
run-example SparkPi 10
We ran it twice: the first run uses the default number of partitions (2), and the second uses 10.
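For intuition, SparkPi estimates pi with the Monte Carlo method: throw random points into the unit square and count how many land inside the quarter circle. Below is a plain-Scala sketch of the same idea, no Spark needed; the function name and the fixed seed are my own choices, not part of the Spark example:

```scala
import scala.util.Random

// Estimate pi from `n` random points in the unit square:
// the fraction that falls inside the unit circle approaches pi/4
def estimatePi(n: Int, seed: Long = 42L): Double = {
  val rng = new Random(seed)
  val inside = (1 to n).count { _ =>
    val x = rng.nextDouble()
    val y = rng.nextDouble()
    x * x + y * y <= 1.0
  }
  4.0 * inside / n
}

println(estimatePi(1000000))  // prints a value close to pi
```

More points (or, in SparkPi, more partitions doing the sampling in parallel) give a better approximation, which is why the estimate is only roughly 3.14.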
Results: both runs print an estimated value of pi (a line like “Pi is roughly 3.14...”).
That's all! Thank you for reading this post, and I hope this simple guide helps you install Apache Spark on your own Windows machine.